Abstract
In this paper, we suggest a new technique for soft clustering of multidimensional data. It is based on a new convex voting model, where each voter chooses a party with a certain probability, depending on the divergence between his/her preferences and the position of the party. The parties can react to the results of the polls by changing their positions. We prove that under some natural assumptions this system has a unique fixed point, providing a unique solution for soft clustering. The solution of our model can be found either by imitation of the sequential elections or by direct minimization of a convex potential function. In both cases, the methods converge linearly to the solution. We provide our methods with worst-case complexity bounds. To the best of our knowledge, these are the first polynomial-time complexity results in this field.
1 Introduction
1.1 Motivation
Cluster analysis is one of the most important and natural tools in Data Mining. It was addressed already in the first steps of computer science; it is enough to mention the first publications from the middle of the twentieth century (see Forgy 1965; Lloyd 1957; MacQueen 1967; Steinhaus 1957). The first proposed algorithms had a combinatorial nature. Given a set of observations \(v_i \in {\mathbb {R}}^m, i = 1, \ldots , N\), and a pre-defined number of clusters K, we can define the total variance of the partition:
where \({\mathcal {S}} = \{ S_1, \ldots , S_{K} \}\) is a partition of the set \(\{1, \ldots , N \}\), and the center of the kth cluster is given by a least-squares estimate:
Thus, it looks natural to find the best clustering by solving the following problem:
The first combinatorial greedy method for minimizing the function \(\text{ Var }(\cdot )\) was proposed by Lloyd (1957) and reinvented by Forgy (1965). The algorithm stops when no further improvement of the objective function is possible. This approach is usually referred to as hard K-means clustering.
However, it was soon discovered in Garey et al. (1982) that problem (1.3) is NP-hard even for \(K = 2\) (see also Aloise et al. 2009; Dasgupta and Freund 2009; Kleinberg et al. 1998). In order to get rid of the combinatorial nature of problem (1.3), it was suggested in Dunn (1973) and Bezdek (1981) to use fuzzy (or soft) clustering. In this approach, the positions of the centers \(C = (c_1, \ldots , c_K)\) of the clusters become variables. At the same time, each element \(v_i\) participates in cluster k up to a certain degree \(p_i^{(k)}\). The smaller this membership value is, the lower the probability that item i is attributed to cluster k. Usually, the dependence of these values on the positions of the cluster centers is given by the following expressions:
where the parameter \(\sigma > 1\) is called the fuzzifier of the system. In the absence of reliable information on its reasonable value, the usual choice is \(\sigma = 2\).
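To make the membership rule above concrete, here is a minimal sketch of the standard fuzzy C-means membership computation (the function name and the Euclidean choice of distance are illustrative; this is a sketch of the classical rule, not code from the paper):

```python
import numpy as np

def cmeans_membership(v, centers, sigma=2.0):
    """Soft membership of one observation v in each cluster.

    Standard fuzzy C-means rule with fuzzifier sigma > 1:
    p^(k) is proportional to ||v - c_k||^(-2/(sigma-1)).
    Assumes v does not coincide with any center (d2 > 0).
    """
    d2 = np.array([float(np.dot(v - c, v - c)) for c in centers])
    w = d2 ** (-1.0 / (sigma - 1.0))   # inverse-distance weights
    return w / w.sum()

centers = [np.array([0.0, 0.0]), np.array([4.0, 0.0])]
p = cmeans_membership(np.array([1.0, 0.0]), centers, sigma=2.0)
# the observation is 1 away from the first center and 3 from the second,
# so p[0] > p[1]
```

With \(\sigma = 2\) the weights are simply inverse squared distances, which reproduces the usual choice mentioned above.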
Thus, in order to compute the so-called C-means soft clustering, we need to solve the following nonlinear minimization problem:
For \(\sigma > 1\), this can be done by standard algorithms of general nonlinear optimization. However, note that the function \(\text{ Var}_{\pi }(\cdot )\) can have a very complicated topological structure. So, the only possible theoretical guarantee for these algorithms is convergence to local minima, which are most probably not unique.
At this moment, there exist hundreds of different algorithms for computing hard and soft clusterings (see, for example, the surveys Bora and Gupta 2014; Peters et al. 2013; Xu and Wunsch 2005; Yang 1993). However, to the best of our knowledge, all of them are heuristic. None of them is supported by rigorous theoretical statements about its efficiency or the quality of the generated solutions. Such a situation does not look very surprising, since both variants of the clustering problem, (1.3) and (1.4), look computationally difficult in view of their non-convexity.
The main goal of this paper is to show that a small change in the problem setting makes our goal computationally tractable. In order to explain the origin of our approach, let us try to find some successful examples of clustering in big real-life systems. For our purposes, the most interesting one could be a proper understanding of electoral procedures in modern democratic states.
Since the political parties must reflect the aggregate preferences of a big group of voters, this is indeed a good example of clustering in real life, one which has already proved its efficiency by two hundred years of elections in the USA. Note that this procedure has an enormous stabilizing effect, which allows the country to recover quickly after external and internal shocks and to keep functioning in accordance with the interests and experience of the society.
In Operations Research, the behavior of a stable system is usually explained by the existence of a convex potential function, which must be minimized by the natural evolution of the system. Thus, one of the main goals of this paper is the construction of a convex model of sequential elections, which can help in soft clustering of voters, taking into account their internal preferences. In our model, a voter is characterized by a set of opinions (features), which can be compared with the positions of parties. The opinions of voters are fixed. However, the positions of parties may change in accordance with the observed opinions of attracted voters. At the same time, a party cannot go too far from its basic declarations (core values). For each voter, we compute the probabilities of selecting each party. These probabilities are adjusted after each round of consecutive elections. Thus, in our approach the probabilities provide a computational tool for approaching a good clustering.
Our main result is that under some natural behavioral assumptions, the electoral system is stable. This means that the sequential election procedure converges to a unique fixed point. Moreover, this convergence is fast (linear). Thus, we get a computationally efficient algorithm for computing soft clusters. It has the form of an alternating minimization method for a convex potential function. From the technical point of view, the main novelty of our approach consists in a complete elimination of the least-squares principle [compared with (1.1), (1.2), and (1.4)]. Instead, the membership values are computed, for example, as
where \(\mu > 0\) is a volatility coefficient. Note that the norms in this expression are not squared.
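A hedged sketch of this non-squared membership rule (the exponential form is the one confirmed later by the logit rule (3.2); the function name and the Euclidean choice of \(\rho \) are ours):

```python
import numpy as np

def membership(v, positions, mu=1.0):
    """Soft membership of voter opinion v w.r.t. party positions.

    Probabilities proportional to exp(-||x_k - v|| / mu); note that
    the norms are NOT squared. mu > 0 is the volatility coefficient.
    """
    rho = np.array([np.linalg.norm(np.asarray(v) - np.asarray(x))
                    for x in positions])
    z = np.exp(-(rho - rho.min()) / mu)  # shift for numerical stability
    return z / z.sum()

p = membership([0.0, 0.0], [[0.0, 0.0], [3.0, 0.0]], mu=1.0)
# p[0] > p[1]: the first party is closer to this voter
```

Subtracting the minimal distance before exponentiating does not change the probabilities but avoids underflow for small \(\mu \).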
1.2 Contents
The paper is organized as follows: In Sect. 2, we develop all necessary mathematical tools for analyzing the global convergence of an alternating minimization scheme as applied to a bivariate potential function with a multiplicative interaction term. We prove sufficient conditions for its convexity and establish a linear rate of convergence for alternating minimization.
In Sect. 3, we describe our electoral model and introduce the main behavioral assumptions for the voters and political parties. We define stable systems as systems for which the sequential election procedure has a unique fixed point. In Sect. 4, we justify the conditions for stability, using the bivariate potential function studied in Sect. 2. It is important that for measuring the divergence between the opinions of voters and the positions of the parties we can use an arbitrary Lipschitz continuous function. In any case, the rate of convergence of the whole process is linear.
In Sect. 5, we rewrite the stability condition of the system in terms of the individual tolerance of the party and discuss the computational cost of the proposed model. In the simplest case of measuring the divergence in the opinions by a usual norm, this cost is proportional to \(O({1 \over \epsilon ^2})\), where \(\epsilon \) is the desired accuracy of the computed solution. However, we can use smooth approximations of the norms, or distances based on Huber function. In this case, the computational cost of our model is reduced. It becomes proportional to \(\ln {1 \over \epsilon }\) multiplied by some factors dependent on the parameters of our model.
In Sect. 6, we analyze a clustering scheme based on a direct minimization of the bivariate potential. For doing that, we replace the probabilistic assignment rule (1.5) by a Euclidean projection onto the standard simplex. As a consequence, we can apply to our model the standard gradient methods of smooth convex optimization with a linear rate of convergence. Consequently, we get a worst-case estimate for the total computational expenses of the order of \(O(KmN \ln {1 \over \epsilon })\) arithmetic operations. This is the fastest clustering scheme in our paper. Recall that, to the best of our knowledge, these are the first complexity results in this field.
1.3 Notation and generalities
In this paper, we denote by \({\mathbb {E}}\) with an optional subscript a finite-dimensional real vector space, and by \({\mathbb {E}}^*\) the corresponding dual space. For a linear function \(s \in {\mathbb {E}}^*\), we denote its value at \(x \in {\mathbb {E}}\) by \(\langle s, x \rangle _{{\mathbb {E}}}\). If no ambiguity arises, the subscript is usually omitted. For any norm \(\Vert \cdot \Vert _{{\mathbb {E}}}\) in \({\mathbb {E}}\), we define its dual norm in the standard way,
which ensures the validity of the Cauchy–Schwarz inequality:
If \({\mathbb {E}}= {\mathbb {R}}^n\), the space of real column vectors \(x = (x^{(1)}, \ldots , x^{(n)})^{\mathrm{T}}\), then \({\mathbb {E}}^* = {\mathbb {R}}^n\) and
Notation \({\mathbb {R}}^n_+\) is used for the positive orthant:
and by \(\Delta _n\) we denote the standard simplex:
The standard notation is used for \(\ell _p\) norms:
with \(\Vert x \Vert _{\infty } = \max \limits _{1 \le i \le n} | x^{(i)} |\).
For function \(f(x), x \in {\mathbb {E}}\), we denote by \(\nabla f(x) \in {\mathbb {E}}^*\) its gradient at x. If f is a non-differentiable convex function, the same notation is used for its subgradient. If function \(f(\cdot , \cdot )\) has two arguments, notation \(\nabla _2 f(x,y)\) corresponds to its gradient with respect to variable y. Finally, for twice differentiable function f we denote by \(\nabla ^2 f(x)\) its Hessian at x. For x being fixed, this is a linear operator from \({\mathbb {E}}\) to \({\mathbb {E}}^*\).
Recall that function f is called strongly convex on convex set \(Q \subseteq {\mathbb {E}}\) with convexity parameter \(\sigma > 0\) if for all \(x, y \in Q\) and \(\alpha \in [0,1]\) we have
The most important property of a strongly convex function is that it attains a unique minimum on any closed convex set Q. Therefore, if \(x^* = \arg \min \nolimits _{x \in Q} f(x)\), then for any \(x \in Q\) and \(\alpha \in (0,1)\) we have
Dividing the difference of the right- and left-hand sides of this inequality by \((1-\alpha )\) and letting \(\alpha \) tend to one, we get
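For completeness, this step can be written out as follows (a reconstruction from the standard definition of strong convexity, with the notation and numbering of the surrounding text):

```latex
% Strong convexity (1.8) applied to (1-\alpha) x + \alpha x^*,
% combined with the optimality of x^*:
f(x^*) \;\le\; f\bigl((1-\alpha) x + \alpha x^*\bigr)
       \;\le\; (1-\alpha) f(x) + \alpha f(x^*)
              - \tfrac{\sigma }{2}\,\alpha (1-\alpha )\,\Vert x - x^* \Vert ^2 .
% The difference of the right- and left-hand sides is nonnegative;
% dividing it by (1-\alpha) and letting \alpha \to 1 yields (1.9):
f(x) \;\ge\; f(x^*) + \tfrac{\sigma }{2}\, \Vert x - x^* \Vert ^2,
\qquad \forall x \in Q .
```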
2 Bivariate potential with multiplicative interaction cost
Let \({\mathbb {E}}, {\mathbb {E}}_1\), and \({\mathbb {E}}_2\) be three finite-dimensional real vector spaces. Consider the following function of two variables:
where functions f and h are closed and convex on their domains. Our main structural assumption on function \(\Phi \) is that the interaction term g has a multiplicative form:
In Sect. 4, we will see an application example for a potential having exactly this structure.
The main goal of this section consists in developing natural conditions for convexity of function \(\Phi \) and analyzing the performance of alternating minimization approach for solving the corresponding optimization problems.
Denote by \(Q_1\) and \(Q_2\) the closed convex feasible sets in the spaces \({\mathbb {E}}_1\) and \({\mathbb {E}}_2\), respectively. And let \(Q = Q_1 \times Q_2\).
Assumption 1
For any \(x \in Q_1\) function \(g(x, \cdot ) = \langle G_1(x), G_2(\cdot ) \rangle \) is closed and convex on \(Q_2\), and for any \(p \in Q_2\) function \(g(\cdot ,p) = \langle G_1(\cdot ), G_2(p) \rangle \) is closed and convex on \(Q_1\).
In the remaining part of this section, we always assume that Assumption 1 is satisfied.
Example 1
Assumption 1 is valid, for example, for linear operators:
Another important example is \(g(x,p) = \langle G_1(x), p \rangle \), where \(p \in {\mathbb {R}}^n_+\) and all components of the vector function \(G_1(x)\) are convex.
Assumption 1 has the following important consequence.
Lemma 1
For any two points \(z_0 = (x_0,p_0)\) and \(z_1 = (x_1,p_1)\) from Q and any \(\alpha \in [0,1]\), we have
Proof
Denote \(z_{\alpha } = (x_{\alpha }, p_{\alpha }) = (1-\alpha )z_0 + \alpha z_1\). Then in view of Assumption 1 we have
On the other hand, by the same reason
Putting all inequalities together, we get (2.3). \(\square \)
Now we can justify a sufficient condition for convexity of the potential \(\Phi (\cdot ,\cdot )\).
Theorem 1
Let function f be strongly convex on \(Q_1\) with parameter \(\sigma _1\), and function h be strongly convex on \(Q_2\) with parameter \(\sigma _2\). Assume that the operators \(G_1(\cdot )\) and \(G_2(\cdot )\) are Lipschitz continuous:
If the constants \(L_1\) and \(L_2\) are small enough, namely, if
then function \(\Phi \) is convex on \(Q = Q_1 \times Q_2\). If inequality (2.5) is strict, then \(\Phi \) is strongly convex on Q.
Proof
Let us fix two points \(z_0 = (x_0,p_0)\) and \(z_1 = (x_1,p_1)\) from Q. Consider the intermediate point \(z_{\alpha } = (1-\alpha )z_0 + \alpha z_1\) with \(\alpha \in [0,1]\). Then, since \(p_{\alpha } \in Q_2\), we have
where \(\delta (z_0,z_1) = {\sigma _1 \over 2} \Vert x_1 - x_0 \Vert ^2 +{\sigma _2 \over 2} \Vert p_1 - p_0 \Vert ^2 + \langle G_1(x_1) -G_1(x_0), G_2(p_1) - G_2(p_0) \rangle \). Assuming now that \(L_1^2 L_2^2 \le (\sigma _1 - \epsilon )(\sigma _2-\epsilon )\) for some small \(\epsilon \ge 0\), we have
Thus, if \(\epsilon > 0\), then function \(\Phi \) is strongly convex in view of definition (1.8). If \(\epsilon = 0\), then \(\Phi \) is just a convex function. \(\square \)
A simple example with \(Q_1 = {\mathbb {R}}^n, Q_2 = {\mathbb {R}}^m\), and
shows that the condition (2.5) cannot be improved. In this example, \(L_1 = \Vert A \Vert \) and \(L_2 = 1\).
Consider the following minimization problem:
where the parameters of function \(\Phi \) satisfy a strict variant of condition (2.5). In this case, function \(\Phi \) is strongly convex. Consequently, there exists a unique solution
of the problem (2.6). Let us show that it can be found by the following alternating minimization scheme.
In order to analyze the convergence of this scheme, let us define the following operators:
In view of Assumption 1, the objective function in both optimization problems above is strongly convex. Hence, the points \(u(\cdot )\) and \(v(\cdot )\) are well defined.
Lemma 2
Let the conditions of Theorem 1 be satisfied. Then, for any \(x_1, x_2 \in Q_1\) we have
Similarly, for any \(p_1\) and \(p_2\) from \(Q_2\) we have
Proof
Indeed, point u(x) is the solution of the problem \(\min \nolimits _{p \in Q_2} \{ \langle G_1(x), G_2(p) \rangle + h(p) \}\) with strongly convex objective function. Therefore, in view of the property (1.9), we have
Adding these two inequalities, we get
Thus, in view of inequality (2.4), we get (2.8). The proof of Inequality (2.9) follows by the same argument. \(\square \)
The following theorem is a trivial consequence of Lemma 2. Define
Theorem 2
Let \(\lambda {\mathop {=}\limits ^{\mathrm {def}}}{L_1^2 L_2^2 \over \sigma _1 \sigma _2} < 1\). Then, \(T(\cdot )\) and \(S(\cdot )\) are contracting mappings:
Note that in terms of operators T and S the process (2.7) can be written in the following way:
Therefore, for any \(t \ge 0\), we have
A similar rate of convergence can be established for the sequence \(\{ p_k \}_{k \ge 0}\). We conclude that the process (2.7) converges linearly to the solution of problem (2.6).
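The contraction argument above can be illustrated on a small unconstrained quadratic instance of the potential (a sketch under our own illustrative choices, not the paper's clustering model): take \(f(x) = {\sigma _1 \over 2} \Vert x \Vert ^2\), \(h(p) = {\sigma _2 \over 2} \Vert p \Vert ^2\), \(G_1(x) = Ax\), \(G_2(p) = p\), so that \(L_1 = \Vert A \Vert \), \(L_2 = 1\), and \(\lambda = \Vert A \Vert ^2 / (\sigma _1 \sigma _2)\):

```python
import numpy as np

# Phi(x, p) = s1/2 ||x||^2 + <A x, p> + s2/2 ||p||^2.
# The matrix below is illustrative; its spectral norm is at most 0.6
# (bounded by sqrt of max column sum times max row sum).
A = np.array([[0.5, 0.1, 0.0],
              [0.0, 0.4, 0.1],
              [0.1, 0.0, 0.3]])
s1, s2 = 2.0, 2.0
lam = np.linalg.norm(A, 2) ** 2 / (s1 * s2)
assert lam < 1  # strict version of condition (2.5)

x = np.ones(3)
for _ in range(100):
    p = -A @ x / s2    # p_{t+1} = argmin_p { <A x_t, p> + h(p) }
    x = -A.T @ p / s1  # x_{t+1} = argmin_x { f(x) + <A x, p_{t+1}> }
# the unique minimizer of Phi here is (0, 0);
# the iterates approach it linearly with factor lam
```

Each inner `argmin` is available in closed form for this quadratic instance, which makes the linear contraction directly visible.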
Note that from the viewpoint of standard optimization theory, problem (2.6) has a possibly non-differentiable objective function with unbounded derivatives (this is allowed by Assumption 1). Its only favorable property is the strong convexity of the objective. However, this alone is not enough for justifying a linear rate of convergence of any standard optimization scheme.
In the next section, we consider an application example, where the equilibrium state of the system can be found by minimizing bivariate potential with a multiplicative interaction term.
3 Electoral model
Let us show how we can use the machinery developed in Sect. 2 for justifying a soft clustering model based on a stable electoral procedure. The basic elements of our model are interpreted as independent voters possessing some features (opinions). They must be attached to different clusters (political parties), which represent these features in the best possible way.
In our model, we have N independent voters and K political parties. Our main assumption is that the voting results are random. Voter i decides to vote for party k with probability \(p_i^{(k)}\). It is convenient to put these probabilities in a vector
These vectors can be unified in a matrix \(P = (p_1, \ldots , p_N) \in {\mathbb {R}}^{K \times N}\), which we call the voting matrix. At the beginning of the voting process, this matrix is unknown. However, it will be computed as an outcome of a sequence of consecutive elections.
Let us try to explain the results of elections by some quantitative parameters. We assume that an opinion of voter i can be described by m different real values (personal preferences), which we put in vector \(v_i \in {\mathcal {V}} \subseteq {\mathbb {R}}^m, i = 1, \ldots , N\), where \({\mathcal {V}}\) is a closed convex set (e.g., \({\mathcal {V}} = {\mathbb {R}}^m_+\)). These vectors are fixed during the whole history of consecutive elections.
At the same time, positions \(x_k \in {\mathcal {V}}\) of political parties are flexible, \(k = 1, \ldots , K\). After each round of elections, these values can be adjusted for better representing the positions of the voters closely attached to this party. It will be convenient to keep these vectors in a matrix \(X = (x_1, \ldots , x_{K}) \in {\mathbb {R}}^{m \times K}\).
Let us fix some distance function \(\rho (v,x) \ge 0\), which is used for measuring the distance between the opinion \(v \in {\mathcal {V}}\) of a voter and the current position \(x \in {\mathcal {V}}\) of a political party. In what follows, we always assume that for any fixed v the function \(\rho (v, \cdot )\) is convex. A natural choice for this function would be \(\rho (v,x) = \Vert x - v \Vert \), where \(\Vert \cdot \Vert \) is an arbitrary norm in \({\mathbb {R}}^m\). However, in Sect. 5, we will give more examples with some motivation for their use.
Clearly, the bigger the distance between v and x is, the smaller should be the probability that this party will be selected by this particular voter. In our electoral model, we make this decision using the discrete choice probabilities of the logit model (see, e.g., Anderson et al. 1992).
Assumption 2
Voter i selects the kth party with probability
where \(\mu \ge 0\) is the flexibility parameter, which represents the volatility of opinions of voters.
Denote by \(P_*(X)= (p_1(X), \ldots , p_N(X))\) the corresponding voting matrix.
In Assumption 2, the value \(\mu = 0\) corresponds to the deterministic choice: the voter always chooses the party closest to his/her opinion. However, usually this parameter is strictly positive.
It is important that the probability vector \(p_i(X)\) has an optimization interpretation. Consider the entropy function
It is easy to check (see, for example, Lemma 4 in Nesterov 2005a) that
where \(g_i(X) = (\rho (v_i,x_1), \ldots , \rho (v_i,x_{K}))^{\mathrm{T}}\). Note that function \(\eta (\cdot )\) is strongly convex on the standard simplex. Indeed, for any \(p \in \mathrm{int \,}\Delta _{K}\) and \(h \in {\mathbb {R}}^{K}\) we have
At the same time,
Thus, the entropy function is strongly convex on \(\Delta _{K}\) in \(\ell _1\)-norm with convexity parameter one.
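The optimization interpretation (3.3) of the logit probabilities can be spot-checked numerically; a minimal sketch (function names ours, test vector illustrative):

```python
import numpy as np

def eta(p):
    """Entropy function eta(p) = sum_k p^(k) ln p^(k)."""
    return float(np.sum(p * np.log(p)))

def p_star(g, mu):
    """Closed-form minimizer of <g, p> + mu * eta(p) over the simplex,
    i.e. the logit probabilities: p^(k) proportional to exp(-g^(k)/mu)."""
    z = np.exp(-(g - g.min()) / mu)
    return z / z.sum()

g = np.array([1.0, 2.0, 0.5])   # distances rho(v_i, x_k), illustrative
mu = 0.7
obj = lambda p: float(g @ p) + mu * eta(p)

# the logit vector should beat random points of the simplex
rng = np.random.default_rng(1)
best_random = min(obj(rng.dirichlet(np.ones(3))) for _ in range(2000))
```

The closed form follows from the Lagrangian condition \(g^{(k)} + \mu (\ln p^{(k)} + 1) + \nu = 0\), which gives \(p^{(k)} \propto \exp (-g^{(k)}/\mu )\).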
It remains to describe the behavior of political parties. Each party is able to modify its position in order to attract the maximal number of voters. However, they should not go too far from their core values, which we denote by \(c_k \in {\mathcal {V}}, k=1, \ldots , K\). In order to measure the distance from the current position of the party and its core value, we introduce a prox-function d(x, y). It must satisfy the following conditions:
-
\(d(x,y) \ge 0\) for all \(x,y \in {\mathcal {V}}\).
-
For each \(x \in {\mathcal {V}}\), function \(d(x,\cdot )\) is strongly convex in the second argument with convexity parameter one:
$$\begin{aligned} d(x,v)\ge & {} d(x,y) + \langle \nabla _2 d(x,y), v - y \rangle +{1\over 2} \Vert v - y \Vert ^2,\nonumber \\&\forall v,y \in {\mathcal {V}}, \end{aligned}$$(3.4)
where \(\Vert \cdot \Vert \) is an arbitrary norm in \({\mathbb {R}}^m\).
Let us give several examples of the most important prox-functions.
-
Kullback–Leibler divergence
$$\begin{aligned} d_1(x,y)= & {} \eta (y) - \eta (x) - \langle \nabla \eta (x), y - x \rangle \\= & {} \sum \limits _{k=1}^{K} y^{(k)} \ln {y^{(k)} \over x^{(k)}}, \quad x, y \in \Delta _{K}. \end{aligned}$$This function is strongly convex in y in \(\ell _1\)-norm.
-
Euclidean distance \(d_2(x,y) = {1\over 2} \Vert x - y \Vert ^2_2, x, y \in {\mathbb {R}}^{K}\). This function is strongly convex in \(\ell _2\)-norm.
-
We can take \(\tilde{d}_i(x,y) = d_i(x,y) + \epsilon \Vert x -y \Vert _i\) with \(\epsilon \ge 0, i =1, 2\). The additional linear term gives a party more chances to keep its core values unchanged.
The behavior of political parties is described by the following assumption.
Assumption 3
For a given voting matrix \(P = ( p_1, \ldots , p_{N} ) \in \Delta _{K}^N\), each political party chooses its optimal current position by minimizing the function
in \(x_k \in {\mathcal {V}}\).
In this definition, \(\tau > 0\) is a group tolerance parameter. The objective function (3.5) has the interpretation of the expected distance between the opinions of all attracted voters and the current position of the kth party, augmented by the discrepancy with its core values.
Since \(\psi _k(P, \cdot )\) is strongly convex, it has a unique minimum \(x^*_k(P)\) over \({\mathcal {V}}\). Denote \(X_*(P) = (x_1^*(P), \ldots , x^*_{K}(P))\).
Consider now the following process of sequential elections:
The interpretation of process (3.6) is straightforward:
Given the current positions of political parties \(X_t\), voters announce their preferences \(P_{t+1}\) during the electoral poll. After observing the results, parties update their positions \(X_{t+1}\) for the next elections.
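The whole process (3.6) can be sketched numerically as follows. This is a hedged illustration under our own choices (smoothed Euclidean distance for \(\rho \), \(d(c,x) = {1\over 2}\Vert x - c\Vert _2^2\), inner problems (3.5) solved approximately by plain gradient descent; all names and parameter values are ours, with \(\tau < \mu \) so that the stability condition of Sect. 5 holds):

```python
import numpy as np

def elections_clustering(V, C, mu=0.5, tau=0.2, delta=0.1, rounds=50):
    """Sketch of the sequential elections (3.6) for soft clustering.

    V: (N, m) voter opinions; C: (K, m) party core values.
    rho(v, x) = sqrt(delta^2 + ||x - v||_2^2), d(c, x) = 0.5 ||x - c||^2.
    """
    X = C.astype(float).copy()
    step = 1.0 / (1.0 / delta + 1.0 / tau)  # ~ 1/L for the inner problem
    for _ in range(rounds):
        # voters: logit probabilities (3.2) from distances to parties
        R = np.sqrt(delta**2 + ((V[:, None, :] - X[None, :, :])**2).sum(-1))
        Z = np.exp(-(R - R.min(axis=1, keepdims=True)) / mu)
        P = Z / Z.sum(axis=1, keepdims=True)        # (N, K) voting matrix
        # parties: approximate minimization of psi_k(P, .) from (3.5)
        for k in range(X.shape[0]):
            x = X[k]
            for _ in range(200):
                diff = x - V
                r = np.sqrt(delta**2 + (diff**2).sum(-1))
                grad = (P[:, [k]] * diff / r[:, None]).mean(0) \
                       + (x - C[k]) / tau
                x = x - step * grad
            X[k] = x
    return P, X

V = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.0, 5.1], [4.9, 5.0]])
C = np.array([[0.0, 0.0], [5.0, 5.0]])
P, X = elections_clustering(V, C)
# each row of P sums to one; voters near a core value attach to that party
```

With two well-separated groups of voters, the voting matrix P quickly concentrates, while the party positions stay anchored near their core values by the prox-term.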
Definition 1
An electoral system is called stable if the process (3.6) has a unique limiting point, which is independent of the starting position \(X_0 \in {\mathcal {V}}^{K}\).
In the next section, we derive some sufficient conditions of the electoral stability. Note that in our model we have two tolerance parameters, \(\mu \) and \(\tau \).
4 Stable electoral systems
Let us describe the electoral process (3.6) using the framework of Sect. 2. Denote by \({\mathbb {E}}_1 = {\mathbb {R}}^{m \times K}\) the space of positions \(X = (x_1, \ldots , x_{K})\) of the parties. Then, \(Q_1 = {\mathcal {V}}^{K}\). For the voting matrix \(P=(p_1, \ldots , p_N)\), we introduce the space \({\mathbb {E}}_2 = {\mathbb {R}}^{K \times N}\) and the feasible set \(Q_2 = \Delta _{K}^N\). For the interaction term, we have \({\mathbb {E}}= {\mathbb {E}}_2, G_2(P) \equiv P\), and \(g(X,P)= \langle G_1(X), P \rangle _{{\mathbb {E}}_2}\), where the distance matrix
is component-wise convex in X. Since all matrices \(P \in Q_2\) have nonnegative elements, the function \(g(\cdot ,\cdot )\) satisfies Assumption 1.
In view of the rules (3.2) and (3.5), the electoral process (3.6) can be seen as alternating minimization scheme (2.7) being applied to the following bivariate potential
Now we need to choose appropriate norms in \({\mathbb {E}}_1\) and \({\mathbb {E}}_2\), which ensure good values of convexity parameters for function f and h. For \(X \in Q_1\) and direction \(U \in {\mathbb {E}}_1\), we have
Thus, the natural choice of the norm for \({\mathbb {E}}_1\) is as follows:
In this case, the convexity parameter of function f is one, and we can choose \(\sigma _1 = {1 \over \tau }\).
For \(P \in Q_2\) and direction \(H \in {\mathbb {E}}_2\), we have
Therefore, the natural choice for the norm in \({\mathbb {E}}_2\) is
In this case, function h has convexity parameter one and we can choose \(\sigma _2 = \mu \).
It remains to estimate the Lipschitz constants for inequality (2.4). In our case, \(L_2 = 1\). For estimating \(L_1\), we need to compute first the dual norm \(\Vert \cdot \Vert ^*_{{\mathbb {E}}_2}\). For arbitrary \(S \in {\mathbb {R}}^{K \times N}\), we have
Thus, we get the following dual norm:
Now we can get the Lipschitz constant of the mapping \(G_1(\cdot )\). For that, we need one more assumption.
Assumption 4
For any \(v \in {\mathcal {V}}\), function \(\rho (v,\cdot )\) is Lipschitz continuous in its second argument with constant one.
Then, for X and Y from \(Q_1\), we have
Note that
Therefore, in view of Assumption 4,
Hence,
Thus, mapping \(G_1\) is Lipschitz continuous with constant \(L_1=N^{1/2}\). Now, using Theorem 2, we come to the following statement.
Theorem 3
Let the behavior of voters and political parties satisfy Assumptions 2, 3 with tolerance parameters
Then, the corresponding electoral system is stable. Moreover, for the stationary voting matrix \(P_* = \lim \nolimits _{t \rightarrow \infty } P_t\), where matrices \(\{P_t \}\) are generated by the process (3.6), we have
Proof
Note that condition (4.3) is a strict version of condition (2.5). In terms of Theorem 2, we have \(P_{t+1} = S(P_t)\) and
Note that for any \(p_1, p_2 \in \Delta _{K}\) we have \(\Vert p_1 - p_2 \Vert _1 \le 2\). Therefore, for any \(P \in Q_2\) it holds that \(\Vert P - P_* \Vert \le 2 N^{1/2}\). Thus, by Theorem 2, for all \(t \ge 1\) we get \(\Vert P_{t} - P_* \Vert \le 2 \lambda ^t N^{1/2}\). \(\square \)
5 Computational aspects of alternating minimization
Let us find an interpretation of the stability condition (4.3). For that, we introduce the individual tolerance parameter
Then, the optimization problem (3.5), defining the party's response to the voting probabilities, can be rewritten as
This means that each party reacts to the average opinion of the attracted voters. With this notation, the stability condition \(\lambda < 1\) can be rewritten as
In view of its importance, we rewrite this condition in form of a theorem.
Theorem 4
An electoral system is stable if the individual tolerance of the parties is smaller than the volatility of the voters.
This condition looks indeed natural for guaranteeing stability of electoral system: parties must be more conservative in changing their opinions than the voters. In other words, an excessive populism of parties may cause political instability.
Let us discuss now the computational complexity of the electoral process. For simplicity, we assume that the set \({\mathcal {V}}\) is bounded: \(D {\mathop {=}\limits ^{\mathrm {def}}}\max \nolimits _{x, y \in {\mathcal {V}}} \Vert x - y \Vert <\infty \). Each step of the recurrence (3.6) consists of two time-consuming operations.
-
Computation of the optimal voting matrix \(P_*(X)\). For that, we first need to compute the distance matrix \(G_1(X)\). If the cost of computing the value \(\rho (v,x)\) is O(m) operations, then the distance matrix can be computed in O(KmN) operations. After that, we can apply the rule (3.1), which is cheaper: O(KN) operations. Thus, in total we need
$$\begin{aligned} T_1 = O(K m N) \end{aligned}$$operations.
-
Computation of the new positions of the parties \(X_*(P)\). For that, we need to solve K optimization problems (5.1). Note that even a single evaluation of the objective functions of all these problems needs O(KmN) operations; therefore, this step dominates the complexity of our approach. The complexity of problem (5.1) depends on the efficiency of the applied optimization schemes, and their potential efficiency depends on the properties of the function \(\rho (v,\cdot )\). Let us discuss these aspects in more detail. Note that the objective function of this problem is strongly convex with parameter
$$\begin{aligned} \hat{\sigma } = {1 \over \hat{\tau }}. \end{aligned}$$
In our approach, in view of Assumption 4, we cannot follow the standard advice to use
Recall that this choice is mainly motivated by an extremely low cost of solving problem (5.1) (only O(mN) operations). However, we pay for that with non-convexity of the potential, the existence of many local minima, and, most probably, NP-hardness of finding the global solution.
Thus, we need to choose a Lipschitz continuous distance function. Let us look at the most reasonable variants.
-
1.
Arbitrary norm \(\rho _1(v,x) = \Vert v - x \Vert \). This function clearly satisfies Assumption 4. With this choice, problem (5.1) becomes a problem of convex optimization with non-differentiable strongly convex objective function. From the general theory, we know that \(\epsilon \)-solution of this problem can be found in
$$\begin{aligned} O\left( {\hat{\tau } \over \epsilon } M^2 \right) \end{aligned}$$(5.3)iterations, where M is an upper bound for the norms of subgradients of function \(\hat{\psi }_k(P,\cdot )\), and \(\epsilon \) is the required accuracy of solving problem (5.1). Since the objective function of this problem has implicit structure, we can apply the smoothing technique (see Nesterov 2005a, b) for finding an \(\epsilon \)-solution of this problem in \(O\left( \sqrt{\hat{\tau } \over \epsilon } M \right) \) iterations. This bound is much better than (5.3). In order to estimate the total computational time, we need to multiply this bound by K, the number of parties, and by mN, the complexity of computing the value and subgradient of function \(\hat{\psi }_k(P,\cdot )\). Thus, we get the following estimate for the computational time:
$$\begin{aligned} T_2 = \sqrt{\hat{\tau } \over \epsilon } M \cdot K m N. \end{aligned}$$Note that this bound depends on two parameters, which can potentially be big. The accuracy \(\epsilon \) must be chosen proportionally to the required accuracy of approximation of the matrix \(P_*\). Thus, the impact of the term \({1 \over \sqrt{\epsilon }}\) can be significant. The constant M can be estimated from above as follows:
$$\begin{aligned} M&= \max \limits _{x \in {\mathcal {V}}} \Vert \nabla _2 \hat{\psi }_k(P,x)\Vert ^* \\&{\mathop {\le }\limits ^{(5.1)}} \; {1 \over N} \sum \limits _{i=1}^N p_i^{(k)} + {1 \over \hat{\tau }} \max \limits _{x \in {\mathcal {V}}} \Vert \nabla _2d(c_k,x) \Vert ^*\\&\le 1 + {1 \over \hat{\tau }} \max \limits _{x \in {\mathcal {V}}} \Vert \nabla _2d(c_k,x) \Vert ^*. \end{aligned}$$Thus, M can be big if the diameter D is big.
-
2.
Smoothed Euclidean norm \(\rho _2(v,x) = \sqrt{\delta ^2 + \Vert v - x \Vert ^2_2}\), where \(\delta > 0\) is a smoothing parameter. In this case, it is natural to choose \(d(c,x) ={1\over 2} \Vert x - c \Vert _2^2\). Note that \(\rho _2(v,\cdot )\) is infinitely times differentiable in its second argument. Moreover,
$$\begin{aligned} \nabla _2\rho (v,x)&= {x - v \over [\delta ^2 + \Vert x-v \Vert ^2_2]^{1/2}}, \quad \Vert \nabla _2 \rho (v,x) \Vert _2 \; \le \; 1,\nonumber \\ \nabla ^2_2 \rho (v,x)&= {I \over [\delta ^2 + \Vert x-v \Vert ^2_2]^{1/2}} -{(x - v)(x-v)^{\mathrm{T}} \over [\delta ^2 + \Vert x-v \Vert ^2_2]^{3/2}} \; \preceq \; {1 \over \delta } I. \end{aligned}$$(5.4)Thus, function \(\rho _2(v,\cdot )\) satisfies Assumption 4, and it has Lipschitz continuous gradient with constant \({1 \over \delta }\). Consequently, function \(\hat{\psi }_k(c_k,\cdot )\) has Lipschitz continuous gradient with the constant \(L = {1 \over \delta } + {1 \over \hat{\tau }}\). Since this function is strongly convex with parameter \({1 \over \hat{\tau }}\), we conclude that its condition number is equal to
$$\begin{aligned} \kappa = L/ \hat{\sigma } \; = \; 1 + {\hat{\tau } \over \delta }, \end{aligned}$$where \(\hat{\sigma } = {1 \over \hat{\tau }}\) is the strong convexity parameter. Thus, problem (5.1) can be solved by the usual gradient method in \(O\left( \kappa \ln {LD^2 \over \epsilon } \right) \) iterations. For the fast gradient methods, we get the bound of \(O\left( \kappa ^{1/2} \ln {LD^2 \over \epsilon } \right) \) iterations (see Sections 2.1 and 2.2 in Nesterov 2004). For the latter case, we get the following estimate of the computational time:
$$\begin{aligned} T_3 = O\left( \sqrt{ 1 + {\hat{\tau } \over \delta } } \ln {LD^2 \over \epsilon } \cdot K m N \right) . \end{aligned}$$(5.5)Clearly, this estimate is much better than \(T_2\).
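The bounds (5.4) are easy to verify numerically. Below is a minimal Python sketch (the helper names are ours, not from the paper) checking on random points that the gradient of \(\rho _2(v,\cdot )\) has norm at most one and is Lipschitz continuous with constant \(1/\delta \):

```python
import numpy as np

def rho2(v, x, delta):
    """Smoothed Euclidean norm rho_2(v, x) = sqrt(delta^2 + ||v - x||_2^2)."""
    return np.sqrt(delta**2 + np.dot(x - v, x - v))

def grad_rho2(v, x, delta):
    """Gradient in the second argument; its norm is bounded by 1, see (5.4)."""
    return (x - v) / rho2(v, x, delta)

rng = np.random.default_rng(0)
delta = 0.1
for _ in range(100):
    v, x, y = rng.normal(size=(3, 5))
    g_x, g_y = grad_rho2(v, x, delta), grad_rho2(v, y, delta)
    assert np.linalg.norm(g_x) <= 1.0 + 1e-12                 # |grad| <= 1
    # Lipschitz continuity of the gradient with constant 1/delta:
    assert np.linalg.norm(g_x - g_y) <= np.linalg.norm(x - y) / delta + 1e-9
```

The second assertion is exactly the consequence of the Hessian bound \(\nabla ^2_2 \rho _2(v,x) \preceq {1 \over \delta } I\) in (5.4).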
Note that the Euclidean norm can be smoothed in many different ways. From the statistical point of view, it is sometimes reasonable to apply the Huber function, which is used in robust regression for better resistance to outliers:
$$\begin{aligned} \rho _H(v,x)= & {} \chi _{\delta }(\Vert x - v \Vert _2), \\ \chi _{\delta }(\tau )= & {} \left\{ \begin{array}{rl} \tau - {1\over 2} \delta , &{} \text{ if }\ \tau >\delta , \\ {1 \over 2 \delta } \tau ^2, &{} \text{ otherwise, } \end{array} \right. \quad \tau \ge 0. \end{aligned}$$It has the same properties as function \(\rho _2(v,\cdot )\) with the same estimate \(T_3\) for the computational time of one iteration of the process (3.6).
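The Huber smoothing above can be sketched as follows (the names `chi` and `rho_h` are ours); the check confirms that the two branches of \(\chi _{\delta }\) meet at \(\tau = \delta \), so \(\rho _H(v,\cdot )\) is continuously differentiable, like \(\rho _2(v,\cdot )\):

```python
import numpy as np

def chi(tau, delta):
    """Huber function chi_delta: quadratic for tau <= delta, linear beyond."""
    return tau - 0.5 * delta if tau > delta else tau * tau / (2.0 * delta)

def rho_h(v, x, delta):
    """Huber distance rho_H(v, x) = chi_delta(||x - v||_2)."""
    return chi(np.linalg.norm(np.asarray(x) - np.asarray(v)), delta)

delta = 0.5
# The two branches meet at tau = delta with the common value delta/2:
assert abs(chi(delta, delta) - 0.5 * delta) < 1e-12
# Beyond delta the growth is linear, so the gradient norm stays at most 1:
assert abs(chi(2.0, delta) - (2.0 - 0.25)) < 1e-12
```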
Recall that, to estimate the total time necessary for procedure (3.6) to converge to an approximate solution of problem
we need to multiply the bounds \(T_i\) by an estimate for the total number of steps of the scheme. In view of Theorem 3, in order to get \(\Vert P_k - P_* \Vert _{{\mathbb {E}}_2} \le \epsilon _P\), we need at most
iterations of the process (3.6).
6 Simplified soft clustering
In the previous section, we derived complexity bounds for the iterative process (3.6), which has an interpretation as a sequential election procedure. However, the best complexity estimates (5.5) and (5.7) are still quite heavy. One possibility for improving the complexity bounds consists in considering direct methods for solving the minimization problem (5.6). Note that in its current form this problem is not simple. The reason is that function \(h(\cdot )\) in (4.1) has unbounded derivatives at the boundary of the feasible set. Therefore, in this section we study the possibility of replacing it by a squared Euclidean norm. Of course, in this case we lose the probabilistic interpretation of the soft assignment (3.1). However, from the viewpoint of clustering, the new potential is still useful for computing the attachment coefficients \(p_i^{(k)}(X)\).
Consider the following minimization problem:
where \(f(X) = {1\over 2} \sum \nolimits _{k=1}^{K} \Vert x_k - c_k \Vert ^2_2, h(P) = {1\over 2} \sum \nolimits _{i=1}^N \Vert p_i \Vert ^2_2\), andFootnote 6
This potential has three positive parameters \(\hat{\tau }, \mu \) and \(\delta \) (see Item 2 above), which we assume to satisfy the following relations:
Let us prove that under these conditions the potential \(\Phi _2(\cdot ,\cdot )\) is a strongly convex function with Lipschitz continuous gradient.
Let us fix a direction \(Z = (U,H) \in {\mathbb {E}}= {\mathbb {E}}_1 \times {\mathbb {E}}_2\) and choose
We need to estimate from above and below the second derivative of the potential \(\Phi _2(X,P)\) along direction Z. Note thatFootnote 7
Since \(P \ge 0\), we have
At the same time,
Thus,
On the other hand, since \(P \ge 0\) and function \(\rho _2(v_i, \cdot )\) is convex, we have
In view of (6.2), we can choose in \({\mathbb {E}}\) the following Euclidean norm:
As we have seen, with respect to this norm, potential \(\Phi _2\) is strongly convex on \({\mathbb {E}}\) with convexity parameter one. On the other hand, it has a Lipschitz continuous gradient with constant
Consequently, the condition number of this function is \(\kappa _2 =L_2\).
If there is no other reason for choosing the parameter \(\delta \), satisfying condition (6.2), we can try to use this degree of freedom for making the value \(\kappa _2\) as small as possible. For that, it is convenient to introduce another representation of our parameters. Let us choose two factors \(\gamma \) and \(\gamma _1\) from (0, 1) and define
Then, \({\delta + 2 \hat{\tau } \over \delta - \hat{\tau }} = {\mu +\delta \over \mu - \delta }\) if and only if \({1 + 2 \gamma _1 \over 1 - \gamma _1} = {1+\gamma \over 1 - \gamma }\). Thus, we can choose \(\gamma _1 = {2 \gamma \over 3 - \gamma }\). Then,
Thus, under the choice of parameters (6.8), problem (6.1) is very easy for the gradient methods. It can be solved by the usual gradient method up to accuracy \(\hat{\epsilon }\) in the function value in
iterations, where \(\delta _f\) is the initial residual in function value. If we apply the fast gradient methods, the efficiency estimate is even better:
(see Section 2 in Nesterov 2004). Note that the cost of each iteration of such schemes is O(KmN) operations. Thus, we indeed obtain very efficient clustering procedures.
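As an illustration of the fast gradient methods mentioned above, here is a minimal sketch of the constant-momentum variant for strongly convex functions with Lipschitz continuous gradient (cf. Sect. 2.2 in Nesterov 2004); the function and parameter names are ours, and the toy quadratic only demonstrates the linear convergence rate:

```python
import numpy as np

def fast_gradient(grad, x0, L, sigma, n_iter):
    """Fast gradient method for a sigma-strongly convex function with an
    L-Lipschitz gradient (constant-momentum variant)."""
    kappa = L / sigma
    beta = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)  # momentum coefficient
    x = y = x0.astype(float)
    for _ in range(n_iter):
        x_next = y - grad(y) / L          # gradient step from the extrapolated point
        y = x_next + beta * (x_next - x)  # momentum extrapolation
        x = x_next
    return x

# Toy strongly convex quadratic f(x) = 0.5 x^T A x, minimizer at the origin.
A = np.diag([1.0, 10.0])
x_star = fast_gradient(lambda z: A @ z, np.array([5.0, 5.0]), L=10.0, sigma=1.0, n_iter=100)
assert np.linalg.norm(x_star) < 1e-6
```

The iteration count scales as \(O(\kappa ^{1/2} \ln {1 \over \hat{\epsilon }})\), which is what makes problem (6.1) with the parameter choice (6.8) so cheap to solve.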
Let us look at the rule of forming the soft membership vectors \(p_i(X) \in \Delta _{K}, i = 1, \ldots N\), imposed by the potential \(\Phi _2\). For that, given the position matrix X, we need to minimize the potential \(\Phi _2(X,\cdot )\):
This means that for each element \(i, 1 \le i \le N\), we need to solve the problem
where \(g_i^{(k)}(\bar{X}) = \rho _2(v_i,\bar{x}_k), k = 1, \ldots , K\). Note that the objective function in the latter problem is as follows:
Thus, the solution of this problem is a Euclidean projection of the point
onto the standard simplex \(\Delta _{K}\). This point can be found in \(O(K \ln K)\) operations since we need to order the entries of vector \(\tilde{p}_i\). If K is not big, the complexity of straightforward ordering \(O(K^2)\) is also acceptable. Note that for \(\mu \rightarrow 0\) the rule (6.9) converges to the deterministic choice of the entry of vector \(g_i(X)\in {\mathbb {R}}^{K}\) with the minimal value.
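The sorting-based Euclidean projection onto \(\Delta _K\) mentioned above can be sketched as follows (the name `project_simplex` is illustrative, not from the paper):

```python
import numpy as np

def project_simplex(p):
    """Euclidean projection of p onto the standard simplex {q >= 0, sum q = 1},
    via the standard O(K log K) sorting-based algorithm."""
    u = np.sort(p)[::-1]             # entries in decreasing order
    css = np.cumsum(u)
    ks = np.arange(1, len(p) + 1)
    # Largest (1-based) index k with u_k + (1 - css_k)/k > 0; the condition
    # holds for a prefix of indices, so the last True entry is the answer.
    cond = u + (1.0 - css) / ks > 0
    k = ks[cond][-1]
    lam = (1.0 - css[k - 1]) / k     # shift that makes the positive part sum to 1
    return np.maximum(p + lam, 0.0)

q = project_simplex(np.array([0.9, 0.6, -0.2]))
assert abs(q.sum() - 1.0) < 1e-12 and (q >= 0).all()
```

For a vector with one dominant entry, the projection concentrates all mass there, matching the deterministic limit of rule (6.9) as \(\mu \rightarrow 0\).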
Notes
This is not the first time that nature has helped in finding an efficient optimization algorithm. Maybe the most striking example is the derivation of Dijkstra's shortest-path algorithm from Fermat's least-time principle of the propagation of light. However, we cannot discuss this interesting topic in detail here.
An indirect proof of this conjecture is given by economic domination of democratic states in the modern world. At the same time, we can present a mathematically correct sufficient condition for stability of democratic elections (see Theorem 4).
Note that our theory does not work for the standard choice \(\rho (v,x) = {1\over 2} \Vert x - v \Vert _2^2\).
Later on, in Sect. 5, we will replace this parameter by the individual tolerance. Note that the group tolerance must be smaller than the individual one in order to withstand the cumulative opinion of a group.
It is possible to relate these two accuracies by a rigorous mathematical condition. But its justification is quite technical and we decided to drop it in this paper.
All our further conclusions are also valid for Huber distance function \(\rho _H(\cdot ,\cdot )\).
For the interaction term of the potential \(\Phi _2\), we use the standard differentiation rule
$$\begin{aligned} (g(x) p)'' = g''(x) (x')^2 p + 2 g'(x) x' p' + g(x) p'', \end{aligned}$$taking into account that \(x' = u, p' = h\) and \(x''=0, p'' =0\).
References
Aloise D, Deshpande A, Hansen P, Popat P (2009) NP-hardness of Euclidean sum-of-squares clustering. Mach Learn 75(2):245–249
Anderson SP, de Palma A, Thisse J-F (1992) Discrete choice theory of product differentiation. MIT Press, Cambridge
Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Plenum Press, New York
Bora DJ, Gupta AK (2014) A comparative study between fuzzy clustering algorithm and hard clustering algorithm. Int J Comput Trends Technol 10(2):108–113
Dasgupta S, Freund Y (2009) Random projection trees for vector quantization. IEEE Trans Inf Theory 55(7):3229–3242
Dunn JC (1973) A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. J Cybern 3(3):32–57
Forgy EW (1965) Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics 21:768–769
Garey M, Johnson D, Witsenhausen H (1982) The complexity of the generalized Lloyd-max problem. IEEE Trans Inf Theory 28(2):255–256
Kleinberg J, Papadimitriou C, Raghavan P (1998) A microeconomic view of data mining. Data Min Knowl Disc 2(4):311–324
Lloyd SP (1957) Least squares quantization in PCM. Bell Telephone Laboratories Press, Murray Hill
MacQueen JC (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of 5th Berkeley symposium on mathematical statistics and probability, vol 1. University of California Press, pp 281–291
Nesterov Y (2004) Introductory lectures on convex optimization. Kluwer, Boston
Nesterov Y (2005) Smooth minimization of non-smooth functions. Math Program 103(1):127–152
Nesterov Y (2005) Excessive gap technique in nonsmooth convex minimization. SIAM J Optim 16(1):235–249
Peters G, Crespo F, Lingras P, Weber R (2013) Soft clustering—fuzzy and rough approaches and their extensions and derivatives. Int J Approx Reason 54:317–322
Steinhaus H (1957) Sur la division des corps matériels en parties. Bull Acad Polon Sci 4(12):801–804 (in French)
Xu R, Wunsch D (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16(3):645–678
Yang M-S (1993) A survey on fuzzy clustering. Math Comput Model 18(11):1–16
Acknowledgements
The author would like to thank three anonymous referees for very useful comments and suggestions.
Funding
This study was funded by the Advanced Grant 788368 of the European Research Council.
Ethics declarations
Conflict of interest
The author declares that he has no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Communicated by Yaroslav D. Sergeyev.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Nesterov, Y. Soft clustering by convex electoral model. Soft Comput 24, 17609–17620 (2020). https://doi.org/10.1007/s00500-020-05148-4