1 Introduction

1.1 Motivation

Cluster analysis is one of the most important and natural tools in Data Mining. It was addressed already at the very first steps of computer science; it is enough to mention the publications from the middle of the twentieth century (see Forgy 1965; Lloyd 1957; MacQueen 1967; Steinhaus 1957). The first proposed algorithms had a combinatorial nature. Given a set of observations \(v_i \in {\mathbb {R}}^m, i = 1, \ldots , N\), and a pre-defined number of clusters K, we can define the total variance of a partition:

$$\begin{aligned} \text{ Var }({\mathcal {S}}) {\mathop {=}\limits ^{\mathrm {def}}}\sum \limits _{k=1}^{K} \quad \sum \limits _{i \in S_k} \Vert v_i - c_k \Vert ^2_2, \end{aligned}$$
(1.1)

where \({\mathcal {S}} = \{ S_1, \ldots , S_{K}\}\) is a partition of the set \(\{1, \ldots , N \}\), and the center of the kth cluster is given by the least-squares estimate:

$$\begin{aligned} c_k = \arg \min \limits _{c \in {\mathbb {R}}^m} \sum \limits _{i \in S_k} \Vert v_i - c \Vert ^2_2 \; = \; {1 \over | S_k|} \sum \limits _{i \in S_k} v_i , \quad k = 1, \ldots , K. \end{aligned}$$
(1.2)

Thus, it looks natural to find the best clustering by solving the following problem:

$$\begin{aligned} \min \limits _{\mathcal {S}} \; \text{ Var }({\mathcal {S}}). \end{aligned}$$
(1.3)

The first combinatorial greedy method for minimizing the function \(\text{ Var }(\cdot )\) was proposed by Lloyd (1957) and reinvented by Forgy (1965). The algorithm stops when no further improvement of the objective function is possible. This approach is usually referred to as hard K-means clustering.
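
For the reader's convenience, here is a minimal NumPy sketch of this greedy scheme (hard K-means); the initialization by random sampling and the function name are illustrative choices, not part of the original algorithms.

import numpy as np

def lloyd_kmeans(V, K, n_iters=100, seed=0):
    # V: (N, m) array of observations v_i; K: number of clusters.
    V = np.asarray(V, dtype=float)
    rng = np.random.default_rng(seed)
    N = V.shape[0]
    centers = V[rng.choice(N, size=K, replace=False)]   # initial centers
    labels = -np.ones(N, dtype=int)
    for _ in range(n_iters):
        # Assignment step: attach every observation to its nearest center.
        dist2 = ((V[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist2.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no further improvement of Var(S) is possible
        labels = new_labels
        # Update step: each center is the least-squares estimate (1.2).
        for k in range(K):
            if np.any(labels == k):
                centers[k] = V[labels == k].mean(axis=0)
    return labels, centers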

However, it was soon discovered in Garey et al. (1982) that problem (1.3) is NP-hard even for \(K = 2\) (see also Aloise et al. 2009; Dasgupta and Freund 2009; Kleinberg et al. 1998). In order to get rid of the combinatorial nature of problem (1.3), it was suggested in Dunn (1973) and Bezdek (1981) to use fuzzy (or soft) clustering. In this approach, the positions of the centers \(C = (c_1, \ldots , c_K)\) of the clusters become variables. At the same time, each element \(v_i\) participates in cluster k up to a certain degree \(p_i^{(k)}\). The smaller this membership value is, the lower the probability that item i is attributed to cluster k. Usually, the dependence of these values on the positions of the cluster centers is given by the following expressions:

$$\begin{aligned} p_i^{(k)}= & {} \pi _i^{(k)}(C) \; {\mathop {=}\limits ^{\mathrm {def}}}\; \left[ \sum \limits _{j=1}^K \left( {\Vert v_i - c_k \Vert _2 \over \Vert v_i - c_j \Vert _2} \right) ^{2 \over \sigma - 1} \right] ^{-\sigma },\\&k = 1, \ldots , K, \quad i = 1, \ldots , N, \end{aligned}$$

where the parameter \(\sigma > 1\) is called the fuzzifier of the system. In the absence of reliable information on its reasonable value, the usual choice is \(\sigma = 2\).

Thus, in order to compute so-called C-means soft clustering, we need to solve the following nonlinear minimization problem:

$$\begin{aligned} \min \limits _{C \in {\mathbb {R}}^{m \times K}} \left\{ \; \text{ Var}_{\pi }(C) = \sum \limits _{i=1}^N \sum \limits _{k=1}^K \pi _i^{(k)}(C) \Vert v_i - c_k \Vert _2^2 \;\right\} . \end{aligned}$$
(1.4)

For \(\sigma > 1\), this can be done by standard algorithms of general nonlinear optimization. However, note that the function \(\text{ Var}_{\pi }(\cdot )\) can have a very complicated topological structure. Hence, the only possible theoretical guarantee for these algorithms is convergence to local minima, which are most probably not unique.
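
For illustration, a minimal NumPy sketch of the soft assignment used in fuzzy C-means could look as follows. It follows the standard convention in which the memberships of every observation sum to one (conventions for the fuzzifier exponent vary in the literature); the helper names are illustrative.

import numpy as np

def fcm_memberships(V, C, sigma=2.0, eps=1e-12):
    # Soft membership values for given centers C, under the standard
    # fuzzy C-means convention (memberships of every observation sum to one);
    # sigma > 1 is the fuzzifier, sigma = 2 being the usual default.
    # V: (N, m) observations, C: (K, m) centers.
    dist = np.linalg.norm(V[:, None, :] - C[None, :, :], axis=2) + eps  # (N, K)
    w = dist ** (-2.0 / (sigma - 1.0))
    return w / w.sum(axis=1, keepdims=True)                             # (N, K)

def soft_objective(V, C, P):
    # Value of a soft-clustering objective of type (1.4) for given memberships P.
    dist2 = ((V[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return float((P * dist2).sum())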

At this moment, there exist hundreds of different algorithms for computing hard and soft clusterings (see, for example, the surveys Bora and Gupta 2014; Peters et al. 2013; Xu and Wunsch 2005; Yang 1993). However, to the best of our knowledge, all of them are heuristic. None of them is supported by rigorous theoretical statements on its efficiency and the quality of the generated solutions. Such a situation is not very surprising, since both variants of the clustering problem, (1.3) and (1.4), look computationally difficult in view of their non-convexity.

The main goal of this paper is to show that a small change in the problem setting makes our goal computationally tractable. In order to explain the origin of our approach, let us try to find some successful examples of clustering in big real-life systems. For our purposes, the most interesting one is a proper understanding of electoral procedures in modern democratic states.

Since political parties must reflect the aggregate preferences of a big group of voters, this is indeed a good example of clustering in real life, which has already proved its efficiency by two hundred years of elections in the USA. Note that this procedure has an enormous stabilizing effect, which allows the country to recover quickly after external and internal shocks and to keep functioning in accordance with the interests and experience of the society.

In Operations Research, the behavior of a stable system is usually explained by the existence of a convex potential function, which must be minimized by the natural evolution of the system. Thus, one of the main goals of this paper is the construction of a convex model for sequential elections, which can help in soft clustering of voters, taking into account their internal preferences. In our model, each voter is characterized by a set of opinions (features), which can be compared with the positions of parties. The opinions of voters are fixed. However, the positions of parties may be adjusted in accordance with the observed opinions of attracted voters. At the same time, a party cannot go too far from its basic declarations (core values). For each voter, we compute the probabilities of selecting each party. These probabilities are adjusted after each round of consecutive elections. Thus, in our approach, the probabilities serve as a computational tool for approaching a good clustering.

Our main result is that under some natural behavioral assumptions, the electoral system is stable. This means that the sequential election procedure converges to a unique fixed point. Moreover, this convergence is fast (linear). Thus, we get a computationally efficient algorithm for computing soft clusters. It has the form of an alternating minimization method for a convex potential function. From the technical point of view, the main novelty of our approach consists in a complete elimination of the least-squares principle [compare with (1.1), (1.2), and (1.4)]. Instead, the membership values are computed, for example, as

$$\begin{aligned} p_i^{(k)} = {\mathrm{e}^{-\Vert v_i - c_k \Vert /\mu } \over \sum \nolimits _{j=1}^K \mathrm{e}^{-\Vert v_i - c_j \Vert /\mu }}, \quad k =1, \ldots , K, \; i = 1, \ldots , N, \end{aligned}$$
(1.5)

where \(\mu > 0\) is a volatility coefficient. Note that the norms in this expression are not squared.
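
A minimal NumPy sketch of the rule (1.5) is given below; the shift by the row maximum is only a standard numerical-stability device and cancels mathematically, and the helper name is illustrative.

import numpy as np

def membership_softmax(V, C, mu):
    # Membership values of type (1.5): a softmax of negative (unsquared)
    # distances, scaled by the volatility coefficient mu > 0.
    # V: (N, m) observations, C: (K, m) cluster positions.
    dist = np.linalg.norm(V[:, None, :] - C[None, :, :], axis=2)  # (N, K)
    z = -dist / mu
    z -= z.max(axis=1, keepdims=True)        # shift for numerical stability
    w = np.exp(z)
    return w / w.sum(axis=1, keepdims=True)  # each row lies in the simplex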

1.2 Contents

The paper is organized as follows: In Sect. 2, we develop all necessary mathematical tools for analyzing the global convergence of an alternating minimization scheme as applied to a bivariate potential function with a multiplicative interaction term. We prove sufficient conditions for its convexity and establish a linear rate of convergence for alternating minimization.

In Sect. 3, we describe our electoral model and introduce the main behavioral assumptions for the voters and political parties. We define stable systems as systems for which the sequential election procedure has a unique fixed point. In Sect. 4, we justify the conditions for stability, using the bivariate potential function studied in Sect. 2. It is important that for measuring the divergence between the opinions of voters and the positions of the parties we can use an arbitrary Lipschitz continuous function. In any case, the rate of convergence of the whole process is linear.

In Sect. 5, we rewrite the stability condition of the system in terms of the individual tolerance of the parties and discuss the computational cost of the proposed model. In the simplest case of measuring the divergence in opinions by a usual norm, this cost is proportional to \(O({1 \over \epsilon ^2})\), where \(\epsilon \) is the desired accuracy of the computed solution. However, we can use smooth approximations of the norms, or distances based on the Huber function. In this case, the computational cost of our model is reduced: it becomes proportional to \(\ln {1 \over \epsilon }\) multiplied by some factors depending on the parameters of our model.

In Sect. 6, we analyze a clustering scheme based on a direct minimization of the bivariate potential. For doing that, we replace the probabilistic assignment rule (1.5) by a Euclidean projection onto the standard simplex. As a consequence, we can apply to our model the standard gradient methods of smooth convex optimization with linear rate of convergence. Consequently, we get a worst-case estimate for the total computational expenses of the order of \(O(KmN \ln {1 \over \epsilon })\) arithmetic operations. This is the fastest clustering scheme in our paper. Recall that, to the best of our knowledge, in this paper we present the first complexity results in this field.

1.3 Notation and generalities

In this paper, we denote by \({\mathbb {E}}\) (with an optional subscript) a finite-dimensional real vector space, and by \({\mathbb {E}}^*\) the corresponding dual space. For a linear function \(s \in {\mathbb {E}}^*\), we denote its value at \(x \in {\mathbb {E}}\) by \(\langle s, x \rangle _{{\mathbb {E}}}\). If no ambiguity arises, the subscript is usually omitted. For any norm \(\Vert \cdot \Vert _{{\mathbb {E}}}\) in \({\mathbb {E}}\), we define its dual norm in the standard way,

$$\begin{aligned} \Vert s \Vert ^*_{{\mathbb {E}}} = \max \limits _{x \in {\mathbb {E}}} \{ \langle s, x \rangle : \; \Vert x \Vert _{{\mathbb {E}}} \le 1 \}, \quad s \in {\mathbb {E}}^*, \end{aligned}$$
(1.6)

which ensures the validity of the Cauchy–Schwarz inequality:

$$\begin{aligned} \langle s, x \rangle _{{\mathbb {E}}} \le \Vert s \Vert ^*_{{\mathbb {E}}} \cdot \Vert x \Vert _{{\mathbb {E}}}, \quad s \in {\mathbb {E}}^*, \; x \in {\mathbb {E}}. \end{aligned}$$
(1.7)

If \({\mathbb {E}}= {\mathbb {R}}^n\), the space of real column vectors \(x = (x^{(1)}, \ldots , x^{(n)})^{\mathrm{T}}\), then \({\mathbb {E}}^* = {\mathbb {R}}^n\) and

$$\begin{aligned} \langle s, x \rangle = \sum \limits _{i=1}^n s^{(i)} x^{(i)}, \quad s, x \in {\mathbb {R}}^n. \end{aligned}$$

Notation \({\mathbb {R}}^n_+\) is used for the positive orthant:

$$\begin{aligned} {\mathbb {R}}^n_+ = \{ x \in {\mathbb {R}}^n: \, x^{(i)} \ge 0, \, i=1,\ldots , n \}, \end{aligned}$$

and by \(\Delta _n\) we denote the standard simplex:

$$\begin{aligned} \Delta _n = \left\{ x \in {\mathbb {R}}^n_+: \, \sum \limits _{i=1}^n x^{(i)} =1 \right\} . \end{aligned}$$

The standard notation is used for \(\ell _p\) norms:

$$\begin{aligned} \Vert x \Vert _p = \left[ \sum \limits _{i=1}^n | x^{(i)} |^p \right] ^{1/p}, \quad x \in {\mathbb {R}}^n, \; p \ge 1, \end{aligned}$$

with \(\Vert x \Vert _{\infty } = \max \limits _{1 \le i \le n} | x^{(i)} |\).

For a function \(f(x), x \in {\mathbb {E}}\), we denote by \(\nabla f(x) \in {\mathbb {E}}^*\) its gradient at x. If f is a non-differentiable convex function, the same notation is used for its subgradient. If a function \(f(\cdot , \cdot )\) has two arguments, the notation \(\nabla _2 f(x,y)\) denotes its gradient with respect to the variable y. Finally, for a twice differentiable function f we denote by \(\nabla ^2 f(x)\) its Hessian at x. For fixed x, this is a linear operator from \({\mathbb {E}}\) to \({\mathbb {E}}^*\).

Recall that a function f is called strongly convex on a convex set \(Q \subseteq {\mathbb {E}}\) with convexity parameter \(\sigma > 0\) if for all \(x, y \in Q\) and \(\alpha \in [0,1]\) we have

$$\begin{aligned}&f(\alpha x+ (1-\alpha )y) \le \alpha f(x) + (1-\alpha ) f(y)\nonumber \\&\quad - {\sigma \over 2} \alpha (1-\alpha ) \Vert x - y \Vert ^2_{{\mathbb {E}}}. \end{aligned}$$
(1.8)

The most important property of a strongly convex function is that it attains its unique minimum on any closed convex set Q. Therefore, if \(x^* = \arg \min \nolimits _{x \in Q} f(x)\), then for any \(x \in Q\) and \(\alpha \in (0,1)\) we have

$$\begin{aligned}&f(x^*)&\le f(\alpha x^* + (1-\alpha )x) \; {\mathop {\le }\limits ^{(1.8)}} \; \alpha f(x^*) \\&\quad + \,(1-\alpha ) f(x)-{\sigma \over 2} \alpha (1-\alpha ) \Vert x - x^* \Vert ^2_{{\mathbb {E}}}. \end{aligned}$$

Dividing the difference of the right- and left-hand sides of this inequality by \((1-\alpha )\) and letting \(\alpha \) tend to one, we get

$$\begin{aligned} f(x) \ge f(x^*) + {\sigma \over 2} \Vert x - x^* \Vert ^2_{{\mathbb {E}}}, \quad x \in Q. \end{aligned}$$
(1.9)

2 Bivariate potential with multiplicative interaction cost

Let \({\mathbb {E}}, {\mathbb {E}}_1\), and \({\mathbb {E}}_2\) be three finite-dimensional real vector spaces. Consider the following function of two variables:

$$\begin{aligned} \Phi (x,p) = f(x) + g(x,p) + h(p), \quad x \in {\mathbb {E}}_1, \; p \in {\mathbb {E}}_2. \end{aligned}$$
(2.1)

where functions f and h are closed and convex on their domains. Our main structural assumption on function \(\Phi \) is that the interaction term g has a multiplicative form:

$$\begin{aligned} g(x,p)= & {} \langle G_1(x), G_2(p) \rangle _{{\mathbb {E}}}, \nonumber \\&\quad G_1: {\mathbb {E}}_1 \rightarrow {\mathbb {E}}^*, \quad G_2: {\mathbb {E}}_2 \rightarrow {\mathbb {E}}. \end{aligned}$$
(2.2)

In Sect. 4, we will see an application example for a potential having exactly this structure.

The main goal of this section consists in developing natural conditions for convexity of function \(\Phi \) and analyzing the performance of alternating minimization approach for solving the corresponding optimization problems.

Denote by \(Q_1\) and \(Q_2\) the closed convex feasible sets in the spaces \({\mathbb {E}}_1\) and \({\mathbb {E}}_2\), respectively, and let \(Q = Q_1 \times Q_2\).

Assumption 1

For any \(x \in Q_1\) function \(g(x, \cdot ) = \langle G_1(x), G_2(\cdot ) \rangle \) is closed and convex on \(Q_2\), and for any \(p \in Q_2\) function \(g(\cdot ,p) = \langle G_1(\cdot ), G_2(p) \rangle \) is closed and convex on \(Q_1\).

In the remaining part of this section, we always assume that Assumption 1 is satisfied.

Example 1

Assumption 1 is valid, for example, for linear operators:

$$\begin{aligned} G_1(x) = A_1 x + b_1, \quad G_2(p) \; = \; A_2 p + b_2. \end{aligned}$$

Another important example is \(g(x,p) = \langle G_1(x), p \rangle \), where \(p \in {\mathbb {R}}^n_+\) and all components of the vector function \(G_1(x)\) are convex.

Assumption 1 has the following important consequence.

Lemma 1

For any two points \(z_0 = (x_0,p_0)\) and \(z_1 = (x_1,p_1)\) from Q and any \(\alpha \in [0,1]\), we have

$$\begin{aligned}&\langle G_1((1-\alpha )x_0 + \alpha x_1), G_2((1-\alpha )p_0 +\alpha p_1) \rangle \nonumber \\&\quad \le \langle (1-\alpha )G_1(x_0) + \alpha G_1(x_1),\nonumber \\&\quad (1-\alpha ) G_2(p_0) + \alpha G_2(p_1) \rangle . \end{aligned}$$
(2.3)

Proof

Denote \(z_{\alpha } = (x_{\alpha }, p_{\alpha }) = (1-\alpha )z_0 + \alpha z_1\). Then in view of Assumption 1 we have

$$\begin{aligned} \langle G_1(x_{\alpha }), G_2(p_{\alpha }) \rangle \le \langle G_1(x_{\alpha }), (1-\alpha )G_2(p_0) + \alpha G_2(p_1) \rangle . \end{aligned}$$

On the other hand, by the same reason

$$\begin{aligned} \langle G_1(x_{\alpha }), G_2(p_i) \rangle\le & {} \langle (1-\alpha )G_1(x_0)\\&+\,\alpha G_1(x_1), G_2(p_i) \rangle , \quad i = 0, 1. \end{aligned}$$

Putting all inequalities together, we get (2.3). \(\square \)

Now we can justify a sufficient condition for convexity of the potential \(\Phi (\cdot ,\cdot )\).

Theorem 1

Let function f be strongly convex on \(Q_1\) with parameter \(\sigma _1\), and function h be strongly convex on \(Q_2\) with parameter \(\sigma _2\). Assume that the operators \(G_1(\cdot )\) and \(G_2(\cdot )\) are Lipschitz continuous:

$$\begin{aligned} \Vert G_1(x_0) - G_1(x_1) \Vert ^*_{{\mathbb {E}}}\le & {} L_1 \Vert x_0 - x_1 \Vert _{{\mathbb {E}}_1}, \,\, x_0, x_1 \in Q_1,\nonumber \\ \Vert G_2(p_0) - G_2(p_1) \Vert _{{\mathbb {E}}}\le & {} L_2 \Vert p_0 - p_1 \Vert _{{\mathbb {E}}_2}, \,\, p_0, p_1 \in Q_2. \end{aligned}$$
(2.4)

If the constants \(L_1\) and \(L_2\) are small enough, namely, if

$$\begin{aligned} L^2_1 L_2^2 \le \sigma _1 \sigma _2, \end{aligned}$$
(2.5)

then function \(\Phi \) is convex on \(Q = Q_1 \times Q_2\). If inequality (2.5) is strict, then \(\Phi \) is strongly convex on Q.

Proof

Let us fix two points \(z_0 = (x_0,p_0)\) and \(z_1 = (x_1,p_1)\) from Q. Consider the intermediate point \(z_{\alpha } = (1-\alpha )z_0 + \alpha z_1\) with \(\alpha \in [0,1]\). Then, since \(p_{\alpha } \in Q_2\), we have

$$\begin{aligned} \Phi (z_{\alpha })&= f(x_{\alpha }) + \langle G_1(x_{\alpha }), G_2(p_{\alpha }) \rangle + h(p_{\alpha }) \\&{\mathop {\le }\limits ^{(2.3)}} f(x_{\alpha }) +\langle (1-\alpha ) G_1(x_0) \\&\quad + \alpha G_1(x_1) , (1-\alpha ) G_2(p_0) + \alpha G_2(p_1) \rangle + h(p_{\alpha })\\&{\mathop {\le }\limits ^{(1.8)}} (1-\alpha ) f(x_0) + \alpha f(x_1) -{\sigma _1 \over 2} \alpha (1-\alpha ) \Vert x_1 - x_0 \Vert ^2 \\&\quad + \langle (1-\alpha ) G_1(x_0) + \alpha G_1(x_1) , (1-\alpha ) G_2(p_0) + \alpha G_2(p_1) \rangle \\&\quad + (1-\alpha ) h(p_0) + \alpha h(p_1) - {\sigma _2 \over 2} \alpha (1-\alpha ) \Vert p_1 - p_0 \Vert ^2\\&= (1-\alpha )[\Phi (z_0) - \langle G_1(x_0),G_2(p_0) \rangle ]\\&\quad +\alpha [\Phi (z_1) - \langle G_1(x_1),G_2(p_1) \rangle ]\\&\quad + \langle (1-\alpha ) G_1(x_0) + \alpha G_1(x_1) , (1-\alpha ) G_2(p_0) + \alpha G_2(p_1) \rangle \\&\quad -{\alpha (1- \alpha ) \over 2} [ \sigma _1 \Vert x_1 - x_0 \Vert ^2 + \sigma _2 \Vert p_1 - p_0 \Vert ^2]\\&= (1-\alpha ) \Phi (z_0) + \alpha \Phi (z_1) -\alpha (1-\alpha ) \delta (z_0,z_1), \end{aligned}$$

where \(\delta (z_0,z_1) = {\sigma _1 \over 2} \Vert x_1 - x_0 \Vert ^2 +{\sigma _2 \over 2} \Vert p_1 - p_0 \Vert ^2 + \langle G_1(x_1) -G_1(x_0), G_2(p_1) - G_2(p_0) \rangle \). Assuming now that \(L_1^2 L_2^2 \le (\sigma _1 - \epsilon )(\sigma _2-\epsilon )\) for some small \(\epsilon \ge 0\), we have

$$\begin{aligned} \delta (z_0,z_1)&{\mathop {\ge }\limits ^{(2.4)}} {\sigma _1 \over 2} \Vert x_1 - x_0 \Vert ^2 + {\sigma _2 \over 2} \Vert p_1 - p_0 \Vert ^2\\&\quad - (\sigma _1 - \epsilon )^{1/2}(\sigma _2-\epsilon )^{1/2} \Vert x_1 - x_0 \Vert \cdot \Vert p_1 - p_0 \Vert \\&\ge {\epsilon \over 2} \left( \Vert x_1 - x_0 \Vert ^2 +\Vert p_1 - p_0 \Vert ^2 \right) . \end{aligned}$$

Thus, if \(\epsilon > 0\), then function \(\Phi \) is strongly convex in view of definition (1.8). If \(\epsilon = 0\), then \(\Phi \) is just a convex function. \(\square \)

A simple example with \(Q_1 = {\mathbb {R}}^n, Q_2 = {\mathbb {R}}^m\), and

$$\begin{aligned} \Phi (x,p) = {\sigma _1 \over 2} \Vert x \Vert _2^2 + \langle A x, p \rangle + {\sigma _2 \over 2} \Vert p \Vert _2^2 \end{aligned}$$

shows that condition (2.5) cannot be improved. In this example, \(L_1 = \Vert A \Vert \) and \(L_2 = 1\), and the Hessian of \(\Phi \) is positive semidefinite if and only if \(\Vert A \Vert ^2 \le \sigma _1 \sigma _2\).

Consider the following minimization problem:

$$\begin{aligned} \min \limits _{z \in Q} \; \Phi (z), \end{aligned}$$
(2.6)

where the parameters of function \(\Phi \) satisfy a strict variant of condition (2.5). In this case, function \(\Phi \) is strongly convex. Consequently, there exists a unique solution

$$\begin{aligned} z^* = (x^*,p^*) \in Q \end{aligned}$$

of the problem (2.6). Let us show that it can be found by the following alternating minimization scheme.

$$\begin{aligned} \begin{array}{ll} \mathbf{Initialization}. &{} \text{ Choose }\ x_0 \in Q_1. \\ \mathbf{Iteration}\ t \ge 0. &{} \text{ a) } \text{ Find }\ p_{t+1} = \arg \min \limits _{p \in Q_2} \Phi (x_t,p). \\ &{} \text{ b) } \text{ Find }\ x_{t+1} = \arg \min \limits _{x \in Q_1} \Phi (x,p_{t+1}). \end{array} \end{aligned}$$
(2.7)
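
In a generic form, scheme (2.7) can be sketched as follows; the two partial minimization oracles are assumed to be supplied by the user, and the stopping test is only illustrative.

import numpy as np

def alternating_minimization(x0, argmin_p, argmin_x, n_iters=100, tol=1e-8):
    # Generic form of scheme (2.7).  argmin_p(x) should return argmin_p Phi(x, p)
    # over Q_2, and argmin_x(p) should return argmin_x Phi(x, p) over Q_1; both
    # maps are supplied by the user.  Under the condition of Theorem 2 the
    # iterates converge linearly to the unique minimizer of Phi.
    x = np.asarray(x0, dtype=float)
    p = None
    for _ in range(n_iters):
        p = argmin_p(x)                               # step a) of (2.7)
        x_new = np.asarray(argmin_x(p), dtype=float)  # step b) of (2.7)
        if np.linalg.norm(x_new - x) <= tol:
            x = x_new
            break
        x = x_new
    return x, p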

In order to analyze the convergence of this scheme, let us define the following operators:

$$\begin{aligned} u(x)= & {} \arg \min \limits _{p \in Q_2} \Phi (x, p) \; \in \; Q_2,\\ v(p)= & {} \arg \min \limits _{x \in Q_1} \Phi (x,p) \; \in \; Q_1. \end{aligned}$$

In view of Assumption 1 and the strong convexity of the functions f and h, the objective functions of both optimization problems above are strongly convex. Hence, the points \(u(\cdot )\) and \(v(\cdot )\) are well defined.

Lemma 2

Let the conditions of Theorem 1 be satisfied. Then, for any \(x_1, x_2 \in Q_1\) we have

$$\begin{aligned} \Vert u(x_1) - u(x_2) \Vert \le {L_1 L_2 \over \sigma _2} \Vert x_1 - x_2\Vert . \end{aligned}$$
(2.8)

Similarly, for any \(p_1\) and \(p_2\) from \(Q_2\) we have

$$\begin{aligned} \Vert v(p_1) - v(p_2) \Vert \le {L_1 L_2 \over \sigma _1} \Vert p_1 - p_2\Vert . \end{aligned}$$
(2.9)

Proof

Indeed, point u(x) is the solution of the problem \(\min \nolimits _{p \in Q_2} \{ \langle G_1(x), G_2(p) \rangle + h(p) \}\) with strongly convex objective function. Therefore, in view of the property (1.9), we have

$$\begin{aligned}&\langle G_1(x_1), G_2(u(x_2)) \rangle + h(u(x_2))\\&\quad \ge \langle G_1(x_1), G_2(u(x_1)) \rangle + h(u(x_1)) \\&\qquad + {\sigma _2 \over 2} \Vert u(x_2) - u(x_1) \Vert ^2,\\&\langle G_1(x_2), G_2(u(x_1)) \rangle + h(u(x_1)) \\&\quad \ge \langle G_1(x_2), G_2(u(x_2)) \rangle + h(u(x_2)) \\&\qquad + {\sigma _2 \over 2} \Vert u(x_2) - u(x_1) \Vert ^2. \end{aligned}$$

Adding these two inequalities, we get

$$\begin{aligned}&\sigma _2 \Vert u(x_2) - u(x_1) \Vert ^2 \\&\quad \le \langle G_1(x_1) -G_1(x_2), G_2(u(x_2)) - G_2(u(x_1)) \rangle \\&\quad \le \Vert G_1(x_1) - G_1(x_2) \Vert ^*_{{\mathbb {E}}} \cdot \Vert G_2(u(x_2)) - G_2(u(x_1)) \Vert _{{\mathbb {E}}}. \end{aligned}$$

Thus, in view of inequality (2.4), we get (2.8). The proof of Inequality (2.9) follows by the same argument. \(\square \)

The following theorem is a trivial consequence of Lemma 2. Define

$$\begin{aligned} T(x) = v(u(x)), \quad S(p) = u(v(p)). \end{aligned}$$

Theorem 2

Let \(\lambda {\mathop {=}\limits ^{\mathrm {def}}}{L_1^2 L_2^2 \over \sigma _1 \sigma _2} < 1\). Then, \(T(\cdot )\) and \(S(\cdot )\) are contracting mappings:

$$\begin{aligned} \Vert T(x_1) - T(x_2) \Vert\le & {} \lambda \Vert x_1 - x_2 \Vert , \quad x_1, x_2 \in Q_1,\nonumber \\ \Vert S(p_1) - S(p_2) \Vert\le & {} \lambda \Vert p_1 - p_2 \Vert , \quad p_1, p_2 \in Q_2. \end{aligned}$$
(2.10)

Note that in terms of operators T and S the process (2.7) can be written in the following way:

$$\begin{aligned} x_{t+1} = T(x_t), \quad p_{t+1} \; = \; S(p_t). \end{aligned}$$
(2.11)

Therefore, for any \(t \ge 0\), we have

$$\begin{aligned}&\Vert x_{t+1} - x^* \Vert = \Vert T(x_t) - T(x^*) \Vert \le \lambda \Vert x_t - x^* \Vert \\&\quad \le \cdots \; \le \; \lambda ^{t+1} \Vert x_0 - x^* \Vert . \end{aligned}$$

A similar rate of convergence can be established for the sequence \(\{ p_t \}_{t \ge 0}\). We conclude that the process (2.7) converges linearly to the solution of problem (2.6).

Note that from the viewpoint of standard optimization theory, problem (2.6) has a possibly non-differentiable objective function with unbounded derivatives (this is allowed by Assumption 1). Its only favorable property is the strong convexity of the objective. However, this alone is not enough for justifying a linear rate of convergence of any standard optimization scheme.

In the next section, we consider an application example, where the equilibrium state of the system can be found by minimizing bivariate potential with a multiplicative interaction term.

3 Electoral model

Let us show how we can use the machinery developed in Sect. 2 for justifying a soft clustering model based on a stable electoral procedure. The basic elements of our model are interpreted as independent voters possessing some features (opinions). They must be attached to different clusters (political parties), which represent these features in the best possible way.

In our model, we have N independent voters and K political parties. Our main assumption is that the voting results are random. Voter i decides to vote for party k with probability \(p_i^{(k)}\). It is convenient to put these probabilities in a vector

$$\begin{aligned} p_i = \left( p_i^{(1)}, \ldots , p_i^{(K)}\right) ^{\mathrm{T}} \in \Delta _{K}. \end{aligned}$$

These vectors can be unified in a matrix \(P = (p_1, \ldots , p_N) \in {\mathbb {R}}^{K \times N}\), which we call the voting matrix. At the beginning of the voting process, this matrix is unknown. However, it will be computed as an outcome of a sequence of consecutive elections.

Let us try to explain the results of elections by some quantitative parameters. We assume that an opinion of voter i can be described by m different real values (personal preferences), which we put in vector \(v_i \in {\mathcal {V}} \subseteq {\mathbb {R}}^m, i = 1, \ldots , N\), where \({\mathcal {V}}\) is a closed convex set (e.g., \({\mathcal {V}} = {\mathbb {R}}^m_+\)). These vectors are fixed during the whole history of consecutive elections.

At the same time, the positions \(x_k \in {\mathcal {V}}\), \(k = 1, \ldots , K\), of the political parties are flexible. After each round of elections, these values can be adjusted to better represent the positions of the voters closely attached to the party. It will be convenient to keep these vectors in a matrix \(X = (x_1, \ldots , x_{K}) \in {\mathbb {R}}^{m \times K}\).

Let us fix some distance function \(\rho (v,x) \ge 0\), which is used for measuring the distance between the opinion \(v \in {\mathcal {V}}\) of a voter and the current position \(x \in {\mathcal {V}}\) of a political party. In what follows, we always assume that for any fixed v the function \(\rho (v, \cdot )\) is convex. A natural choice for this function would be \(\rho (v,x) = \Vert x - v \Vert \), where \(\Vert \cdot \Vert \) is an arbitrary norm in \({\mathbb {R}}^m\). However, in Sect. 5, we will give more examples with some motivation for their use.

Clearly, the bigger the distance between v and x, the smaller should be the probability that this party is selected by this particular voter. In our electoral model, this decision is made according to the discrete choice probabilities of the logit model (see, e.g., Anderson et al. 1992).

Assumption 2

Voter i selects the kth party with probability

$$\begin{aligned} p_i^{(k)}(X) = \mathrm{e}^{- \rho (v_i,x_k)/\mu } / \left[ \sum \limits _{j = 1}^{K} \mathrm{e}^{-\rho (v_i,x_j)/\mu } \right] , \quad k = 1, \ldots , K, \end{aligned}$$
(3.1)

where \(\mu \ge 0\) is the flexibility parameter, which represents the volatility of opinions of voters.

Denote by \(P_*(X)= (p_1(X), \ldots , p_N(X))\) the corresponding voting matrix.

In Assumption 2, the value \(\mu = 0\) corresponds to the deterministic choice: the voter always chooses the party closest to his/her opinion. However, usually this parameter is strictly positive.
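
For illustration, the voting matrix \(P_*(X)\) defined by (3.1) can be computed by the following minimal NumPy sketch; the default Euclidean choice of \(\rho \) and the stabilizing shift of the exponents are illustrative, not part of the model.

import numpy as np

def voting_matrix(V, X, mu, rho=None):
    # Logit choice probabilities (3.1).  V: (N, m) voter opinions, X: (K, m)
    # party positions, mu > 0 the flexibility (volatility) parameter.
    # rho(v, x) is the distance function; by default a Euclidean norm.
    if rho is None:
        rho = lambda v, x: np.linalg.norm(v - x)
    N, K = V.shape[0], X.shape[0]
    G = np.array([[rho(V[i], X[k]) for i in range(N)] for k in range(K)])  # (K, N)
    Z = -G / mu
    Z -= Z.max(axis=0, keepdims=True)          # stabilize the exponentials
    W = np.exp(Z)
    return W / W.sum(axis=0, keepdims=True)    # columns p_i(X) lie in Delta_K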

It is important that the probability vector \(p_i(X)\) has an optimization interpretation. Consider the entropy function

$$\begin{aligned} \eta (p) = \sum \limits _{k=1}^{K} p^{(k)} \ln p^{(k)}, \quad p \in {\mathbb {R}}^{K}_+. \end{aligned}$$

It is easy to check (see, for example, Lemma 4 in Nesterov 2005a) that

$$\begin{aligned} p_i(X) = \arg \min \limits _{p \in \Delta _{K}} \{ \langle g_i(X), p \rangle + \mu \eta (p) \}, \end{aligned}$$
(3.2)

where \(g_i(X) = (\rho (v_i,x_1), \ldots , \rho (v_i,x_{K}))^{\mathrm{T}}\). Note that function \(\eta (\cdot )\) is strongly convex on the standard simplex. Indeed, for any \(p \in \mathrm{int \,}\Delta _{K}\) and \(h \in {\mathbb {R}}^{K}\) we have

$$\begin{aligned} \langle \nabla ^2 \eta (p) h, h\rangle = \sum \limits _{k=1}^{K} {(h^{(k)})^2 \over p^{(k)}}. \end{aligned}$$

At the same time, by the Cauchy–Schwarz inequality (with the minimum attained at \(p^{(k)} = |h^{(k)}| / \Vert h \Vert _1\)),

$$\begin{aligned} \min \limits _{p \in \Delta _{K}} \left\{ \sum \limits _{k=1}^{K} {(h^{(k)})^2 \over p^{(k)}} \right\} =\Vert h \Vert _1^2. \end{aligned}$$
(3.3)

Thus, the entropy function is strongly convex on \(\Delta _{K}\) in \(\ell _1\)-norm with convexity parameter one.

It remains to describe the behavior of the political parties. Each party is able to modify its position in order to attract the maximal number of voters. However, it should not go too far from its core values, which we denote by \(c_k \in {\mathcal {V}}, k=1, \ldots , K\). In order to measure the distance between the current position of a party and its core values, we introduce a prox-function d(x, y). It must satisfy the following conditions:

  • \(d(x,y) \ge 0\) for all \(x,y \in {\mathcal {V}}\).

  • For each \(x \in {\mathcal {V}}\), function \(d(x,\cdot )\) is strongly convex in the second argument with convexity parameter one:

    $$\begin{aligned} d(x,v)\ge & {} d(x,y) + \langle \nabla _2 d(x,y), v - y \rangle +{1\over 2} \Vert v - y \Vert ^2,\nonumber \\&\forall v,y \in {\mathcal {V}}, \end{aligned}$$
    (3.4)

where \(\Vert \cdot \Vert \) is an arbitrary norm in \({\mathbb {R}}^m\).

Let us give several examples of the most important prox-functions.

  • Kullback–Leibler divergence

    $$\begin{aligned} d_1(x,y)= & {} \eta (y) - \eta (x) - \langle \nabla \eta (x), y - x \rangle \\= & {} \sum \limits _{k=1}^{K} y^{(k)} \ln {y^{(k)} \over x^{(k)}}, \quad x, y \in \Delta _{K}. \end{aligned}$$

    This function is strongly convex in y in \(\ell _1\)-norm.

  • Euclidean distance \(d_2(x,y) = {1\over 2} \Vert x - y \Vert ^2_2, x, y \in {\mathbb {R}}^{K}\). This function is strongly convex in \(\ell _2\)-norm.

  • We can take \(\tilde{d}_i(x,y) = d_i(x,y) + \epsilon \Vert x -y \Vert _i\) with \(\epsilon \ge 0, i =1, 2\). The additional linear term gives a party more chances to keep its core values unchanged.

The behavior of political parties is described by the following assumption.

Assumption 3

For a given voting matrix \(P = ( p_1, \ldots , p_{N} ) \in \Delta _{K}^N\), each political party chooses its optimal current position by minimizing the function

$$\begin{aligned} \psi _k(P,x_k) {\mathop {=}\limits ^{\mathrm {def}}}\sum \limits _{i=1}^N p_i^{(k)} \rho (v_i,x_k) + {1 \over \tau } d(c_k,x_k), \quad k = 1, \ldots , K, \end{aligned}$$
(3.5)

in \(x_k \in {\mathcal {V}}\).

In this definition, \(\tau > 0\) is a group tolerance parameter. The objective function (3.5) can be interpreted as the expected distance between the opinions of all attracted voters and the current position of the kth party, augmented by the discrepancy with its core values.

Since \(\psi _k(P, \cdot )\) is strongly convex, it attains its unique minimum \(x^*_k(P)\) over \({\mathcal {V}}\). Denote \(X_*(P) = (x_1^*(P), \ldots , x^*_{K}(P))\).

Consider now the following process of sequential elections:

$$\begin{aligned} \mathbf{Set}\ X_0= & {} (c_1, \ldots , c_{K}).\; \mathbf{Repeat:}\; P_{t+1}\nonumber \\= & {} P_*(X_t), \; X_{t+1} = X_*(P_{t+1}), \; t \ge 0. \end{aligned}$$
(3.6)

The interpretation of process (3.6) is straightforward:

Given the current positions of political parties \(X_t\), voters announce their preferences \(P_{t+1}\) during the electoral poll. After observing the results, parties update their positions \(X_{t+1}\) for the next elections.
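
A schematic implementation of the process (3.6) is sketched below for the particular choices \({\mathcal {V}} = {\mathbb {R}}^m\), \(\rho (v,x) = \Vert x - v \Vert _2\) and \(d(c,x) = {1\over 2} \Vert x - c \Vert _2^2\); each party subproblem (3.5) is solved only approximately by a few subgradient steps, and voting_matrix is the helper sketched after (3.1). This is an illustration of the mechanics of the process, not a production implementation.

import numpy as np

def party_update(P, V, C, tau, n_steps=200):
    # Approximate minimizer X_*(P) of (3.5) for every party, assuming
    # rho(v, x) = ||x - v||_2 and d(c, x) = ||x - c||_2^2 / 2.  Each subproblem
    # is strongly convex with parameter 1/tau, so we run a plain subgradient
    # method with the classical 2/(sigma*(t+1)) step size.
    K, m = C.shape
    X = C.copy()
    sigma = 1.0 / tau
    for k in range(K):
        x = C[k].copy()
        for t in range(n_steps):
            diff = x - V                                  # (N, m)
            norms = np.linalg.norm(diff, axis=1)
            norms[norms == 0.0] = 1.0                     # subgradient 0 at v_i
            g = (P[k][:, None] * diff / norms[:, None]).sum(axis=0) \
                + (x - C[k]) / tau
            x -= 2.0 / (sigma * (t + 1)) * g
        X[k] = x
    return X

def sequential_elections(V, C, mu, tau, n_rounds=50):
    # Process (3.6): start from the core values, then alternate voting
    # (electoral poll) and position updates of the parties.
    X = C.copy()
    for _ in range(n_rounds):
        P = voting_matrix(V, X, mu)        # voters announce their preferences
        X = party_update(P, V, C, tau)     # parties adjust their positions
    return X, P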

Definition 1

An electoral system is called stable if the process (3.6) has a unique limiting point, which is independent of the starting position \(X_0 \in {\mathcal {V}}^{K}\).

In the next section, we derive some sufficient conditions for electoral stability. Note that in our model we have two tolerance parameters, \(\mu \) and \(\tau \).

4 Stable electoral systems

Let us describe the electoral process (3.6) using the framework of Sect. 2. Denote by \({\mathbb {E}}_1 = {\mathbb {R}}^{m \times K}\) the space of positions \(X = (x_1, \ldots , x_{K})\) of the parties. Then, \(Q_1 = {\mathcal {V}}^{K}\). For the voting matrix \(P=(p_1, \ldots , p_N)\), we introduce the space \({\mathbb {E}}_2 = {\mathbb {R}}^{K \times N}\) and the feasible set \(Q_2 = \Delta _{K}^N\). For the interaction term, we have \({\mathbb {E}}= {\mathbb {E}}_2\), \(G_2(P) \equiv P\), and \(g(X,P)= \langle G_1(X), P \rangle _{{\mathbb {E}}_2}\), where the distance matrix

$$\begin{aligned} G_1(X) = (g_1(X), \ldots , g_N(X)) \end{aligned}$$

is component-wise convex in X. Since all matrices \(P \in Q_2\) have nonnegative elements, the function \(g(\cdot ,\cdot )\) satisfies Assumption 1.

In view of the rules (3.2) and (3.5), the electoral process (3.6) can be seen as the alternating minimization scheme (2.7) applied to the following bivariate potential:

$$\begin{aligned} \Phi (X,P)= & {} {1 \over \tau } f(X) + \langle G_1(X), P \rangle _{{\mathbb {E}}_2} + \mu h(P),\nonumber \\ f(X)= & {} \sum \limits _{k=1}^{K} d(c_k,x_k),\nonumber \\ h(P)= & {} \sum \limits _{i=1}^N \eta (p_i). \end{aligned}$$
(4.1)

Now we need to choose appropriate norms in \({\mathbb {E}}_1\) and \({\mathbb {E}}_2\), which ensure good values of the convexity parameters of the functions f and h. For \(X \in Q_1\) and a direction \(U \in {\mathbb {E}}_1\), we have

$$\begin{aligned} \langle \nabla ^2 f(X)U,U\rangle&{\mathop {=}\limits ^{(4.1)}}&\sum \limits _{k=1}^{K} \langle \nabla ^2_2 d(c_k,x_k) u_k, u_k \rangle \\&{\mathop {\ge }\limits ^{(3.4)}}&\sum \limits _{k=1}^{K} \Vert u_k \Vert ^2. \end{aligned}$$

Thus, the natural choice of the norm for \({\mathbb {E}}_1\) is as follows:

$$\begin{aligned} \Vert U \Vert _{{\mathbb {E}}_1} =\left[ \sum \limits _{k=1}^{K} \Vert u_k \Vert ^2 \right] ^{1/2}, \quad U \in {\mathbb {E}}_1. \end{aligned}$$
(4.2)

In this case, the convexity parameter of function f is one, and we can choose \(\sigma _1 = {1 \over \tau }\).

For \(P \in Q_2\) and direction \(H \in {\mathbb {E}}_2\), we have

$$\begin{aligned} \langle \nabla ^2 h(P) H, H \rangle&{\mathop {=}\limits ^{(4.1)}}&\sum \limits _{i=1}^N \langle \nabla ^2 \eta (p_i) h_i, h_i \rangle \\&{\mathop {\ge }\limits ^{(3.3)}}&\sum \limits _{i=1}^N \Vert h_i \Vert ^2_1. \end{aligned}$$

Therefore, the natural choice for the norm in \({\mathbb {E}}_2\) is

$$\begin{aligned} \Vert H \Vert _{{\mathbb {E}}_2} = \left[ \sum \limits _{i=1}^N \Vert h_i \Vert ^2_1 \right] ^{1/2}, \quad H \in {\mathbb {E}}_2. \end{aligned}$$

In this case, function h has convexity parameter one and we can choose \(\sigma _2 = \mu \).

It remains to estimate the Lipschitz constants in inequality (2.4). In our case, \(L_2 = 1\). For estimating \(L_1\), we first need to compute the dual norm \(\Vert \cdot \Vert ^*_{{\mathbb {E}}_2}\). For an arbitrary \(S \in {\mathbb {R}}^{K \times N}\), we have

$$\begin{aligned}&\max \limits _{\Vert H \Vert _{{\mathbb {E}}_2} \le 1} \langle S, H \rangle \\&\quad =\max \limits _{H,\tau } \left\{ \sum \limits _{i=1}^N \langle s_i, h_i \rangle : \; \Vert h_i \Vert _1 \le \tau ^{(i)}, i = 1, \ldots , N, \; \Vert \tau \Vert _2 \le 1 \right\} \\&\quad = \max \limits _{\tau } \left\{ \sum \limits _{i=1}^N \tau ^{(i)} \Vert s_i \Vert _{\infty } : \Vert \tau \Vert _2 \le 1 \right\} . \end{aligned}$$

Thus, we get the following dual norm:

$$\begin{aligned} \Vert S \Vert _{{\mathbb {E}}_2}^* = \left[ \sum \limits _{i=1}^N \Vert s_i \Vert _{\infty }^2 \right] ^{1/2}, \quad S \in {\mathbb {R}}^{K \times N}. \end{aligned}$$

Now we can get the Lipschitz constant of the mapping \(G_1(\cdot )\). For that, we need one more assumption.

Assumption 4

For any \(v \in {\mathcal {V}}\), function \(\rho (v,\cdot )\) is Lipschitz continuous in its second argument with constant one.

Then, for X and Y from \(Q_1\), we have

$$\begin{aligned} \Vert G_1(X) - G_1(Y) \Vert ^*_{{\mathbb {E}}_2} = \left[ \sum \limits _{i=1}^N \Vert g_i(X) - g_i(Y) \Vert _{\infty }^2 \right] ^{1/2}. \end{aligned}$$

Note that

$$\begin{aligned} g_i(X)= & {} (\rho (v_i,x_1), \ldots , \rho (v_i,x_{K})),\\ g_i(Y)= & {} (\rho (v_i,y_1), \ldots , \rho (v_i,y_{K})). \end{aligned}$$

Therefore, in view of Assumption 4,

$$\begin{aligned}&\Vert g_i(X) - g_i(Y) \Vert _{\infty } \\&\quad = \max \limits _{1 \le k \le K} | \rho (v_i,x_k) - \rho (v_i,y_k) | \; \le \; \max \limits _{1 \le k \le K} \Vert x_k - y_k \Vert . \end{aligned}$$

Hence,

$$\begin{aligned}&\Vert G_1(X) - G_1(Y) \Vert ^*_{{\mathbb {E}}_2} \\&\quad \le N^{1/2} \max \limits _{1 \le k \le K} \Vert x_k - y_k \Vert \; {\mathop {\le }\limits ^{(4.2)}} \; N^{1/2} \, \Vert X - Y \Vert _{{\mathbb {E}}_1}. \end{aligned}$$

Thus, mapping \(G_1\) is Lipschitz continuous with constant \(L_1=N^{1/2}\). Now, using Theorem 2, we come to the following statement.

Theorem 3

Let the behavior of the voters and political parties satisfy Assumptions 2 and 3 with tolerance parameters satisfying

$$\begin{aligned} N < {\mu \over \tau }. \end{aligned}$$
(4.3)

Then, the corresponding electoral system is stable. Moreover, for the stationary voting matrix \(P_* = \lim \nolimits _{t \rightarrow \infty } P_t\), where matrices \(\{P_t \}\) are generated by the process (3.6), we have

$$\begin{aligned} \Vert P_t - P_* \Vert _{{\mathbb {E}}_2} \le \lambda ^t N^{1/2} ,\quad t \ge 0. \end{aligned}$$
(4.4)

Proof

Note that condition (4.3) is a strict version of condition (2.5). In terms of Theorem 2, we have \(P_{t+1} = S(P_t)\) and

$$\begin{aligned} \Vert S(P_1) - S(P_2) \Vert \le \lambda \Vert P_1 - P_2 \Vert \quad \forall P_1,P_2 \in Q_2. \end{aligned}$$

Note that for any \(p_1, p_2 \in \Delta _{K}\) we have \(\Vert p_1 - p_2 \Vert _1 \le 1\). Therefore, for any \(P \in Q_2\) it holds that \(\Vert P - P_* \Vert \le N^{1/2}\). Thus, by Theorem 2, for all \(t \ge 1\) we get \(\Vert P_{t} - P_* \Vert \le \lambda ^t N^{1/2}\). \(\square \)

5 Computational aspects of alternating minimization

Let us find an interpretation of the stability condition (4.3). For that, we introduce the individual tolerance parameter

$$\begin{aligned} \hat{\tau } = \tau N. \end{aligned}$$

Then, the optimization problem (3.5), which defines the response of a party to the voting probabilities, can be rewritten as

$$\begin{aligned} \min \limits _{x_k \in {\mathcal {V}}} \left\{ \hat{\psi }_k(P,x_k) = {1 \over N} \sum \limits _{i=1}^N p_i^{(k)} \rho (v_i,x_k) + {1 \over \hat{\tau }} d(c_k,x_k) \right\} . \end{aligned}$$
(5.1)

This means that each party reacts to the average opinion of the attracted voters. With this notation, we have \(\lambda = {L_1^2 L_2^2 \over \sigma _1 \sigma _2} = {N \tau \over \mu } = {\hat{\tau } \over \mu }\), so the stability condition \(\lambda < 1\) can be rewritten as

$$\begin{aligned} \hat{\tau } < \mu . \end{aligned}$$
(5.2)

In view of its importance, we state this condition in the form of a theorem.

Theorem 4

An electoral system is stable if the individual tolerance of the parties is smaller than the volatility of the voters.

This condition looks indeed natural for guaranteeing the stability of an electoral system: parties must be more conservative in changing their opinions than the voters. In other words, excessive populism of the parties may cause political instability.

Let us now discuss the computational complexity of the electoral process. For simplicity, we assume that the set \({\mathcal {V}}\) is bounded: \(D {\mathop {=}\limits ^{\mathrm {def}}}\max \nolimits _{x, y \in {\mathcal {V}}} \Vert x - y \Vert <\infty \). Each step of the recurrence (3.6) consists of two time-consuming operations.

  • Computation of the optimal voting matrix \(P_*(X)\). For that, we first need to compute the distance matrix \(G_1(X)\). If the cost of computing the value \(\rho (v,x)\) is O(m) operations, then the distance matrix can be computed in O(KmN) operations. After that, we can apply the rule (3.1), which is cheaper: O(KN) operations. Thus, in total we need

    $$\begin{aligned} T_1 = O(K m N) \end{aligned}$$

    operations.

  • Computation of the new positions of the parties \(X_*(P)\). For that, we need to solve K optimization problems (5.1). Note that even a single computation of the objective functions of all these problems needs O(KmN) operations. Therefore, this step dominates the complexity of our approach. The complexity of problem (5.1) depends on the efficiency of the applied optimization schemes, and their potential efficiency depends on the properties of the function \(\rho (v,\cdot )\). Let us discuss these aspects in more detail. Note that the objective function of this problem is strongly convex with parameter

    $$\begin{aligned} \hat{\sigma } = {1 \over \hat{\tau }}. \end{aligned}$$

In our approach, in view of Assumption 4, we cannot follow the standard advice to use

$$\begin{aligned} \rho (v,x)= {1\over 2} \Vert x - v \Vert _2^2, \quad d(c,x) = {1\over 2} \Vert x - c \Vert _2^2. \end{aligned}$$

Recall that this choice is mainly motivated by the extremely low cost of solving problem (5.1) (only O(mN) operations). However, we pay for it by the non-convexity of the potential, the existence of many local minima, and, most probably, the NP-hardness of finding the global solution.

Thus, we need to choose a Lipschitz continuous distance function. Let us look at the most reasonable variants.

  1. 1.

    Arbitrary norm \(\rho _1(v,x) = \Vert v - x \Vert \). This function clearly satisfies Assumption 4. With this choice, problem (5.1) becomes a problem of convex optimization with a non-differentiable strongly convex objective function. From the general theory, we know that an \(\epsilon \)-solution of this problem can be found in

    $$\begin{aligned} O\left( {\hat{\tau } \over \epsilon } M^2 \right) \end{aligned}$$
    (5.3)

    iterations, where M is an upper bound for the norms of the subgradients of the function \(\hat{\psi }_k(P,\cdot )\), and \(\epsilon \) is the required accuracy of solving problem (5.1). Since the objective function of this problem has an implicit structure, we can apply the smoothing technique (see Nesterov 2005a, b) for finding an \(\epsilon \)-solution of this problem in \(O\left( \sqrt{\hat{\tau } \over \epsilon } M \right) \) iterations. This bound is much better than (5.3). In order to estimate the total computational time, we need to multiply this bound by K, the number of parties, and by mN, the complexity of computing the value and a subgradient of the function \(\hat{\psi }_k(P,\cdot )\). Thus, we get the following estimate for the computational time:

    $$\begin{aligned} T_2 = \sqrt{\hat{\tau } \over \epsilon } M \cdot K m N. \end{aligned}$$

    Note that this bound depends on two parameters, which can potentially be big. The accuracy \(\epsilon \) must be chosen proportionally to the required accuracy of approximation of the matrix \(P_*\). Thus, the impact of the term \({1 \over \sqrt{\epsilon }}\) can be significant. The constant M can be estimated from above as follows:

    $$\begin{aligned} M&= \max \limits _{x \in {\mathcal {V}}} \Vert \nabla _2 \hat{\psi }_k(P,x)\Vert ^* \\&{\mathop {\le }\limits ^{(5.1)}} \; {1 \over N} \sum \limits _{i=1}^N p_i^{(k)} + {1 \over \hat{\tau }} \max \limits _{x \in {\mathcal {V}}} \Vert \nabla _2d(c_k,x) \Vert ^*\\&\le 1 + {1 \over \hat{\tau }} \max \limits _{x \in {\mathcal {V}}} \Vert \nabla _2d(c_k,x) \Vert ^*. \end{aligned}$$

    Thus, M can be big if the diameter D is big.

  2. 2.

    Smoothed Euclidean norm \(\rho _2(v,x) = \sqrt{\delta ^2 + \Vert v - x \Vert ^2_2}\), where \(\delta > 0\) is a smoothing parameter. In this case, it is natural to choose \(d(c,x) ={1\over 2} \Vert x - c \Vert _2^2\). Note that \(\rho _2(v,\cdot )\) is infinitely differentiable in its second argument. Moreover,

    $$\begin{aligned} \nabla _2\rho (v,x)&= {x - v \over [\delta ^2 + \Vert x-v \Vert ^2_2]^{1/2}}, \quad \Vert \nabla _2 \rho (v,x) \Vert _2 \; \le \; 1,\nonumber \\ \nabla ^2_2 \rho (v,x)&= {I \over [\delta ^2 + \Vert x-v \Vert ^2_2]^{1/2}} -{(x - v)(x-v)^{\mathrm{T}} \over [\delta ^2 + \Vert x-v \Vert ^2_2]^{3/2}} \; \preceq \; {1 \over \delta } I. \end{aligned}$$
    (5.4)

    Thus, the function \(\rho _2(v,\cdot )\) satisfies Assumption 4, and it has a Lipschitz continuous gradient with constant \({1 \over \delta }\). Consequently, the function \(\hat{\psi }_k(P,\cdot )\) has a Lipschitz continuous gradient with constant \(L = {1 \over \delta } + {1 \over \hat{\tau }}\). Since this function is strongly convex with parameter \({1 \over \hat{\tau }}\), we conclude that its condition number is equal to

    $$\begin{aligned} \kappa = L/ \hat{\sigma } \; = \; 1 + {\hat{\tau } \over \delta }. \end{aligned}$$

    Thus, problem (5.1) can be solved by the usual gradient method in \(O\left( \kappa \ln {LD^2 \over \epsilon } \right) \) iterations. For the fast gradient methods, we get the bound of \(O\left( \kappa ^{1/2} \ln {LD^2 \over \epsilon } \right) \) iterations (see Sections 2.1 and 2.2 in Nesterov 2004). In the latter case, we get the following estimate of the computational time:

    $$\begin{aligned} T_3 = O\left( \sqrt{ 1 + {\hat{\tau } \over \delta } } \ln {LD^2 \over \epsilon } \cdot K m N \right) . \end{aligned}$$
    (5.5)

    Clearly, this estimate is much better than \(T_2\).

    Note that the Euclidean norm can be smoothed in many different ways. From the statistical point of view, it is sometimes reasonable to apply the Huber function, which is used in robust regression for better resistance to outliers:

    $$\begin{aligned} \rho _H(v,x)= & {} \chi _{\delta }(\Vert x - v \Vert _2), \\ \chi _{\delta }(\tau )= & {} \left\{ \begin{array}{rl} \tau - {1\over 2} \delta , &{} \text{ if }\ \tau >\delta , \\ {1 \over 2 \delta } \tau ^2, &{} \text{ otherwise, } \end{array} \right. \quad \tau \ge 0. \end{aligned}$$

    It has the same properties as the function \(\rho _2(v,\cdot )\), with the same estimate \(T_3\) for the computational time of one iteration of the process (3.6). A small implementation sketch of both smoothed distances and their gradients is given right after this list.
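
To make the formulas above concrete, here is a minimal NumPy sketch of the smoothed Euclidean distance \(\rho _2\) and the Huber-type distance \(\rho _H\), together with their gradients in x (cf. (5.4)); the function names are, of course, only illustrative.

import numpy as np

def rho_smooth(v, x, delta):
    # Smoothed Euclidean norm rho_2(v, x) = sqrt(delta^2 + ||x - v||_2^2)
    # and its gradient in x; the gradient norm never exceeds one and the
    # Hessian is bounded by I / delta, cf. (5.4).
    d = x - v
    r = np.sqrt(delta ** 2 + d @ d)
    return r, d / r

def rho_huber(v, x, delta):
    # Huber-type distance rho_H(v, x) = chi_delta(||x - v||_2) and its gradient:
    # quadratic for ||x - v|| <= delta, linear (slope one) beyond.
    d = x - v
    n = np.linalg.norm(d)
    if n <= delta:
        return n ** 2 / (2.0 * delta), d / delta
    return n - delta / 2.0, d / n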

Recall that in order to estimate the total time necessary for the procedure (3.6) to converge to an approximate solution of the problem

$$\begin{aligned} \min \limits _{X,P} \left\{ \Phi (X,P): \; X \in {\mathcal {V}}^{K}, \; P \in \Delta _{K}^N \right\} , \end{aligned}$$
(5.6)

we need to multiply the bounds \(T_i\) by an estimate of the total number of steps of the scheme. In view of Theorem 3, in order to get \(\Vert P_t - P_* \Vert _{{\mathbb {E}}_2} \le \epsilon _P\), we need at most

$$\begin{aligned} 1 + {1 \over \ln {1 \over \lambda }} \ln {N^{1/2} \over (1 - \lambda ) \epsilon _P} \end{aligned}$$
(5.7)

iterations of the process (3.6).

6 Simplified soft clustering

In the previous section, we derived complexity bounds for the iterative process (3.6), which has the interpretation of a sequential election procedure. However, the best complexity estimate (5.5), (5.7) is still quite heavy. One of the possibilities for improving the complexity bounds consists in considering direct methods for solving the minimization problem (5.6). Note that in its current form this problem is not simple. The reason is that the function \(h(\cdot )\) in (4.1) has unbounded derivatives on the feasible set. Therefore, in this section we study the possibility of replacing it by a squared Euclidean norm. Of course, in this case we lose the probabilistic interpretation of the soft assignment (3.1). However, from the viewpoint of clustering, the new potential is still useful for computing the attachment coefficients \(p_i^{(k)}(X)\).

Consider the following minimization problem:

$$\begin{aligned}&\min \limits _{X,P} \{ \; \Phi _2(X,P): \; X \in {\mathcal {V}}^{K}, \; P \in \Delta _{K}^N \},\nonumber \\&\Phi _2(X,P) \; = \; {1 \over \hat{\tau }} f(X) + {1 \over N}\langle G_1(X), P \rangle _{{\mathbb {E}}_2} + {\mu \over N} h(P), \end{aligned}$$
(6.1)

where \(f(X) = {1\over 2} \sum \nolimits _{k=1}^{K} \Vert x_k - c_k \Vert ^2_2, h(P) = {1\over 2} \sum \nolimits _{i=1}^N \Vert p_i \Vert ^2_2\), and

$$\begin{aligned} G_1(X)^{(k,i)} = \rho _2(v_i,x_k), \quad k = 1, \ldots , K, \; i = 1, \ldots , N. \end{aligned}$$

This potential has three positive parameters \(\hat{\tau }, \mu \) and \(\delta \) (see Item 2 above), which we assume to satisfy the following relations:

$$\begin{aligned} \hat{\tau }< \delta <\mu . \end{aligned}$$
(6.2)

Let us prove that under these conditions the potential \(\Phi _2(\cdot ,\cdot )\) is a strongly convex function with Lipschitz continuous gradient.

Let us fix a direction \(Z = (U,H) \in {\mathbb {E}}= {\mathbb {E}}_1 \times {\mathbb {E}}_2\) and choose

$$\begin{aligned} \Vert U \Vert ^2_{{\mathbb {E}}_1} = \sum \limits _{k=1}^{K} \Vert u_k \Vert ^2_2, \quad \Vert H \Vert ^2_{{\mathbb {E}}_2} = \sum \limits _{i=1}^N\Vert h_i \Vert ^2_2. \end{aligned}$$

We need to estimate from above and below the second derivative of the potential \(\Phi _2(X,P)\) along the direction Z. Note that

$$\begin{aligned}&\langle \nabla ^2\Phi (X,P) Z,Z \rangle _{{\mathbb {E}}}\\&\quad = {1 \over \hat{\tau }} \Vert U \Vert ^2_{{\mathbb {E}}_1} + {1 \over N} \sum \limits _{k=1}^{K} \sum \limits _{i=1}^N P^{(k,i)} \langle \nabla ^2_2 \rho _2(v_i,x_k) u_k, u_k \rangle \\&\qquad + {2 \over N} \sum \limits _{k=1}^{K} \sum \limits _{i=1}^N H^{(k,i)} \langle \nabla _2 \rho _2(v_i,x_k), u_k \rangle + {\mu \over N} \Vert H \Vert ^2_{{\mathbb {E}}_2}. \end{aligned}$$

Since \(P \ge 0\), we have

$$\begin{aligned}&\sum \limits _{k=1}^{K} \sum \limits _{i=1}^N P^{(k,i)} \langle \nabla ^2_2 \rho _2(v_i,x_k) u_k, u_k \rangle \\&{\mathop {\le }\limits ^{(5.4)}}\sum \limits _{k=1}^{K} \sum \limits _{i=1}^N {1 \over \delta } \Vert u_k \Vert ^2_2 \; = \; {N \over \delta } \Vert U \Vert ^2_{{\mathbb {E}}_1}. \end{aligned}$$

At the same time,

$$\begin{aligned}&\left| \sum \limits _{k=1}^{K} \sum \limits _{i=1}^N H^{(k,i)} \langle \nabla _2 \rho _2(v_i,x_k), u_k \rangle \right| \nonumber \\&\quad {\mathop {\le }\limits ^{(5.4)}} \sum \limits _{k=1}^{K} \sum \limits _{i=1}^N | H^{(k,i)} | \cdot \Vert u_k \Vert _2 \nonumber \\&\quad \le \left[ \sum \limits _{k=1}^{K} \sum \limits _{i=1}^N (H^{(k,i)})^2 \right] ^{1/2} \left[ \sum \limits _{k=1}^{K} \sum \limits _{i=1}^N \Vert u_k \Vert ^2_2 \right] ^{1/2}\nonumber \\&\quad = N^{1/2} \Vert U \Vert _{{\mathbb {E}}_1} \Vert H \Vert _{{\mathbb {E}}_2}. \end{aligned}$$
(6.3)

Thus,

$$\begin{aligned}&\langle \nabla ^2\Phi (X,P) Z,Z \rangle _{{\mathbb {E}}} \nonumber \\&\quad \le \left( {1 \over \hat{\tau }} +{1 \over \delta } \right) \Vert U \Vert ^2_{{\mathbb {E}}_1} + {2 \over N^{1/2}} \Vert U \Vert _{{\mathbb {E}}_1} \Vert H \Vert _{{\mathbb {E}}_2} + {\mu \over N} \Vert H \Vert ^2_{{\mathbb {E}}_2}\nonumber \\&\quad \le \left( {1 \over \hat{\tau }} + {2 \over \delta } \right) \Vert U \Vert ^2_{{\mathbb {E}}_1} + {\mu + \delta \over N} \Vert H \Vert ^2_{{\mathbb {E}}_2} \end{aligned}$$
(6.4)

On the other hand, since \(P \ge 0\) and function \(\rho _2(v_i, \cdot )\) is convex, we have

$$\begin{aligned}&\langle \nabla ^2\Phi (X,P) Z,Z \rangle _{{\mathbb {E}}} \\&\quad \ge {1 \over \hat{\tau }} \Vert U \Vert ^2_{{\mathbb {E}}_1} - {2 \over N^{1/2}} \Vert U \Vert _{{\mathbb {E}}_1} \Vert H \Vert _{{\mathbb {E}}_2} +{\mu \over N} \Vert H \Vert ^2_{{\mathbb {E}}_2} \\&\quad \ge \left( {1 \over \hat{\tau }} - {1 \over \delta } \right) \Vert U \Vert ^2_{{\mathbb {E}}_1} +{\mu - \delta \over N} \Vert H \Vert ^2_{{\mathbb {E}}_2}. \end{aligned}$$

In view of (6.2), we can choose in \({\mathbb {E}}\) the following Euclidean norm:

$$\begin{aligned}&\Vert Z \Vert ^2_{{\mathbb {E}}} = \left( {1 \over \hat{\tau }} - {1 \over \delta } \right) \Vert U \Vert ^2_{{\mathbb {E}}_1} + {\mu - \delta \over N} \Vert H \Vert ^2_{{\mathbb {E}}_2},\nonumber \\&\quad \quad Z=(U,H) \in {\mathbb {E}}= {\mathbb {E}}_1 \times {\mathbb {E}}_2. \end{aligned}$$
(6.5)

As we have seen, with respect to this norm, the potential \(\Phi _2\) is strongly convex on \({\mathbb {E}}\) with convexity parameter one. On the other hand, it has a Lipschitz continuous gradient with constant

$$\begin{aligned} L_2 = \max \left\{ {\delta + 2 \hat{\tau } \over \delta -\hat{\tau }}, {\mu + \delta \over \mu - \delta } \right\} . \end{aligned}$$
(6.6)

Consequently, the condition number of this function is \(\kappa _2 =L_2\).

If there is no other reason for choosing the parameter \(\delta \), satisfying condition (6.2), we can try to use this degree of freedom for making the value \(\kappa _2\) as small as possible. For that, it is convenient to introduce another representation of our parameters. Let us choose two factors \(\gamma \) and \(\gamma _1\) from (0, 1) and define

$$\begin{aligned} \delta = \gamma \mu , \quad \hat{\tau } \; = \; \gamma _1 \delta . \end{aligned}$$
(6.7)

Then, \({\delta + 2 \hat{\tau } \over \delta - \hat{\tau }} = {\mu +\delta \over \mu - \delta }\) if and only if \({1 + 2 \gamma _1 \over 1 - \gamma _1} = {1+\gamma \over 1 - \gamma }\). Thus, we can choose \(\gamma _1 = {2 \gamma \over 3 - \gamma }\). Then,

$$\begin{aligned} \delta = \gamma \mu , \quad \hat{\tau } = {2 \gamma ^2 \mu \over 3 - \gamma }, \quad \kappa _2^* = {1 + \gamma \over 1 - \gamma }. \end{aligned}$$
(6.8)

Thus, under the choice of parameters (6.8), problem (6.1) is very easy for the gradient methods. It can be solved by the usual gradient method up to accuracy \(\hat{\epsilon }\) in the function value in

$$\begin{aligned} O\left( \kappa _2^* \ln {\kappa _2^* \delta _f \over \hat{\epsilon }} \right) \end{aligned}$$

iterations, where \(\delta _f\) is the initial residual in function value. If we apply the fast gradient methods, the efficiency estimate is even better:

$$\begin{aligned} O\left( [\kappa _2^*]^{1/2} \ln {\kappa _2^* \delta _f \over \hat{\epsilon }} \right) \end{aligned}$$

(see Section 2 in Nesterov 2004). Note that the cost of each iteration of such schemes is O(KmN) operations. Thus, we indeed get very efficient clustering procedures.
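
As an illustration of this scheme, the following sketch combines the parameter choice (6.8) with a plain projected gradient method for problem (6.1), assuming \({\mathcal {V}} = {\mathbb {R}}^m\) (so that only the columns of P have to be projected onto \(\Delta _{K}\)). The step size below is a crude heuristic rather than the theoretical value dictated by (6.6), and project_simplex is the routine sketched after (6.9) below; all names are illustrative.

import numpy as np

def choose_parameters(mu, gamma):
    # Parameter choice (6.8): given mu and a factor gamma in (0, 1), return
    # delta, tau_hat and the resulting condition number kappa_2^*.
    delta = gamma * mu
    tau_hat = 2.0 * gamma ** 2 * mu / (3.0 - gamma)
    kappa = (1.0 + gamma) / (1.0 - gamma)
    return delta, tau_hat, kappa

def grad_Phi2(X, P, V, C, mu, tau_hat, delta):
    # Gradient of the potential Phi_2 in (6.1) with the smoothed distance rho_2.
    # X: (K, m) positions, P: (K, N) membership matrix, V: (N, m), C: (K, m).
    N = V.shape[0]
    D = X[:, None, :] - V[None, :, :]                      # (K, N, m)
    R = np.sqrt(delta ** 2 + (D ** 2).sum(axis=2))         # G_1(X), shape (K, N)
    grad_X = (X - C) / tau_hat + (P[:, :, None] * D / R[:, :, None]).sum(axis=1) / N
    grad_P = R / N + mu * P / N
    return grad_X, grad_P

def clustering_by_gradient(V, C, mu, gamma=0.5, n_iters=500, step=None):
    # Schematic projected gradient method for (6.1): gradient step in (X, P),
    # then column-wise projection of P onto the standard simplex.
    delta, tau_hat, kappa = choose_parameters(mu, gamma)
    K, N = C.shape[0], V.shape[0]
    X = C.copy()
    P = np.full((K, N), 1.0 / K)
    if step is None:
        step = tau_hat / kappa          # crude heuristic step size (tuning knob)
    for _ in range(n_iters):
        gX, gP = grad_Phi2(X, P, V, C, mu, tau_hat, delta)
        X = X - step * gX
        P = np.apply_along_axis(project_simplex, 0, P - step * gP)
    return X, P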

Let us look at the rule for forming the soft membership vectors \(p_i(X) \in \Delta _{K}, i = 1, \ldots , N\), imposed by the potential \(\Phi _2\). For that, given the position matrix X, we need to minimize the potential \(\Phi _2(X,\cdot )\):

$$\begin{aligned} \min \limits _{P \in \Delta _{K}^N} \left\{ {1 \over N} \langle G_1(X), P\rangle _{{\mathbb {E}}_2} + {\mu \over 2N} \Vert P \Vert ^2_{{\mathbb {E}}_2} \right\} . \end{aligned}$$

This means that for each element \(i, 1 \le i \le N\), we need to solve the problem

$$\begin{aligned} \min \limits _{p_i \in \Delta _{K}} \left\{ \langle g_i(X), p_i \rangle _{{\mathbb {R}}^{K}} + {1\over 2} \mu \Vert p_i \Vert ^2_{2} \right\} , \end{aligned}$$
(6.9)

where \(g_i^{(k)}(X) = \rho _2(v_i,x_k), k = 1, \ldots , K\). Note that the objective function of the latter problem can be written as follows:

$$\begin{aligned} {\mu \over 2} \left\| p_i + {1 \over \mu } g_i(X) \right\| ^2_2 - {1 \over 2 \mu } \Vert g_i(X) \Vert ^2_2. \end{aligned}$$

Thus, the solution of this problem is a Euclidean projection of the point

$$\begin{aligned} \tilde{p}_i = -{1 \over \mu } g_i(X) \end{aligned}$$

onto the standard simplex \(\Delta _{K}\). This point can be found in \(O(K \ln K)\) operations, since we need to sort the entries of the vector \(\tilde{p}_i\). If K is not big, the complexity \(O(K^2)\) of a straightforward ordering is also acceptable. Note that for \(\mu \rightarrow 0\) the rule (6.9) converges to the deterministic choice of the entry of the vector \(g_i(X)\in {\mathbb {R}}^{K}\) with the minimal value.
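
For completeness, here is a minimal NumPy sketch of this projection and of the resulting assignment rule (6.9), using the classical sorting-based procedure; the helper names are illustrative.

import numpy as np

def project_simplex(y):
    # Euclidean projection of y in R^K onto the standard simplex Delta_K,
    # via the classical sorting procedure in O(K log K) operations.
    K = y.shape[0]
    u = np.sort(y)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, K + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(y + theta, 0.0)

def soft_membership(g_i, mu):
    # Rule (6.9): the membership vector is the Euclidean projection of
    # -g_i(X)/mu onto the simplex; g_i contains the distances rho_2(v_i, x_k).
    return project_simplex(-np.asarray(g_i, dtype=float) / mu)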