Soft clustering by convex electoral model

In this paper, we suggest a new technique for soft clustering of multidimensional data. It is based on a new convex voting model, where each voter chooses a party with a certain probability depending on the divergence between his/her preferences and the position of the party. The parties can react to the results of polls by changing their positions. We prove that under some natural assumptions this system has a unique fixed point, providing a unique solution for soft clustering. The solution of our model can be found either by simulation of the sequential elections, or by direct minimization of a convex potential function. In both cases, the methods converge linearly to the solution. We provide our methods with worst-case complexity bounds. To the best of our knowledge, these are the first polynomial-time complexity results in this field.


Motivation
Cluster analysis is one of the most important and natural tools in Data Mining. It was addressed already at the first steps of computer science; it is enough to mention the first publications of the middle of the twentieth century (see Forgy 1965; Lloyd 1957; MacQueen 1967; Steinhaus 1957). The first proposed algorithms had a combinatorial nature. Given the set of observations v_i ∈ R^m, i = 1, …, N, and a predefined number of clusters K, we can define the total variance of a partition:

Var(S) = Σ_{k=1}^K Σ_{i∈S_k} ‖v_i − c_k(S)‖₂²,  (1.1)

where S = {S_1, …, S_K} is a partition of the set {1, …, N}, and the center of the kth cluster is given by a least-squares estimate:

c_k(S) = (1/|S_k|) Σ_{i∈S_k} v_i.  (1.2)

Thus, it looks natural to find the best clustering by solving the following problem:

min_S Var(S).  (1.3)

The first combinatorial greedy method for minimizing the function Var(·) was proposed by Lloyd (1957) and reinvented by Forgy (1965). The algorithm stops when further improvement of the objective function is no longer possible. This approach is usually referred to as hard K-means clustering.
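The greedy alternation of hard K-means described above can be sketched as follows; this is a minimal illustration with random initial centers, not the original authors' code:

```python
import numpy as np

def lloyd_kmeans(V, K, n_iter=100, seed=0):
    """Hard K-means: alternate nearest-center assignment and least-squares centering."""
    rng = np.random.default_rng(seed)
    centers = V[rng.choice(V.shape[0], size=K, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign each observation to the nearest center (hard clustering).
        dists = np.linalg.norm(V[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the least-squares estimate (cluster mean).
        new_centers = np.array([
            V[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(K)
        ])
        if np.allclose(new_centers, centers):
            break  # no further improvement of Var(S) is possible
        centers = new_centers
    return labels, centers
```

Each sweep cannot increase the total variance, so the method stops at a local minimum of the combinatorial problem; it does not certify global optimality.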
However, it was soon discovered in Garey et al. (1982) that problem (1.3) is NP-hard even for K = 2 (see also Aloise et al. 2009; Dasgupta and Freund 2009; Kleinberg et al. 1998). In order to get rid of the combinatorial nature of problem (1.3), it was suggested in Dunn (1973) and Bezdek (1981) to use fuzzy (or soft) clustering. In this approach, the positions of the centers C = (c_1, …, c_K) of the clusters become variables. At the same time, each element v_i participates in cluster k up to a certain degree p_i^{(k)}. The smaller this membership value is, the lower should be the probability that item i is attributed to cluster k. Usually, the dependence of these values on the positions of the cluster centers is given by the following expressions:

p_i^{(k)}(C) = [ Σ_{l=1}^K ( ‖v_i − c_k‖₂ / ‖v_i − c_l‖₂ )^{2/(r−1)} ]^{−1},  k = 1, …, K,  i = 1, …, N,

where the parameter r > 1 is called the fuzzifier of the system. In the absence of reliable information on its reasonable value, the usual choice is r = 2. Thus, in order to compute the so-called C-means soft clustering, we need to solve the following nonlinear minimization problem:

min_{C∈R^{m×K}} Var_p(C) = Σ_{i=1}^N Σ_{k=1}^K (p_i^{(k)}(C))^r ‖v_i − c_k‖₂².  (1.4)

For r > 1, this can be done by the standard algorithms of general nonlinear optimization. However, note that the function Var_p(·) can have a very complicated topological structure. So, the only possible theoretical guarantee for these algorithms is convergence to local minima, which are most probably not unique. At this moment, there exist hundreds of different algorithms for computing hard and soft clusterings (see, for example, the surveys Bora and Gupta 2014; Peters et al. 2013; Xu and Wunsch 2005; Yang 1993). However, to the best of our knowledge, all of them are heuristic. None of them is supported by rigorous theoretical statements on its efficiency and the quality of the generated solutions. Such a situation does not look very surprising since both variants of the clustering problems (1.3) and (1.4) look computationally difficult in view of their nonconvexity.
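The fuzzy membership rule above can be sketched as follows for fixed centers; the helper name is hypothetical, and a small `eps` guards against division by zero when an observation coincides with a center:

```python
import numpy as np

def cmeans_memberships(V, C, r=2.0, eps=1e-12):
    """Fuzzy C-means membership degrees p_i^(k) for fixed centers C (fuzzifier r > 1)."""
    # dist[i, k] = ||v_i - c_k||
    dist = np.linalg.norm(V[:, None, :] - C[None, :, :], axis=2) + eps
    # p_i^(k) = 1 / sum_l (||v_i - c_k|| / ||v_i - c_l||)^(2/(r-1))
    ratio = (dist[:, :, None] / dist[:, None, :]) ** (2.0 / (r - 1.0))
    return 1.0 / ratio.sum(axis=2)  # rows sum to one
```

For r = 2 and an observation equidistant from two centers, both memberships equal 1/2, as expected.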
The main goal of this paper is to show that a small change in the problem setting makes our goal computationally tractable. In order to explain the origin of our approach, let us try to find some successful examples of clustering in big real-life systems. 1 For our purposes, the most interesting one could be a proper understanding of the electoral procedures in modern democratic states.
Since the political parties must reflect the aggregate preferences of a big group of voters, this is indeed a good example of clustering in real life, which has already proved its efficiency by two hundred years of elections in the USA. Note that this procedure has an enormous stabilizing effect, which allows the country to recover quickly after external and internal shocks and to keep functioning in accordance with the interests and experience of the society. 2 In Operations Research, the behavior of a stable system is usually explained by the existence of a convex potential function, which must be minimized by the natural evolution of the system. Thus, one of the main goals of this paper is the construction of a convex model of sequential elections, which can help in soft clustering of voters, taking into account their internal preferences. In our model, a voter is characterized by a set of opinions (features), which can be compared with the positions of parties. The opinions of voters are fixed. However, the positions of parties may be changed in accordance with the observed opinions of the attracted voters. At the same time, a party cannot go too far from its basic declarations (core values). For each voter, we compute the probabilities of selecting each party. These probabilities are adjusted after each round of consecutive elections. Thus, in our approach the probabilities serve as a computational tool for approaching a good clustering.
Our main result is that under some natural behavioral assumptions, the electoral system is stable. This means that the sequential election procedure converges to a unique fixed point. Moreover, this convergence is fast (linear). Thus, we get a computationally efficient algorithm for computing soft clusters. It has the form of an alternating minimization method for a convex potential function. From the technical point of view, the main novelty of our approach consists in a complete elimination of the least-squares principle. Instead, voter i selects party k with the logit probability

p_i^{(k)} = exp(−‖x_k − v_i‖/μ) / Σ_{l=1}^K exp(−‖x_l − v_i‖/μ),  (1.5)

where μ > 0 is a volatility coefficient. Note that the norms in this expression are not squared.

Contents
The paper is organized as follows: In Sect. 2, we develop all necessary mathematical tools for analyzing the global convergence of an alternating minimization scheme as applied to a bivariate potential function with a multiplicative interaction term. We prove sufficient conditions for its convexity and establish a linear rate of convergence for alternating minimization.
1 It is not the first time that nature helps in finding an efficient optimization algorithm. Maybe the most striking example is the derivation of Dijkstra's shortest-path algorithm from Fermat's least-time principle of propagation of light. However, we cannot discuss this interesting topic in detail here.
In Sect. 3, we describe our electoral model and introduce the main behavioral assumptions for the voters and the political parties. We define stable systems as those for which the sequential election procedure has a unique fixed point. In Sect. 4, we justify the conditions for stability, using the bivariate potential function studied in Sect. 2. It is important that for measuring the divergence between the opinions of voters and the positions of the parties we can use an arbitrary Lipschitz continuous function. In any case, the rate of convergence of the whole process is linear.
In Sect. 5, we rewrite the stability condition of the system in terms of the individual tolerance of the party and discuss the computational cost of the proposed model. In the simplest case of measuring the divergence in the opinions by a usual norm, this cost is proportional to O(1/ε²), where ε is the desired accuracy of the computed solution. However, we can use smooth approximations of the norms, or distances based on the Huber function. In this case, the computational cost of our model is reduced: it becomes proportional to ln(1/ε), multiplied by some factors depending on the parameters of our model.
In Sect. 6, we analyze a clustering scheme based on a direct minimization of the bivariate potential. For doing that, we replace the probabilistic assignment rule (1.5) by a Euclidean projection onto the standard simplex. As a consequence, we can apply to our model the standard gradient methods of smooth convex optimization with linear rate of convergence. Consequently, we get a worst-case estimate for the total computational expenses of the order of O(KmN ln(1/ε)) arithmetic operations. This is the fastest clustering scheme in our paper. Recall that, to the best of our knowledge, we present in this paper the first complexity results in this field.

Notation and generalities
In this paper, we denote by E (with an optional subscript) a finite-dimensional real vector space, and by E* the corresponding dual space. For a linear function s ∈ E*, we denote its value at x ∈ E by ⟨s, x⟩_E. If no ambiguity arises, the subscript is omitted. For any norm ‖·‖_E in E, we define the dual norm in the standard way:

‖s‖*_E = max_x { ⟨s, x⟩_E : ‖x‖_E ≤ 1 },  s ∈ E*.

The notation R^n_+ is used for the positive orthant:

R^n_+ = { x ∈ R^n : x^{(j)} ≥ 0, j = 1, …, n },

and by Δ_n we denote the standard simplex:

Δ_n = { x ∈ R^n_+ : Σ_{j=1}^n x^{(j)} = 1 }.

The standard notation is used for ℓ_p-norms:

‖x‖_p = ( Σ_{j=1}^n |x^{(j)}|^p )^{1/p},  x ∈ R^n,  p ≥ 1,  with ‖x‖_∞ = max_{1≤j≤n} |x^{(j)}|.

For a function f(x), x ∈ E, we denote by ∇f(x) ∈ E* its gradient at x. If f is a non-differentiable convex function, the same notation is used for its subgradient. If a function f(·, ·) has two arguments, the notation ∇_2 f(x, y) corresponds to its gradient with respect to the second variable. Finally, for a twice differentiable function f, we denote by ∇²f(x) its Hessian at x. For x fixed, this is a linear operator from E to E*.
Recall that a function f is called strongly convex on a convex set Q ⊆ E with convexity parameter σ > 0 if for all x, y ∈ Q and α ∈ [0, 1] we have

f(αx + (1−α)y) ≤ αf(x) + (1−α)f(y) − (σ/2) α(1−α) ‖x − y‖_E².  (1.8)
The most important property of a strongly convex function is that it attains a unique minimum on any closed convex set Q. Therefore, if x_* = argmin_{x∈Q} f(x), then for any x ∈ Q and α ∈ (0, 1) we have

f(x_*) ≤ f(αx_* + (1−α)x) ≤ αf(x_*) + (1−α)f(x) − (σ/2) α(1−α) ‖x − x_*‖_E².

Dividing the difference of the right- and left-hand sides of this inequality by (1−α) and tending α to one, we get

f(x) ≥ f(x_*) + (σ/2) ‖x − x_*‖_E².  (1.9)


Bivariate potential with multiplicative interaction cost

Let E, E_1, and E_2 be three finite-dimensional real vector spaces. Consider the following function of two variables:

U(x, p) = f(x) + g(x, p) + h(p),  x ∈ E_1,  p ∈ E_2,  (2.1)

where the functions f and h are closed and convex on their domains. Our main structural assumption on the function U is that the interaction term g has a multiplicative form:

g(x, p) = ⟨G_1(x), G_2(p)⟩_E,  (2.2)

where the operators G_1 and G_2 map E_1 and E_2 into a pair of dual spaces. In Sect. 4, we will see an application example for a potential having exactly this structure.
The main goal of this section consists in developing natural conditions for convexity of function U and analyzing the performance of alternating minimization approach for solving the corresponding optimization problems.
Denote by Q_1 and Q_2 the closed convex feasible sets in the spaces E_1 and E_2, respectively, and let Q = Q_1 × Q_2.

Assumption 1 For any x ∈ Q_1, the function g(x, ·) = ⟨G_1(x), G_2(·)⟩ is closed and convex on Q_2, and for any p ∈ Q_2, the function g(·, p) = ⟨G_1(·), G_2(p)⟩ is closed and convex on Q_1.
In the remaining part of this section, we always assume that Assumption 1 is satisfied.
Example 1 Assumption 1 is valid, for example, for linear operators G_1(x) = A_1 x and G_2(p) = A_2 p. Another important example is g(x, p) = ⟨G_1(x), p⟩, where p ∈ R^n_+ and all components of the vector function G_1(x) are convex.
Assumption 1 has the following important consequence.
Lemma 1 For any two points z_0 = (x_0, p_0) and z_1 = (x_1, p_1) from Q and any α ∈ [0, 1], inequality (2.3) holds.

Proof By Assumption 1, the function g is convex in its first argument along the segment [x_0, x_1]. On the other hand, by the same reason, the symmetric relation is valid for the second argument. Putting all these inequalities together, we get (2.3). □

Now we can justify a sufficient condition for convexity of the potential U(·, ·).
Theorem 1 Let the function f be strongly convex on Q_1 with parameter σ_1, and let the function h be strongly convex on Q_2 with parameter σ_2. Assume that the operators G_1(·) and G_2(·) are Lipschitz continuous:

‖G_1(x_1) − G_1(x_2)‖ ≤ L_1 ‖x_1 − x_2‖,  ‖G_2(p_1) − G_2(p_2)‖ ≤ L_2 ‖p_1 − p_2‖.  (2.4)

If the constants L_1 and L_2 are small enough, namely, if

L_1² L_2² ≤ σ_1 σ_2,  (2.5)

then the potential U is convex on Q. Moreover, an example with L_1² L_2² ≥ (σ_1 − ε)(σ_2 − ε) for some small ε ≥ 0 shows that condition (2.5) cannot be improved. In this example, L_1 = ‖A‖ and L_2 = 1.

Consider now the optimization problem

min_{(x,p)∈Q} U(x, p),  (2.6)

where the parameters of the function U satisfy a strict variant of condition (2.5). In this case, the function U is strongly convex. Consequently, there exists a unique solution of problem (2.6). Let us show that it can be found by the following alternating minimization scheme.

Initialization: Choose x_0 ∈ Q_1.

Iteration (t ≥ 0): p_{t+1} = argmin_{p∈Q_2} U(x_t, p),  x_{t+1} = argmin_{x∈Q_1} U(x, p_{t+1}).  (2.7)

In order to analyze the convergence of this scheme, let us define the following operators:

u(x) = argmin_{p∈Q_2} U(x, p) ∈ Q_2,  v(p) = argmin_{x∈Q_1} U(x, p) ∈ Q_1.

In view of Assumption 1, the objective function in both optimization problems above is strongly convex. Hence, the points u(·) and v(·) are well defined.
Lemma 2 Let the conditions of Theorem 1 be satisfied. Then, for any x_1, x_2 ∈ Q_1 we have

‖u(x_1) − u(x_2)‖ ≤ (L_1 L_2 / σ_2) ‖x_1 − x_2‖.  (2.8)

Similarly, for any p_1 and p_2 from Q_2 we have

‖v(p_1) − v(p_2)‖ ≤ (L_1 L_2 / σ_1) ‖p_1 − p_2‖.  (2.9)

Proof Indeed, the point u(x) is the solution of the problem min_{p∈Q_2} { ⟨G_1(x), G_2(p)⟩ + h(p) } with a strongly convex objective function. Therefore, in view of property (1.9), we have

⟨G_1(x_1), G_2(u(x_2))⟩ + h(u(x_2)) ≥ ⟨G_1(x_1), G_2(u(x_1))⟩ + h(u(x_1)) + (σ_2/2) ‖u(x_1) − u(x_2)‖²,

⟨G_1(x_2), G_2(u(x_1))⟩ + h(u(x_1)) ≥ ⟨G_1(x_2), G_2(u(x_2))⟩ + h(u(x_2)) + (σ_2/2) ‖u(x_1) − u(x_2)‖².

Adding these two inequalities, we get

σ_2 ‖u(x_1) − u(x_2)‖² ≤ ⟨G_1(x_1) − G_1(x_2), G_2(u(x_2)) − G_2(u(x_1))⟩.

Thus, in view of inequality (2.4), we get (2.8). Inequality (2.9) follows by the same argument. □

The following theorem is a trivial consequence of Lemma 2.

Theorem 2 Define T(x) = v(u(x)), S(p) = u(v(p)), and λ = L_1² L_2² / (σ_1 σ_2) < 1. Then T(·) and S(·) are contraction mappings:

‖T(x_1) − T(x_2)‖ ≤ λ ‖x_1 − x_2‖,  ‖S(p_1) − S(p_2)‖ ≤ λ ‖p_1 − p_2‖.  (2.10)

Note that in terms of the operators T and S, the process (2.7) can be written in the following way:

x_{t+1} = T(x_t),  p_{t+1} = S(p_t),  t ≥ 1.  (2.11)

Therefore, for any t ≥ 0, we have

‖x_t − x_*‖ ≤ λ^t ‖x_0 − x_*‖,

where (x_*, p_*) is the solution of problem (2.6). A similar rate of convergence can be established for the sequence {p_t}_{t≥0}. We conclude that process (2.7) converges linearly to the solution of problem (2.6). Note that from the viewpoint of standard optimization theory, problem (2.6) has a possibly non-differentiable objective function with unbounded derivatives (this is allowed by Assumption 1). Its only favorable property is the strong convexity of the objective. However, this alone is not enough for justifying a linear rate of convergence of any standard optimization scheme.
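The contraction mechanism of Theorem 2 can be illustrated on a scalar toy potential with hypothetical parameters satisfying the strict version of (2.5); both inner minimizations have closed forms here, and each full sweep contracts the error exactly by the factor λ:

```python
import numpy as np

# Toy potential U(x, p) = (s1/2) x^2 + L*x*p + (s2/2) p^2 with hypothetical
# parameters satisfying the strict version of (2.5): L^2 < s1 * s2.
s1, s2, L = 2.0, 2.0, 1.0
lam = L**2 / (s1 * s2)       # contraction factor of T(x) = v(u(x))

x = 1.0
errs = []
for t in range(20):
    p = -L * x / s2          # u(x): exact minimization of U(x, .) in p
    x = -L * p / s1          # v(p): exact minimization of U(., p) in x
    errs.append(abs(x))      # the unique solution is x_* = 0

# Linear convergence: the per-sweep error ratios all equal lam.
rates = [errs[t + 1] / errs[t] for t in range(5)]
```

For these parameters λ = 1/4, so ten sweeps already reduce the initial error by about six orders of magnitude.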
In the next section, we consider an application example, where the equilibrium state of the system can be found by minimizing bivariate potential with a multiplicative interaction term.

Electoral model
Let us show how we can use the machinery developed in Sect. 2 for justifying a soft clustering model based on a stable electoral procedure. The basic elements of our model are interpreted as independent voters possessing some features (opinions). They must be attached to different clusters (political parties), which represent these features in the best possible way.
In our model, we have N independent voters and K political parties. Our main assumption is that the voting results are random: voter i decides to vote for party k with probability p_i^{(k)}. It is convenient to put these probabilities in a vector

p_i = (p_i^{(1)}, …, p_i^{(K)})^T ∈ Δ_K,  i = 1, …, N.

These vectors can be unified in a matrix P = (p_1, …, p_N) ∈ R^{K×N}, which we call the voting matrix.
At the beginning of the voting process, this matrix is unknown. However, it will be computed as an outcome of a sequence of consecutive elections. Let us try to explain the results of the elections by some quantitative parameters. We assume that the opinion of voter i can be described by m different real values (personal preferences), which we put in a vector v_i ∈ V ⊆ R^m. These vectors are fixed during the whole history of consecutive elections.
At the same time, positions x k 2 V of political parties are flexible, k ¼ 1; . . .; K. After each round of elections, these values can be adjusted for better representing the positions of the voters closely attached to this party. It will be convenient to keep these vectors in a matrix X ¼ ðx 1 ; . . .; x K Þ 2 R mÂK .
Let us fix some distance function ρ(v, x) ≥ 0, which is used for measuring the divergence between the opinion v ∈ V of a voter and the current position x ∈ V of a political party. In what follows, we always assume that for any fixed v the function ρ(v, ·) is convex. A natural choice for this function would be ρ(v, x) = ‖x − v‖, where ‖·‖ is an arbitrary norm in R^m. However, in Sect. 5, we will give more examples with some motivation for their use. 3 Clearly, the bigger the distance between v and x is, the smaller should be the probability that this party is selected by this particular voter. In our electoral model, this decision is governed by the discrete choice probabilities of the logit model (e.g., Anderson et al. 1992).
Assumption 2 Voter i selects the kth party with probability

p_i^{(k)}(X) = exp(−ρ(v_i, x_k)/μ) / Σ_{l=1}^K exp(−ρ(v_i, x_l)/μ),  k = 1, …, K,  (3.1)

where μ ≥ 0 is the flexibility parameter, which represents the volatility of the opinions of voters.
In Assumption 2, value l ¼ 0 corresponds to the deterministic choice: the voter always chooses a party, which is the closest to his/her opinion. However, usually this parameter is strictly positive.
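The choice rule (3.1) is a standard softmax over negated distances; a minimal sketch is given below, with a hypothetical data layout (voters as rows of V, parties as columns of X) and a row-wise shift for numerical stability:

```python
import numpy as np

def voting_probabilities(V, X, mu=0.5):
    """Logit choice rule (3.1): p_i^(k) proportional to exp(-rho(v_i, x_k)/mu),
    with rho the plain (unsquared) Euclidean distance."""
    # rho[i, k] = ||v_i - x_k||_2; voters are rows of V, parties are columns of X
    rho = np.linalg.norm(V[:, None, :] - X.T[None, :, :], axis=2)
    # subtract the row-wise minimum for numerical stability of the softmax
    Z = -(rho - rho.min(axis=1, keepdims=True)) / mu
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)  # each row lies in the simplex
```

As μ decreases toward zero, each row of the output concentrates on the nearest party, recovering the deterministic choice.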
It is important that the probability vector p_i(X) has an optimization interpretation. Consider the entropy function

η(p) = Σ_{k=1}^K p^{(k)} ln p^{(k)},  p ∈ Δ_K.

It is strongly convex on Δ_K in ℓ_1-norm with convexity parameter one, and the vector p_i(X) defined by (3.1) is the unique solution of the problem

min_{p∈Δ_K} { Σ_{k=1}^K p^{(k)} ρ(v_i, x_k) + μ η(p) }.  (3.2)

It remains to describe the behavior of the political parties. Each party is able to modify its position in order to attract the maximal number of voters. However, it should not go too far from its core values, which we denote by c_k ∈ V, k = 1, …, K. In order to measure the distance between the current position of the party and its core value, we introduce a prox-function d(x, y). It must satisfy the following conditions:

• d(x, y) ≥ 0 for all x, y ∈ V.
• For each x ∈ V, the function d(x, ·) is strongly convex in the second argument with convexity parameter one.

Possible examples are the following:

• Entropy distance d_1(x, y) = Σ_j y^{(j)} ln (y^{(j)}/x^{(j)}). This function is strongly convex in y in ℓ_1-norm.
• Euclidean distance d_2(x, y) = ½ ‖x − y‖₂². This function is strongly convex in ℓ_2-norm.
• We can also take d̂_i(x, y) = d_i(x, y) + ε‖x − y‖_i with ε ≥ 0, i = 1, 2. The additional linear term gives a party more chances to keep its core values unchanged.
The behavior of political parties is described by the following assumption.

Assumption 3 For a given voting matrix P = (p_1, …, p_N) ∈ Δ_K^N, each political party k chooses its optimal current position by minimizing the function

ψ_k(P; x) = Σ_{i=1}^N p_i^{(k)} ρ(v_i, x) + (1/τ) d(c_k, x),  x ∈ V:  x_k^*(P) = argmin_{x∈V} ψ_k(P; x).  (3.5)

In this definition, τ > 0 is a group tolerance parameter. 4 The objective function (3.5) has the interpretation of the expected distance between the opinions of all attracted voters and the current position of the kth party, augmented by the discrepancy with its core values.
Consider now the following process of sequential elections:

Set X_0 = (c_1, …, c_K).  Repeat:  P_{t+1} = P_*(X_t),  X_{t+1} = X_*(P_{t+1}),  t ≥ 0,  (3.6)

where P_*(X) = (p_1(X), …, p_N(X)) and X_*(P) = (x_1^*(P), …, x_K^*(P)). The interpretation of process (3.6) is straightforward: given the current positions of the political parties X_t, the voters announce their preferences P_{t+1} during the electoral poll. After observing the results, the parties update their positions X_{t+1} for the next elections.
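Process (3.6) can be sketched as follows. Since the party response has no closed form for a non-smooth ρ, this illustration uses a smoothed Euclidean distance and a few gradient steps as a hypothetical inner solver; all parameter values are illustrative only, chosen so that the individual tolerance stays below the volatility:

```python
import numpy as np

def q2_grad(x, v, delta):
    """Gradient in x of the smoothed norm sqrt(||x - v||^2 + delta^2) - delta."""
    d = x - v
    return d / np.sqrt(d @ d + delta**2)

def elections(V, C, mu=1.0, tau_hat=0.2, delta=0.5, rounds=30, inner=50, lr=0.1):
    """Sketch of sequential elections (3.6): voters respond by the logit rule,
    parties respond by approximately solving their subproblem with gradient steps.
    Here parties are stored as rows of X (an implementation choice)."""
    X = C.copy().astype(float)   # X_0 = core values
    N = V.shape[0]
    P = None
    for _ in range(rounds):
        # Voters: logit probabilities with the smoothed distance.
        rho = np.sqrt(((V[:, None, :] - X[None, :, :])**2).sum(axis=2) + delta**2) - delta
        Z = -(rho - rho.min(axis=1, keepdims=True)) / mu
        P = np.exp(Z)
        P /= P.sum(axis=1, keepdims=True)
        # Parties: minimize (1/N) sum_i p_i^(k) rho(v_i, x) + (1/tau_hat) * 0.5*||x - c_k||^2.
        for k in range(C.shape[0]):
            x = X[k].copy()
            for _ in range(inner):
                g = sum(P[i, k] * q2_grad(x, V[i], delta) for i in range(N)) / N
                g += (x - C[k]) / tau_hat
                x -= lr * g
            X[k] = x
    return X, P
```

Because the voter-attraction gradient has norm at most one, a party's equilibrium position stays within a ball of radius proportional to its tolerance around its core value, in line with the stability discussion below.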
Definition 1 An electoral system is called stable if the process (3.6) has a unique limiting point, which is independent of the starting position X_0 ∈ V^K.
In the next section, we derive some sufficient conditions for electoral stability. Note that in our model we have two tolerance parameters, μ and τ.

Stable electoral systems
Let us describe the electoral process (3.6) using the framework of Sect. 2. Denote by E_1 = R^{m×K} the space of positions X = (x_1, …, x_K) of the parties. Then Q_1 = V^K. For the voting matrix P = (p_1, …, p_N), we introduce the space E_2 = R^{K×N} and the feasible set Q_2 = Δ_K^N. For the interaction term, we have E = E_2, G_2(P) ≡ P, and g(X, P) = ⟨G_1(X), P⟩_{E_2}, where the distance matrix G_1(X) = (g_1(X), …, g_N(X)), with g_i(X) = (ρ(v_i, x_1), …, ρ(v_i, x_K))^T, is component-wise convex in X. Since all matrices P ∈ Q_2 have nonnegative elements, the function g(·, ·) satisfies Assumption 1.
In view of the rules (3.2) and (3.5), the electoral process (3.6) can be seen as the alternating minimization scheme (2.7) applied to the following bivariate potential:

U(X, P) = (1/τ) Σ_{k=1}^K d(c_k, x_k) + ⟨G_1(X), P⟩ + μ Σ_{i=1}^N Σ_{k=1}^K p_i^{(k)} ln p_i^{(k)},  (4.1)

with f(X) = (1/τ) Σ_{k=1}^K d(c_k, x_k) and h(P) equal to the last term. Thus, the natural choice of the norm for E_1 is

‖U‖_{E_1} = ( Σ_{k=1}^K ‖u_k‖² )^{1/2},  U = (u_1, …, u_K) ∈ E_1.  (4.2)

In this case, since each function d(c_k, ·) has convexity parameter one, we can choose σ_1 = 1/τ. For P ∈ Q_2 and a direction H ∈ E_2, the strong convexity of the entropy in ℓ_1-norm gives a lower bound in terms of Σ_{i=1}^N ‖h_i‖₁². Therefore, the natural choice for the norm in E_2 is

‖H‖_{E_2} = ( Σ_{i=1}^N ‖h_i‖₁² )^{1/2},  H = (h_1, …, h_N) ∈ E_2.

In this case, the entropy part of h has convexity parameter one, and we can choose σ_2 = μ.
It remains to estimate the Lipschitz constants in inequality (2.4). In our case, L_2 = 1. For the norm ‖·‖_{E_2}, we get the following dual norm:

‖S‖*_{E_2} = ( Σ_{i=1}^N ‖s_i‖_∞² )^{1/2},  S = (s_1, …, s_N) ∈ R^{K×N}.

Now we can compute the Lipschitz constant of the mapping G_1(·). For that, we need one more assumption.
Assumption 4 For any v 2 V, function qðv; ÁÞ is Lipschitz continuous in its second argument with constant one.
Then, for X and Y from Q_1, note that

g_i(X) = (ρ(v_i, x_1), …, ρ(v_i, x_K)),  g_i(Y) = (ρ(v_i, y_1), …, ρ(v_i, y_K)).

Therefore, in view of Assumption 4,

‖g_i(X) − g_i(Y)‖_∞ = max_{1≤k≤K} |ρ(v_i, x_k) − ρ(v_i, y_k)| ≤ max_{1≤k≤K} ‖x_k − y_k‖ ≤ ‖X − Y‖_{E_1}.

Hence,

‖G_1(X) − G_1(Y)‖*_{E_2} ≤ N^{1/2} ‖X − Y‖_{E_1}.

Thus, the mapping G_1 is Lipschitz continuous with constant L_1 = N^{1/2}. Now, using Theorem 2, we come to the following statement.
Theorem 3 Let the behavior of the voters and the political parties satisfy Assumptions 2 and 3 with tolerance parameters satisfying

N < μ/τ.  (4.3)

Then the corresponding electoral system is stable. Moreover, for the stationary voting matrix P_* = lim_{t→∞} P_t, where the matrices {P_t} are generated by process (3.6), we have

‖P_t − P_*‖_{E_2} ≤ λ^t N^{1/2},  t ≥ 0.  (4.4)

Proof Note that condition (4.3) is a strict version of condition (2.5): λ = L_1² L_2² / (σ_1 σ_2) = Nτ/μ < 1. In terms of Theorem 2, we have P_{t+1} = S(P_t) and

‖S(P_1) − S(P_2)‖ ≤ λ ‖P_1 − P_2‖  for all P_1, P_2 ∈ Q_2.

Note that for any p_1, p_2 ∈ Δ_K we have ‖p_1 − p_2‖_1 ≤ 1. Therefore, for any P ∈ Q_2 it holds that ‖P − P_*‖ ≤ N^{1/2}. Thus, by Theorem 2, for all t ≥ 0 we get ‖P_t − P_*‖ ≤ λ^t N^{1/2}. □

Computational aspects of alternating minimization
Let us find an interpretation of the stability condition (4.3). For that, we introduce the individual tolerance parameter τ̂ = τN. Then the optimization problem (3.5), defining the response of a party to the voting probabilities, can be rewritten as

min_{x∈V} { (1/N) Σ_{i=1}^N p_i^{(k)} ρ(v_i, x) + (1/τ̂) d(c_k, x) }.  (5.1)

This means that each party reacts to the average opinion of the attracted voters. With this notation, the stability condition λ < 1 can be rewritten as

τ̂ < μ.  (5.2)

In view of its importance, we state this condition in the form of a theorem.

Theorem 4 An electoral system is stable if the individual tolerance of the parties is smaller than the volatility of the voters.

This condition looks indeed natural for guaranteeing the stability of an electoral system: the parties must be more conservative in changing their opinions than the voters are. In other words, excessive populism of the parties may cause political instability.
Let us discuss now the computational complexity of the electoral process. For simplicity, we assume that the set V is bounded: D := max_{x,y∈V} ‖x − y‖ < ∞. Each step of the recurrence (3.6) consists of two time-consuming operations.
• Computation of the optimal voting matrix P_*(X). For that, we first need to compute the distance matrix G_1(X). If the cost of computing the value ρ(v, x) is O(m) operations, then the distance matrix can be computed in O(KmN) operations. After that, we can apply rule (3.1), which is cheaper: O(KN) operations. Thus, in total we need T_1 = O(KmN) operations.
• Computation of the new positions of the parties X_*(P). For that, we need to solve K optimization problems (5.1). Note that just one computation of the objective functions for all these problems needs O(KmN) operations. Therefore, this step dominates the cost of our approach. The complexity of problem (5.1) depends on the efficiency of the applied optimization schemes, and their potential efficiency depends on the properties of the function ρ(v, ·). Let us discuss these aspects in more detail. Note that the objective function of this problem is strongly convex with parameter σ = 1/τ̂. In our approach, in view of Assumption 4, we cannot follow the standard advice to use

ρ(v, x) = ½ ‖x − v‖₂²,  d(c, x) = ½ ‖x − c‖₂².

Recall that this choice is mainly motivated by the extremely low cost of solving problem (5.1) (only O(mN) operations). However, we pay for that by non-convexity of the potential, the existence of many local minima and, most probably, NP-hardness of finding the global solution. Thus, we need to choose a Lipschitz continuous distance function. Let us look at the most reasonable variants.
1. Arbitrary norm: ρ_1(v, x) = ‖v − x‖. This function clearly satisfies Assumption 4. With this choice, problem (5.1) becomes a problem of convex optimization with a non-differentiable strongly convex objective function. From the general theory, we know that an ε-solution of this problem can be found by the subgradient method in

O( M² τ̂² / ε² )  (5.3)

iterations, where M is an upper bound for the norms of the subgradients of the function ψ̂_k(P; ·), and ε is the required accuracy of solving problem (5.1). Since the objective function of this problem has an implicit structure, we can apply the smoothing technique (see Nesterov 2005a, b) for finding an ε-solution of this problem in a number of iterations proportional to 1/ε^{1/2}. This bound is much better than (5.3). In order to estimate the total computational time, we need to multiply this bound by K, the number of parties, and by mN, the complexity of computing the value and a subgradient of the function ψ̂_k(P; ·). Thus, we get an estimate of the computational time of the form

T_2 = O( KmN / ε^{1/2} ),

up to factors depending on M, τ̂, and D. Note that this bound depends on two parameters, which can potentially be big. The accuracy ε must be chosen proportionally to the required accuracy of approximation of the matrix P_*. 5 Thus, the impact of the term 1/ε^{1/2} can be significant. The constant M can be estimated from above in terms of the diameter D; thus, M can be big if the diameter D is big.

2. Smoothed Euclidean norm:

ρ_2(v, x) = ( ‖v − x‖₂² + δ² )^{1/2} − δ,  δ > 0.  (5.4)

Thus, the function ρ_2(v, ·) satisfies Assumption 4, and it has a Lipschitz continuous gradient with constant 1/δ. Consequently, the function ψ̂_k(P; ·) has a Lipschitz continuous gradient with constant L = 1/δ + 1/τ̂. Since this function is strongly convex with parameter 1/τ̂, we conclude that its condition number is equal to

κ = L τ̂ = 1 + τ̂/δ.

Thus, problem (5.1) can be solved by the usual gradient method in O(κ ln (LD²/ε)) iterations. For the fast gradient methods, we get the bound of O(κ^{1/2} ln (LD²/ε)) iterations (see Sections 2.1 and 2.2 in Nesterov 2004). For the latter case, we get the following estimate of the computational time:

T_3 = O( KmN κ^{1/2} ln (LD²/ε) ).  (5.5)

Clearly, this estimate is much better than T_2. Note that the Euclidean norm can be smoothed in many different ways. From the statistical point of view, it is sometimes reasonable to apply the Huber function, which is used in robust regression for better resistance to outliers:

ρ_H(v, x) = ‖v − x‖₂² / (2δ)  if ‖v − x‖₂ ≤ δ,  and  ρ_H(v, x) = ‖v − x‖₂ − δ/2  otherwise.

It has the same properties as the function ρ_2(v, ·), with the same estimate T_3 for the computational time of one iteration of process (3.6).
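The two smoothed distances of this item can be sketched directly; the helper names below are illustrative:

```python
import numpy as np

def q2(v, x, delta):
    """Smoothed Euclidean norm: 1-Lipschitz in x, gradient Lipschitz with constant 1/delta."""
    d = x - v
    return np.sqrt(d @ d + delta**2) - delta

def huber(v, x, delta):
    """Huber distance: quadratic near v (smooth), linear far away (robust to outliers)."""
    r = np.linalg.norm(x - v)
    return r**2 / (2 * delta) if r <= delta else r - delta / 2
```

Both functions vanish at x = v, never exceed the plain Euclidean distance, and the Huber function matches its two branches continuously at the switch point ‖v − x‖₂ = δ, where both equal δ/2.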
Recall that for estimating the total time necessary for procedure (3.6) to converge to an approximate solution of the problem

min_{X,P} { U(X, P) : X ∈ V^K, P ∈ Δ_K^N },  (5.6)

we need to multiply the bounds T_i by an estimate of the total number of steps of the scheme. In view of Theorem 3, in order to get ‖P_t − P_*‖_{E_2} ≤ ε_P, we need at most

1 + (1 / ln(1/λ)) · ln( N^{1/2} / ((1 − λ) ε_P) )  (5.7)

iterations of process (3.6).

Simplified soft clustering
In the previous section, we derived complexity bounds for the iterative process (3.6), which has the interpretation of a sequential election procedure. However, the best complexity estimate (5.5), (5.7) is still quite heavy. One possibility for improving the complexity bounds consists in considering direct methods for solving the minimization problem (5.6). Note that in its current form this problem is not simple. The reason is that the function h(·) in (4.1) has unbounded derivatives on the feasible set. Therefore, in this section we study the possibility of replacing it by a squared Euclidean norm. Of course, in this case we lose the probabilistic interpretation of the soft assignment (3.1). However, from the viewpoint of clustering, the new potential is still useful for computing the attachment coefficients p_i^{(k)}(X). Consider the following minimization problem:

min_{X,P} { U_2(X, P) : X ∈ V^K, P ∈ Δ_K^N },  (6.1)

where

U_2(X, P) = (1/τ̂) f(X) + ⟨G_1(X), P⟩ + μ h(P),

with f(X) = ½ Σ_{k=1}^K ‖x_k − c_k‖₂², h(P) = ½ Σ_{i=1}^N ‖p_i‖₂², and 6

G_1(X)^{(k,i)} = ρ_2(v_i, x_k),  k = 1, …, K,  i = 1, …, N.

This potential has three positive parameters τ̂, μ and δ (see Item 2 above), which we assume to satisfy the following relations:

τ̂ < δ < μ.  (6.2)

Let us prove that under these conditions the potential U_2(·, ·) is a strongly convex function with Lipschitz continuous gradient. Let us fix a direction Z = (U, H) ∈ E = E_1 × E_2, and estimate from above and below the second derivative of the potential U_2(X, P) along the direction Z. Note that 7

⟨∇²U_2(X, P) Z, Z⟩ = (1/τ̂) ‖U‖² + Σ_{i=1}^N Σ_{k=1}^K p_i^{(k)} ⟨∇²ρ_2(v_i, x_k) u_k, u_k⟩ + 2 Σ_{i=1}^N Σ_{k=1}^K h_i^{(k)} ⟨∇ρ_2(v_i, x_k), u_k⟩ + μ ‖H‖².

Since P ≥ 0 and each function ρ_2(v_i, ·) is convex with (1/δ)-Lipschitz continuous gradient, we have

0 ≤ Σ_{i=1}^N Σ_{k=1}^K p_i^{(k)} ⟨∇²ρ_2(v_i, x_k) u_k, u_k⟩ ≤ (1/δ) Σ_{i=1}^N Σ_{k=1}^K p_i^{(k)} ‖u_k‖₂².  (6.3)

At the same time, since each ρ_2(v_i, ·) is 1-Lipschitz,

| Σ_{i=1}^N Σ_{k=1}^K h_i^{(k)} ⟨∇ρ_2(v_i, x_k), u_k⟩ | ≤ Σ_{i=1}^N Σ_{k=1}^K |h_i^{(k)}| ‖u_k‖₂.  (6.4)

In view of (6.2), we can choose in E a suitable Euclidean norm (6.5); with respect to this norm, the potential U_2 is strongly convex on E with convexity parameter one. On the other hand, it has a Lipschitz continuous gradient with constant (6.6).

6 All our further conclusions are also valid for the Huber distance function ρ_H(·, ·).
7 For the interaction term of the potential U_2, we use the standard differentiation rule (g(x)p)″ = g″(x)(x′)² p + 2 g′(x) x′ p′ + g(x) p″, taking into account that x′ = u, p′ = h and x″ = 0, p″ = 0.

Consequently, the condition number κ_2 of this function is given by the ratio of the Lipschitz constant (6.6) to the convexity parameter one. If there is no other reason for choosing the parameter δ satisfying condition (6.2), we can try to use this degree of freedom for making the value κ_2 as small as possible. For that, it is convenient to introduce another representation of our parameters: we choose two factors γ and γ_1 from (0, 1), define the parameters τ̂ and δ through these factors, and minimize κ_2 with respect to them. With such a choice, problem (6.1) can be solved by the standard gradient methods of smooth strongly convex optimization with linear rate of convergence (see Nesterov 2004). Note that the cost of each iteration of such schemes is O(KmN) operations. Thus, we get indeed a very efficient clustering procedure.

Let us look now at the rule of forming the soft membership vectors p_i(X) ∈ Δ_K, i = 1, …, N, imposed by the potential U_2. Given the position matrix X, we need to minimize the potential U_2(X, ·) in P. This means that for each element i, 1 ≤ i ≤ N, we need to solve the problem

min_{p_i∈Δ_K} { ⟨g_i(X), p_i⟩ + (μ/2) ‖p_i‖₂² },

where g_i^{(k)}(X) = ρ_2(v_i, x_k), k = 1, …, K. Note that the objective function of the latter problem can be written as

(μ/2) ‖p_i + (1/μ) g_i(X)‖₂² − (1/(2μ)) ‖g_i(X)‖₂².

Thus, the solution of this problem is the Euclidean projection of the point

p̄_i = −(1/μ) g_i(X)  (6.9)

onto the standard simplex Δ_K. This point can be found in O(K ln K) operations, since we need to sort the entries of the vector p̄_i. If K is not big, the complexity O(K²) of a straightforward sorting is also acceptable. Note that for μ → 0, the rule (6.9) converges to the deterministic choice of the entry of the vector g_i(X) ∈ R^K with the minimal value.
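The sort-based Euclidean projection onto the simplex mentioned above can be sketched as follows; this is the standard O(K ln K) procedure, not taken from the paper:

```python
import numpy as np

def project_simplex(y):
    """Euclidean projection of y onto the standard simplex (sort-based, O(K log K))."""
    u = np.sort(y)[::-1]              # entries in decreasing order
    css = np.cumsum(u) - 1.0          # shifted cumulative sums
    ks = np.arange(1, len(y) + 1)
    # rho = largest index k with u_k > (cumsum_k - 1)/k
    rho = ks[u - css / ks > 0][-1]
    theta = css[rho - 1] / rho        # optimal shift
    return np.maximum(y - theta, 0.0)
```

Points already in the simplex are returned unchanged, and any input is mapped to a nonnegative vector whose entries sum to one.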
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.