A proximal subgradient algorithm with extrapolation for structured nonconvex nonsmooth problems

In this paper, we consider a class of structured nonconvex nonsmooth optimization problems, in which the objective function is formed by the sum of a possibly nonsmooth nonconvex function and a differentiable function with Lipschitz continuous gradient, subtracted by a weakly convex function. This general framework allows us to tackle problems involving nonconvex loss functions and problems with specific nonconvex constraints, and it has many applications such as signal recovery, compressed sensing, and optimal power flow. We develop a proximal subgradient algorithm with extrapolation for solving these problems with guaranteed subsequential convergence to a stationary point. The convergence of the whole sequence generated by our algorithm is also established under the widely used Kurdyka–Łojasiewicz property. To illustrate the promising numerical performance of the proposed algorithm, we conduct numerical experiments on two important nonconvex models. These include a compressed sensing problem with a nonconvex regularization and an optimal power flow problem with distributed energy resources.


Introduction
In this work, we consider the structured optimization problem

min_{x∈C} F(x) := f(x) + h(Ax) − g(x), (P)

where C is a nonempty closed subset of a finite-dimensional real Hilbert space H, A is a linear mapping from H to another finite-dimensional real Hilbert space K, f : H → (−∞, +∞] is a proper lower semicontinuous (possibly nonsmooth and nonconvex) function, h : K → R is a differentiable (possibly nonconvex) function whose gradient is Lipschitz continuous with modulus ℓ, and g : H → (−∞, +∞] is a continuous weakly convex function with modulus β on an open convex set containing C. This broad optimization problem has many important applications in diverse areas, including power control problems [12], compressed sensing [30], portfolio optimization, supply chain problems, image segmentation, and others [26]. In particular, the model problem (P) covers two of the most general models in the literature.

Firstly, in statistical learning, the following optimization model is often used:

min_{x∈R^d} (ϕ(x) + γ r(x)), (1)

where ϕ is a loss function which measures the data misfit, r is a regularizer which promotes specific structure in the solution such as sparsity, and γ > 0 is a weighting parameter. Typical choices of the loss function are the least squares loss function ϕ(x) = (1/2)‖Ax − b‖², where A ∈ R^{m×d} and b ∈ R^m, and the logistic loss function, which are both convex. In the literature, nonconvex loss functions have also received increasing attention. Some popular nonconvex loss functions include the ramp loss function [19, 48] and the Lorentzian norm [41]. In addition, [2] recently showed that many regularizers r used in the literature can be written as a difference of two convex functions, and so the model (1) can be formulated as problem (P). These include popular regularizers such as the smoothly clipped absolute deviation (SCAD) [4], the indicator function of the cardinality constraint [18], the L1 − L2 regularization [30], and the minimax concave penalty (MCP)
[49]. Therefore, problem (P) can be interpreted as a problem of the form (1) whose objective function is the sum of a nonconvex and nonsmooth loss function and a regularizer which can be expressed as a specific form of difference of (possibly) nonconvex functions. Secondly, in the case when C = R^d and A is the identity mapping, problem (P) reduces to

min_{x∈R^d} f(x) + h(x) − g(x), (2)

referred to as the general difference-of-convex (DC) program, which is a broad class of optimization problems studied in the literature. To solve problem (2) under the convexity of g, a generalized proximal point algorithm was developed in [3]. For the case when both f and g are convex, [39] provided an accelerated difference-of-convex algorithm incorporating Nesterov's acceleration technique into the standard difference-of-convex algorithm (DCA) to improve its performance, while [28] proposed an inexact successive quadratic approximation method. When f, h, and g are all required to be convex, a proximal difference-of-convex algorithm with extrapolation (pDCAe) was proposed in [47], and there are also other existing studies (e.g., [32, 33]) that developed algorithms to solve such a problem.
In the cases where the loss function f is smooth and the regularizer r is prox-friendly, in the sense that its proximal operator can be computed efficiently, the proximal gradient method is a widely used algorithm for solving (1) (see, for example, [7]). Moreover, incorporating information from previous iterations to accelerate the proximal algorithm, while trying not to significantly increase the computational cost, has also been a research area which receives a lot of attention. One such approach is the extrapolation technique. In this approach, momentum terms that involve information from previous iterations are used to update the current iterate. Such techniques have been successfully implemented and have achieved significant results, including Polyak's heavy ball method [40], Nesterov's acceleration techniques [37, 36], and the fast iterative shrinkage-thresholding algorithm (FISTA) [8]. In particular, extrapolation techniques have shown competitive results for optimization problems that involve sums of convex functions [6], differences of convex functions [47, 33], and ratios of nonconvex and nonsmooth functions [11].
In view of these successes, this paper proposes an extrapolated proximal subgradient algorithm for solving problem (P). In our work, compared to the literature, the convexity and smoothness requirements on the loss function f are relaxed. We also allow a closed feasible set C instead of optimizing over the whole space. This general framework allows us to tackle problems involving nonconvex loss functions, such as the Lorentzian norm, and problems with specific nonconvex constraints, such as a spherical constraint. We then prove that the sequence generated by the algorithm is bounded and that any of its cluster points is a stationary point of the problem. We also prove the convergence of the full sequence under the assumption of the Kurdyka–Łojasiewicz property. We then evaluate the performance of the proposed algorithm on a compressed sensing problem for both convex and nonconvex loss functions, together with the recently proposed nonconvex L1 − L2 regularization. Finally, we formulate an optimal power flow problem considering photovoltaic systems placement, and address it using our algorithm.
The rest of this paper is organized as follows. Section 2 provides preliminary materials used in this work. In Section 3, we introduce our algorithm with guaranteed subsequential convergence and full sequential convergence. Section 4 presents the numerical experiments, and a conclusion is given in Section 5.

Preliminaries
Throughout this paper, H is a finite-dimensional real Hilbert space with inner product ⟨·, ·⟩ and the induced norm ‖·‖. We use the notation N for the set of nonnegative integers, R for the set of real numbers, R+ for the set of nonnegative real numbers, and R++ for the set of positive real numbers. Let f : H → (−∞, +∞]. The function f is said to be proper if its domain dom f := {x ∈ H : f(x) < +∞} is nonempty and it never takes the value −∞, lower semicontinuous if its epigraph is a closed set, and convex if its epigraph is a convex set. For x ∈ H with |f(x)| < +∞, the limiting subdifferential of f at x is defined by ∂_L f(x) := {x* ∈ H : there exist x_n →_f x and x_n* → x* with x_n* ∈ ∂̂f(x_n) for all n ∈ N}, where ∂̂f denotes the Fréchet subdifferential of f and the notation x_n →_f x means that x_n → x with f(x_n) → f(x). In the case where |f(x)| = +∞, both the Fréchet subdifferential and the limiting subdifferential of f at x are defined to be the empty set. The domain of ∂_L f is given by dom ∂_L f := {x ∈ H : ∂_L f(x) ≠ ∅}. It can be directly verified from the definition that the limiting subdifferential has the following robustness property: if x_n →_f x, x_n* ∈ ∂_L f(x_n) for all n ∈ N, and x_n* → x*, then x* ∈ ∂_L f(x). Next, we revisit some important properties of the limiting subdifferential.

Lemma 2.1 (Sum rule). Let x ∈ H and let f, g : H → (−∞, +∞] be proper lower semicontinuous functions. Suppose that f is finite at x and g is locally Lipschitz around x. Then ∂_L(f + g)(x) ⊆ ∂_L f(x) + ∂_L g(x).

Proof. This follows from [35, Proposition 1.107(ii) and Theorem 3.36].
The following result, whose proof is included for completeness, is similar to [13, Lemma 2.9].

Lemma 2.2 (Upper semicontinuity of subdifferential). Let f : H → (−∞, +∞] be locally Lipschitz continuous around x ∈ H, let x_n → x as n → +∞, and let x_n* ∈ ∂_L f(x_n) for all n ∈ N. Then the sequence (x_n*)_{n∈N} is bounded, and all of its cluster points belong to ∂_L f(x).

Proof. By the Lipschitz continuity of f around x, there are a neighborhood V of x and a constant ℓ_V ∈ R+ such that f is Lipschitz continuous on V with modulus ℓ_V. Then, by [35, Corollary 1.81], for all v ∈ V and v* ∈ ∂_L f(v), one has ‖v*‖ ≤ ℓ_V. Since x_n → x as n → +∞, there is n_0 ∈ N such that, for all n ≥ n_0, x_n ∈ V, which implies that ‖x_n*‖ ≤ ℓ_V. This means (x_n*)_{n∈N} is bounded. Now, let x* be a cluster point of (x_n*)_{n∈N}, i.e., there exists a subsequence (x_{k_n}*)_{n∈N} such that x_{k_n}* → x* as n → +∞. On the other hand, we have from the convergence of (x_n)_{n∈N} and the Lipschitz continuity of f around x that x_{k_n} →_f x. Therefore, x* ∈ ∂_L f(x) due to the robustness property of the limiting subdifferential.
We end this section with the definitions of stationary points for problem (P). A point x ∈ C is said to be a lifted stationary point of (P) if there exists g* ∈ ∂_L g(x) such that g* ∈ ∂_L(f + ι_C)(x) + A*∇h(Ax), and a stationary point of (P) if 0 ∈ ∂_L(F + ι_C)(x). Here A* is the adjoint mapping of the linear mapping A.

Proximal subgradient algorithm with extrapolation
We now propose our extrapolated proximal subgradient algorithm for solving problem (P) with guaranteed convergence to stationary points.
Step 3. If a termination criterion is not met, set n = n + 1 and go to Step 2.
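Since the displayed steps of Algorithm 3.1 are specialized later in Remark 3.2(i) and Section 4.1, the main iteration can be sketched as follows. This is an illustrative reading, not the authors' exact pseudocode: the function name, the callback signatures, and the use of a constant step size are our assumptions.

```python
import numpy as np

def extrapolated_prox_subgradient(prox_f_C, grad_h, subgrad_g, A, x0,
                                  tau, lam_seq, mu_seq, num_iter=100):
    """A plausible sketch of the iteration of Algorithm 3.1.
    prox_f_C(w, t) should return prox_{t(f + iota_C)}(w)."""
    x_prev = x = np.asarray(x0, dtype=float)
    for n in range(num_iter):
        g_n = subgrad_g(x)                     # g_n in dL g(x_n)
        u = x + lam_seq(n) * (x - x_prev)      # extrapolation for the gradient
        v = x + mu_seq(n) * (x - x_prev)       # extrapolation for the prox center
        w = v - tau * (A.T @ grad_h(A @ u)) + tau * g_n
        x_prev, x = x, prox_f_C(w, tau)
    return x
```

For instance, with f = γ‖·‖₁, C = R^d, g = γ‖·‖, and h = (1/2)‖· − b‖², this reduces to the compressed sensing update described in Section 4.1.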
Remark 3.2 (Discussion of the algorithm structure and extrapolation parameters). Some comments on Algorithm 3.1 are in order.
(i) Recalling that the proximal operator of a proper function φ : H → (−∞, +∞] is defined by prox_φ(x) := argmin_{y∈H} (φ(y) + (1/2)‖y − x‖²), we see that the update of x_{n+1} in Step 2 can be written as x_{n+1} ∈ prox_{τ_n(f+ι_C)}(v_n − τ_n A*∇h(Au_n) + τ_n g_n). This can be done efficiently for various specific structures of f and C. For example, when f is a convex quadratic function and C is a polyhedral set, computing the proximal operator of τ_n(f + ι_C) is equivalent to solving a convex quadratic programming problem. When f is a nonconvex quadratic function and C is the unit sphere, this reduces to a trust region problem, which can be solved as a generalized eigenvalue problem or a semidefinite programming problem. In addition, the proximal operator can also have a closed-form solution for some nonconvex and nonsmooth functions, e.g., f(x) = ‖x‖₁ − α‖x‖ with α ∈ R+ (see [30, Lemma 1]). For further tractable cases, see, e.g., [11, Remark 4.1].
(iii) In the case where h ≡ 0, the objective function F reduces to f − g, and the update of x_{n+1} in Step 2 reduces to x_{n+1} ∈ prox_{τ_n(f+ι_C)}(v_n + τ_n g_n). In turn, if C = H and µ_n = 0, Algorithm 3.1 reduces to the proximal linearized algorithm proposed in [44], which requires that f and g are convex.
(iv) When g ≡ 0, A is the identity mapping, and C = H, the objective function reduces to f + h. By choosing λ_n = µ_n, we have u_n = v_n, and Algorithm 3.1 reduces to the inertial forward–backward algorithm studied in [6], in which an additional requirement of the convexity of h is imposed.
(v) Motivated by the popular parameter used in FISTA and also its variants [7, Chapter 10], a plausible option for the extrapolation parameters λ_n and µ_n (which will be used in our computations later) is λ_n = λ(κ_{n−1} − 1)/κ_n and µ_n = µτ_n(κ_{n−1} − 1)/κ_n, where κ_{−1} = κ_0 = 1 and κ_n = (1 + √(1 + 4κ_{n−1}²))/2. It can be seen that, for all n ∈ N, 1 ≤ κ_{n−1} < κ_n + 1, and so λ_n ∈ [0, λ] and µ_n ∈ [0, µτ_n]. We can also reset κ_{n−1} = κ_n = 1 whenever n is a multiple of some fixed integer n_0.
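The parameter schedule in item (v), including the periodic reset, can be sketched as below. The function name is ours, and the FISTA-style recursion κ_n = (1 + √(1 + 4κ_{n−1}²))/2 with ratio (κ_{n−1} − 1)/κ_n is our assumption where the extracted formula is incomplete.

```python
import numpy as np

def extrapolation_schedule(num_iter, lam_bar, mu_bar, tau, n0=50):
    """FISTA-style parameters lambda_n, mu_n with the periodic reset
    kappa_{n-1} = kappa_n = 1 described in Remark 3.2(v) (illustrative)."""
    lam, mu = [], []
    kappa_prev = kappa = 1.0
    for n in range(num_iter):
        if n > 0 and n % n0 == 0:            # restart every n0 iterations
            kappa_prev = kappa = 1.0
        ratio = (kappa_prev - 1.0) / kappa   # stays in [0, 1)
        lam.append(lam_bar * ratio)          # lambda_n in [0, lam_bar]
        mu.append(mu_bar * tau * ratio)      # mu_n in [0, mu_bar * tau_n]
        # advance kappa_{n-1}, kappa_n via the FISTA recursion
        kappa_prev, kappa = kappa, (1.0 + np.sqrt(1.0 + 4.0 * kappa**2)) / 2.0
    return lam, mu
```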
From now on, let (x_n)_{n∈N} be a sequence generated by Algorithm 3.1. Under suitable assumptions, we show in the next theorem that (x_n)_{n∈N} is bounded and any of its cluster points is a stationary point of problem (P).

Theorem 3.3 (Subsequential convergence). For problem (P), suppose that the function F is bounded from below on C and that the set
Then the following statements hold:
(i) The sequence (F(x_n))_{n∈N} is convergent.
(ii) The sequence (x_n)_{n∈N} is bounded and ‖x_{n+1} − x_n‖ → 0 as n → +∞.
(iii) Suppose that lim inf_{n→+∞} τ_n = τ > 0 and let x be a cluster point of (x_n)_{n∈N}. Then x is a lifted stationary point of (P). Moreover, x is a stationary point of (P) provided that g is strictly differentiable on an open set containing C ∩ dom f.

Proof. (i) & (ii):
We see from Step 2 of Algorithm 3.1 that, for all n ∈ N, x n ∈ C and Therefore, for all n ∈ N and all x ∈ C, or equivalently, By the Lipschitz continuity of ∇h, we derive from [36, Lemma 1.
As g_n ∈ ∂_L g(x_n) and x_n, x_{n+1} ∈ C, it follows from the weak convexity of g and [11, Lemma 4.1] that Letting x = x_n ∈ C in (4) and combining with the last two inequalities, we obtain that By the definition of u_n and v_n, we have, and so where we have used. Rearranging terms yields, we have Since δ > 0, the sequence (F_n)_{n∈N} is nonincreasing. Since F is bounded below on C, the sequence (F_n)_{n∈N} is bounded below, and it is therefore convergent. After rearranging (5) and performing telescoping, we obtain that, for all m ∈ N, Denoting F := lim_{n→+∞} F_n and letting m → +∞, we obtain that Therefore, as n → +∞, ‖x_{n+1} − x_n‖ → 0, and so Now, we observe that (iii): As x is a cluster point of the sequence (x_n)_{n∈N}, there exists a subsequence (x_{k_n})_{n∈N} of (x_n)_{n∈N} such that x_{k_n} → x as n → +∞. Then x ∈ C and, since ‖x_{n+1} − x_n‖ → 0, one has x_{k_n−1} → x, and so do u_{k_n−1} and v_{k_n−1}. Since g + (β/2)‖·‖² is a continuous convex function on an open set O containing C, we obtain from [42, Example 9.14] that g is locally Lipschitz continuous on O. In view of Lemma 2.2, since x_{k_n} → x as n → +∞, passing to a subsequence if necessary, we can assume that g_{k_n} → g ∈ ∂_L g(x) as n → +∞.
Replacing n in (4) with k_n − 1, we have for all n ∈ N and all x ∈ C that (6) holds. As lim inf_{n→+∞} τ_n = τ > 0, letting x = x and n → +∞, we obtain that f(x) ≥ lim sup_{n→+∞} f(x_{k_n}). Since f is lower semicontinuous, it follows that lim_{n→+∞} f(x_{k_n}) = f(x). On the other hand, lim_{n→+∞} g(x_{k_n}) = g(x) and lim_{n→+∞} h(Ax_{k_n}) = h(Ax) due to the continuity of g and h. Therefore, lim_{n→+∞} F(x_{k_n}) = F(x). Next, by letting n → +∞ in (6), we obtain an inequality valid for all x ∈ C, which can be rewritten to show that x is a minimizer of the resulting function over C, and so x is a lifted stationary point of (P).
In addition, if we further require that g is strictly differentiable, then Lemma 2.1 implies that x is a stationary point of (P).
Next, we establish the convergence of the full sequence generated by Algorithm 3.1. In order to do this, we recall that a proper lower semicontinuous function G : H → (−∞, +∞] satisfies the Kurdyka–Łojasiewicz (KL) property [24, 29] at x̄ ∈ dom ∂_L G if there are η ∈ (0, +∞], a neighborhood V of x̄, and a continuous concave function φ : [0, η) → R+ such that φ is continuously differentiable with φ′ > 0 on (0, η), φ(0) = 0, and, for all x ∈ V with G(x̄) < G(x) < G(x̄) + η, one has φ′(G(x) − G(x̄)) · dist(0, ∂_L G(x)) ≥ 1. We say that G is a KL function if it satisfies the KL property at any point in dom ∂_L G. If G satisfies the KL property at x̄ ∈ dom ∂_L G, and the corresponding function φ can be chosen as φ(t) = ct^{1−θ} for some c ∈ R++ and θ ∈ [0, 1), then G is said to satisfy the KL property at x̄ with exponent θ. The function G is called a KL function with exponent θ if it is a KL function and has the same exponent θ at any x̄ ∈ dom ∂_L G.

Theorem 3.4 (Full sequential convergence). For problem (P), suppose that F is bounded from below on C, that the assumptions of Theorem 3.3 hold, and that the function G(x, y) := F(x) + ι_C(x) + c‖x − y‖², where c = (1/2)(‖A‖²λ + µ), satisfies the KL property at (x̄, x̄) for every x̄ ∈ C ∩ dom f. Then:
(i) The sequence (x_n)_{n∈N} converges to a stationary point x* of (P) and Σ_{n=0}^{+∞} ‖x_{n+1} − x_n‖ < +∞.
(ii) Suppose further that G satisfies the KL property with exponent θ ∈ [0, 1) at (x̄, x̄) for every x̄ ∈ C ∩ dom f. Then the following statements hold:
(a) If θ = 0, then (x_n)_{n∈N} converges to x* in a finite number of steps.
(b) If θ ∈ (0, 1/2], then there exist γ ∈ R++ and ρ ∈ (0, 1) such that, for all n ∈ N, ‖x_n − x*‖ ≤ γρⁿ.

Proof. For each n ∈ N, let z_n = (x_{n+1}, x_n). According to Theorem 3.3, we have that the sequence (z_n)_{n∈N} is bounded, that ‖z_{n+1} − z_n‖ → 0 as n → +∞, and that, for every cluster point z of (z_n)_{n∈N}, z = (x̄, x̄), where x̄ ∈ C ∩ dom f is a stationary point of (P). Let n ∈ N. It follows from the update of x_{n+1} in Step 2 of Algorithm 3.1 that a bound on dist(0, ∂_L G(z_n)) can be derived. Since lim inf_{n→+∞} τ_n = τ > 0, there exists n_0 ∈ N such that, for all n ≥ n_0, τ_n ≥ τ/2. Recalling that λ_n ≤ λ and µ_n/τ_n ≤ µ, we have for all n ≥ n_0 that dist(0, ∂_L G(z_n)) ≤ η_1‖x_{n+1} − x_n‖ + η_2‖x_n − x_{n−1}‖ for suitable constants η_1, η_2 > 0 depending on ℓ, ‖A‖, τ, λ, µ, and c. Now, the first conclusion follows by applying [11, Theorem 5.1] with ε_n ≡ 0. The remaining conclusions follow a rather standard line of argument as used in [5, 11, 27]; see also [10, Theorem 3.11].

Remark 3.5 (KL property and KL exponents).
In the preceding theorem, the convergence of the full sequence generated by Algorithm 3.1 requires the KL property of a function G of the form G(x, y) := F(x) + ι_C(x) + c‖x − y‖², where F is the objective function of the model problem (P), C is the feasible region of problem (P), and c > 0. We note that this assumption holds for a broad class of instances of the model problem (P) in which F is a semi-algebraic function and C is a semi-algebraic set. More generally, it continues to hold when F is a definable function and C is a definable set (see [24, 9]).
As simple illustrations, in our case study in the next section, we will consider the following two classes of functions, with c > 0 and C a semi-algebraic set in R^d. Then, in both cases, G is definable, and so it satisfies the KL property at (x, x) for all x ∈ C ∩ dom F. Moreover, for case (ii), if C is further assumed to be a polyhedral set, then, as shown in [27], the KL exponent for G is 1/2, and by Theorem 3.4, the proposed algorithm exhibits a linear convergence rate.

Case studies
In this section, we provide the numerical results of our proposed algorithm for two case studies: compressed sensing with L1 − L2 regularization, and an optimal power flow problem considering photovoltaic systems placement for a low voltage network. All of the experiments are performed in MATLAB R2021b on a 64-bit laptop with an Intel(R) Core(TM) i7-1165G7 CPU (2.80 GHz) and 16 GB of RAM.

Compressed sensing with L 1 − L 2 regularization
We consider the compressed sensing problem

min_{x∈R^d} ϕ(Ax) + γ(‖x‖₁ − α‖x‖), (7)

where A ∈ R^{m×d} is an underdetermined sensing matrix of full row rank, γ ∈ R++, and α ∈ R++. Here, ϕ can be the least squares loss function or the Lorentzian norm loss function mentioned in Remark 3.5.
In our numerical experiments, we let α = 1 to be consistent with the setting in [30]. We first start with the least squares loss function. By letting ϕ(z) = (1/2)‖z − b‖², where b ∈ R^m \ {0}, problem (7) becomes

min_{x∈R^d} (1/2)‖Ax − b‖² + γ(‖x‖₁ − ‖x‖). (8)

This is known as the regularized least squares problem, which has many applications in signal and image processing [38, 23, 30]. To solve problem (8), we use Algorithm 3.1 with f = γ‖·‖₁, h = ϕ, and g = γ‖·‖. Then the update of x_{n+1} in Step 2 of Algorithm 3.1 reads as x_{n+1} = prox_{γτ_n‖·‖₁}(w_n), where w_n = v_n − τ_n A*(Au_n − b) + γτ_n g_n and g_n ∈ ∂_L ‖·‖(x_n). In this case, the proximal operator is the soft shrinkage operator [8], and so, for all i = 1, …, d, (x_{n+1})_i = sign((w_n)_i) max{|(w_n)_i| − γτ_n, 0}. For this test case, we compare our proposed Algorithm 3.1 with the following algorithms:
• Alternating direction method of multipliers (ADMM) proposed in [30];
• Generalized proximal point algorithm (GPPA) proposed in [3];
• Proximal difference-of-convex algorithm with extrapolation (pDCAe) in [47].
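A minimal sketch of one Step 2 update for problem (8), using the soft shrinkage operator described above. The subgradient choice g_n = x_n/‖x_n‖ for x_n ≠ 0, and g_n = 0 otherwise, is one valid selection from ∂_L‖·‖(x_n); the function name and signature are ours.

```python
import numpy as np

def l1_minus_l2_step(x, x_prev, A, b, gamma, tau, lam_n, mu_n):
    """One iteration of the Step 2 update for problem (8):
    w_n = v_n - tau_n A^T (A u_n - b) + gamma tau_n g_n,
    followed by soft shrinkage with threshold gamma * tau_n."""
    nrm = np.linalg.norm(x)
    g = x / nrm if nrm > 0 else np.zeros_like(x)   # subgradient of ||.|| at x_n
    u = x + lam_n * (x - x_prev)                   # extrapolated gradient point
    v = x + mu_n * (x - x_prev)                    # extrapolated prox center
    w = v - tau * (A.T @ (A @ u - b)) + gamma * tau * g
    return np.sign(w) * np.maximum(np.abs(w) - gamma * tau, 0.0)
```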
For our proposed algorithm, δ = 5 × 10^{−25}, λ = 0.1, µ = 0.01, and τ_n = 1/(2δ + ‖A‖²(2λ + 1) + 2µ), with ‖A‖ being the spectral norm, and λ_n, µ_n are chosen as in Remark 3.2(v). Here, we adopt the well-known restarting technique (see, for example, [7, Chapter 10]) and reset κ_{n−1} = κ_n = 1 every 50 iterations. Note that this technique has been utilized in several existing works such as [11, 47]. We generate the vector b based on the same method as in [30]. In generating the matrix A, we use both randomly generated Gaussian matrices and discrete cosine transform (DCT) matrices. For each case, we consider different matrix sizes m × d with sparsity level s as given in Table 1. For the ground truth sparse vector x_g, a random index set is generated and the non-zero elements are drawn following the standard normal distribution. The stopping condition for all algorithms is ‖x_{n+1} − x_n‖/‖x_n‖ < 10^{−8}. In Table 2, we report the CPU time, the number of iterations, the function values at termination, and the error to the ground truth at termination, averaged over 30 random instances. It can be observed that, since Step 2 involves matrix multiplications, the CPU time increases significantly with the dimension of the matrices. In addition, in terms of running time, objective function values, the number of iterations used, and the error with respect to the ground truth solution, our proposed algorithm outperforms ADMM and GPPA in all test cases. Our algorithm also appears to be comparable to pDCAe. Note that our algorithm can be applied to a more general framework than the others.
The Lorentzian norm can be useful in robust sparse signal reconstruction [41]. In this case, with the Lorentzian norm loss ϕ(z) = Σ_{i=1}^m log(1 + (z_i − b_i)²), the optimization problem (7) becomes min_{x∈R^d} ϕ(Ax) + γ(‖x‖₁ − ‖x‖). We note that ∇ϕ is Lipschitz continuous with modulus ℓ = 2. Since the loss function is now nonconvex and the pDCAe algorithm in [47] requires a convex loss function, pDCAe is not applicable in this case. Moreover, the ADMM algorithm in [30] is also not directly applicable due to the presence of the Lorentzian norm. Therefore, we compare our method with the GPPA only. For GPPA, we let h(x) = ϕ(Ax). The stepsize for GPPA is τ = 0.8/(2λ_max(A^T A)). For this case, we set γ = 0.001 and run GPPA and our proposed algorithm, both initialized at the origin, for a maximum of 4000 iterations. The remaining parameters of our algorithm are set to the same values as before. We also use 30 random instances of the previous 8 test cases. The results are presented in Table 3. It can be seen from Table 3 that the proposed algorithm outperforms GPPA in this case.
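For concreteness, a sketch of the Lorentzian loss and its gradient, assuming a unit scale parameter (an assumption consistent with the stated Lipschitz modulus ℓ = 2); the function names are ours.

```python
import numpy as np

def lorentzian_loss(z):
    # sum_i log(1 + z_i^2), i.e., the Lorentzian norm with unit scale
    return np.sum(np.log1p(z**2))

def lorentzian_grad(z):
    # d/dz_i log(1 + z_i^2) = 2 z_i / (1 + z_i^2); this map is
    # Lipschitz with modulus 2, since its derivative 2(1 - t^2)/(1 + t^2)^2
    # attains its maximum absolute value 2 at t = 0
    return 2.0 * z / (1.0 + z**2)
```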

Optimal power flow considering photovoltaic systems placement
Optimal power flow (OPF) is a well-known problem in power system engineering [1]. The integration of distributed energy resources (DERs), such as photovoltaic systems, has become increasingly popular in modern smart grids [45], leading to the need for more sophisticated OPF models that consider the DERs. Metaheuristic algorithms are popular in solving OPF, and they have also been applied to solve the OPF with DERs integration [43, 22]. However, the drawbacks of metaheuristic algorithms are that a convergence proof cannot be established and that their performance is not consistent [14]. Difference-of-convex programming has also been successfully applied to solve the OPF problem in [34], although DERs are not considered there. Motivated by the aforementioned results, in this work we apply our proposed algorithm to solve the OPF in a low voltage network, which includes optimizing the placement of photovoltaic (PV) systems. We formulate two models, based on the Direct Current OPF (DC OPF) [21] and the Alternating Current OPF (AC OPF) [15]. To the best of the authors' knowledge, this is the first time a proximal algorithm is used to solve a DER-integrated OPF with a difference-of-convex formulation, considering PV systems placement. The objective function aims at minimizing the cost of the conventional generator, which is a diesel generator in this case study, while maximizing the PV penetration, which is defined as the ratio of the power generated by the PV systems to the total demand. The network considered in this case study is illustrated in Figure 1, and consists of 14 buses. This case study is taken from a real low voltage network in Victoria, Australia. Currently, there are demands at buses 1, 3, 4, 6, 8, 9, 13, and 14. There are 6 PV systems at buses 1, 2, 4, 5, 7, and 8 with a capacity of 800 kW. A 5000 kW diesel generator is connected to bus 11. All of the parameters and decision variables in this case study are presented in Table 6. The cost of
the current situation (before optimization is performed) is based on the cost of active power withdrawn from the generator, plus the installation cost of the PV systems. To determine this initial cost, the amount of active power generated by the generator is determined via DIgSILENT PowerFactory 2021. After that, the cost of active power is calculated by the expression Σ_{i∈M} (a(P_i^G)² + bP_i^G + c), plus the installation cost of the six PV systems.
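The initial-cost computation described above can be sketched as follows; the function names and the coefficient values in the usage example are illustrative placeholders, not data from the case study.

```python
def generation_cost(p_gen, a, b, c):
    """Quadratic diesel generation cost a*P^2 + b*P + c, summed over the
    generator outputs in p_gen (coefficients are illustrative placeholders)."""
    return sum(a * p**2 + b * p + c for p in p_gen)

def total_initial_cost(p_gen, a, b, c, n_pv, pv_install_cost):
    # generation cost plus the installation cost of the existing PV systems
    return generation_cost(p_gen, a, b, c) + n_pv * pv_install_cost
```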
We first formulate the OPF problem with PV, based on the DC OPF, as model (10). Taking into account the above equivalence, a plausible alternative optimization model for the OPF problem with PV is model (11). The objective function (11a) aims at minimizing the installation cost and the generation cost of the diesel generator while maximizing the PV penetration, which is defined as the ratio of the total PV generation to the total demand Σ_i D_i. As in [20], the parameter γ > 0 serves as a Lagrange multiplier for the discrete constraints X_i ∈ {0, 1}.
With this reformulation, the objective function (11a) now becomes a difference-of-convex function. Constraint (10b) describes the relationship between the power flow from one bus to another and their corresponding phasor angles, constraint (10c) fixes the voltage angle at the slack bus, which is the bus connected to the diesel generator, constraints (10d) and (10e) define the power flow in and out of any bus, constraint (10f) ensures that the PV penetration rate is at least 50 percent, constraint (10g) defines the transmission limits of the transmission lines, and constraint (10h) makes sure that solar power only exists at a bus when there is a PV system at that bus. Finally, constraint (10i) defines the boundaries of the remaining decision variables. All of the constraints form the feasible set S. This problem takes the form of (P) with f = ι_S, h given by the smooth part of the objective (11a), and g = γ Σ_{i∈N} (X_i² − X_i). By Remark 3.5(ii) and Theorem 3.4, in this case the proposed algorithm converges with a linear rate. The update of x_{n+1} in Algorithm 3.1 becomes the projection onto the feasible set S. Here, x = [P_{11}^G, X_1, …, X_{14}, θ_1, …, θ_{14}, P_{1,1}, …, P_{1,14}, …, P_{14,1}, …, P_{14,14}]^T. This step is solved by MATLAB's quadprog command. Noting that β = 0 (since g is convex) and ℓ = 2a, the parameters are set as follows: δ = 5 × 10^{−25}, λ = 0.1, µ = 0.01, τ_n = 1/(2δ + ℓ(2λ + 1) + 2µ), µ_n = µτ_n, and λ_n is chosen in the same way as in Section 4.1. The performance of the proposed algorithm is compared with the GPPA and the pDCAe, as illustrated in Table 4. We use the step size τ_n = 0.8/ℓ for GPPA and τ_n = 1/ℓ for pDCAe. The maximum number of iterations is 1000, and the stopping condition is the same as the one used in Section 4.1.
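The penalty treatment of the binary placement variables used above can be written out as follows; this is our formatting of the standard exact-penalty equivalence, with Φ denoting the (hypothetical) convex remainder of the objective:

```latex
X_i \in \{0,1\}
\;\iff\;
X_i \in [0,1] \ \text{and}\ X_i - X_i^2 \le 0 ,
\qquad
\Phi(x) + \gamma \sum_{i \in N} \bigl(X_i - X_i^2\bigr)
= \Phi(x) - g(x),
\quad
g(x) := \gamma \sum_{i \in N} \bigl(X_i^2 - X_i\bigr).
```

Since each X_i² − X_i is a convex quadratic, g is convex, which matches the decomposition g = γ Σ_{i∈N} (X_i² − X_i) used above and explains why β = 0 in the parameter settings.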
We test all algorithms 30 times; each time, we use a random starting point between the upper bound and the lower bound of the variables. The mean objective function values and the best objective function values found by all algorithms are reported in Table 4. Although the proposed algorithm, on average, needs more iterations than the remaining ones, it can find a better solution. The mean objective function value found by our algorithm is also better than the ones found by the other algorithms. Our algorithm is also comparable to the GPPA and the pDCAe in terms of average CPU time.
The details of the best solution found by our algorithm are shown in Figure 1. Now we consider the AC OPF model. The formulation is based on the branch flow model given in [15]. Firstly, the network is treated as a directed graph, as shown in Figure 2. We denote a directed link by (i, j) or i → j if it points from bus i to bus j, and the set of all directed links by E. Next, the formulation is given as follows, with P_{0,11} = Q_{0,11} = 0 (12d). The main differences between the AC OPF model and the DC OPF model are that the AC OPF model has a nonconvex feasible set and that it also accounts for the losses in the network as well as the reactive power. Consequently, AC OPF is more accurate than DC OPF in practice [17], and, due to its nonconvexity, it is also more challenging to solve [31]. Constraints (12e)–(12h) define the power flow in any directed link. Constraint (12i) ensures that the PV penetration rate is at least 50 percent. Constraints (12j) and (12k) ensure that the active and reactive power from PV systems only exist at a bus if and only if there is a PV system at that bus. Constraint (12l) describes the relationship between the voltages of any two buses in a directed link. Constraint (12m) is a nonconvex constraint ensuring that the solution has physical meaning. Finally, constraints (12n)–(12t) define the boundaries of the decision variables. The update of x_{n+1} is also the same as before. For this case, we also perform the same numerical experiment as in the DC OPF case. However, the pDCAe is not applicable in this case, so we compare our algorithm with the GPPA only. The parameters of GPPA and our proposed algorithm are set to the same values as those used for the DC OPF model. Due to the nonconvex constraint, MATLAB's fmincon is used to solve the subproblem in Step 2 instead of quadprog. The results are shown in Table 5. Table 5 shows that our proposed algorithm takes less time and fewer iterations than the GPPA to converge. The best solution found by our algorithm in
this case is also the same as the one found in the DC OPF model.
It can be seen that, for both DC OPF and AC OPF, two PV systems need to be installed at buses 7 and 9, the remaining demand can be supplied by the generator, and the demands are satisfied by the power flows. Although the mathematical model aims at maximizing the PV penetration, drawing power from the diesel generator is still more economical due to the high installation cost of the PV systems. The solution significantly reduces the cost, by approximately 70%, compared with the original situation. This can serve as a proof of concept for future research.

Conclusion
We have proposed an extrapolated proximal subgradient algorithm for minimizing a class of structured nonconvex and nonsmooth optimization problems. Our algorithm imposes weaker smoothness and convexity requirements for establishing its convergence. In addition, our choice of the extrapolation parameters is flexible enough to cover the popular one used in FISTA and its variants. The convergence of the whole sequence generated by our algorithm is proved via the abstract convergence framework given in [11]. In numerical experiments on a compressed sensing problem with the nonconvex L1 − L2 regularization, the proposed algorithm exhibits very competitive results compared with some existing algorithms. We have also applied this algorithm to solve an OPF problem considering PV placement, which serves as a proof of concept for future works.

A. Data of Case study 4.2
In Case study 4.2, we use a base power of 100 MVA and a base voltage of 22 kV. All of the parameters are converted into per unit (pu) values in the calculation. Readers can refer to [46, Chapter 2] for a detailed tutorial on the per unit system. The active and reactive power demands are given in Table 7. The other technical parameters of the system, including the susceptance, resistance, and reactance of the lines, are given in Table 8, Table 9, and Table 10, respectively. Among the variables listed in Table 6 are: the current between bus i and bus j (i, j ∈ N, i ≠ j); θ_i, the voltage angle of bus i (i ∈ N); P_{ij}, the active power flow between bus i and bus j (i, j ∈ N, i ≠ j); and Q_{ij}, the reactive power flow between bus i and bus j (i, j ∈ N, i ≠ j).

Figure 1 :
Figure 1: Best solution found in Case study 4.2, together with the total cost.

Figure 2 :
Figure 2: Directed graph representation of the network.
The research of TNP was supported by Henry Sutton PhD Scholarship Program from Federation University Australia.The research of MND benefited from the FMJH Program Gaspard Monge for optimization and operations research and their interactions with data science, and was supported by a public grant as part of the Investissement d'avenir project, reference ANR-11-LABX-0056-LMH, LabEx LMH.The research of GL was supported by Discovery Project 190100555 from the Australian Research Council.
If f + (α/2)‖·‖² is convex on an open convex set containing C for some α ∈ R+, then f is said to be weakly convex on C with modulus α. Some examples of weakly convex functions are quadratic functions, convex functions, and differentiable functions with Lipschitz continuous gradient.
Let x ∈ H with |f(x)| < +∞. The Fréchet subdifferential of f at x is defined by ∂̂f(x) := {x* ∈ H : lim inf_{y→x, y≠x} [f(y) − f(x) − ⟨x*, y − x⟩]/‖y − x‖ ≥ 0}.

Table 1 :
Test cases for Case study 4.1

Table 2 :
Results of 30 randomly generated instances for 8 test cases - least square loss function

Table 3 :
Results of 30 randomly generated instances for 8 test cases - Lorentzian norm loss function

Table 5 :
Comparison of GPPA and the proposed algorithm on 30 runs of the AC OPF model

Table 6 :
Parameters and variables of Case study 4.2

Table 7 :
Active and Reactive power demand