An Optimal Subgradient Algorithm with Subspace Search for Costly Convex Optimization Problems

This paper presents an acceleration of the optimal subgradient algorithm OSGA (Neumaier in Math Program 158(1–2):1–21, 2016) for solving structured convex optimization problems, where the objective function involves costly affine and cheap nonlinear terms. We combine OSGA with a multidimensional subspace search technique, which leads to a low-dimensional auxiliary problem that can be solved efficiently. Numerical results concerning some applications are reported. A software package implementing the new method is available.


Introduction
Over the past few decades, solving convex optimization problems with smooth or nonsmooth objectives has received much attention due to many applications in applied sciences and engineering, cf. [15,50]. For smooth problems, first- and second-order information is typically available and many first- and second-order methods exist, see [37,44]. However, for nonsmooth problems, usually only first-order information is available. Solving nonsmooth problems is commonly harder than solving smooth problems; however, many nonsmooth problems have a favorable structure that can be exploited to design efficient methods for them. Because of their low memory requirements, first-order methods are especially important for problems with a large number of variables.
Subgradient methods constitute a class of first-order methods that have been developed since 1960 to solve convex nonsmooth optimization problems, see, e.g., [46,49]. In general, they only need function values and subgradients, have low memory requirements, and can be used for solving convex optimization problems with several millions of variables. However, they typically need many iterations to attain a highly accurate solution. The low convergence speed of subgradient methods is reflected in their complexity (the number of iterations required to attain an ε-solution for a given ε > 0). In 1983, Nemirovski and Yudin [36] proved that the worst-case complexity bound to achieve an ε-solution of problems with a Lipschitz continuous convex nonsmooth objective by first-order methods is $O(\varepsilon^{-2})$, while it is $O(\varepsilon^{-1/2})$ for smooth problems with Lipschitz continuous gradients.
Algorithms attaining the optimal worst-case complexity bound for a class of problems are called optimal. Historically, optimal first-order methods for smooth convex optimization date back to Nesterov [38] in 1983. Later, in [40,41], he proposed two gradient-type methods with the optimal complexity for minimizing a sum of two functions (composite problems), where, for the first method, the smooth part of the objective needs to have Lipschitz continuous gradients and, for the second one, the smooth part of the objective needs to have Hölder continuous gradients. Since 1983 many researchers have studied optimal first-order methods; see, e.g., Auslander and Teboulle [7], Beck and Teboulle [11], Devolder et al. [22], Gonzaga et al. [26,27], Lan [30], Lan et al. [31], Nesterov [37,39,40], and Tseng [52]. Moreover, Nemirovski and Yudin in [36] showed that the subgradient, subgradient projection, and mirror descent methods attain the complexity $O(\varepsilon^{-2})$ for Lipschitz continuous nonsmooth objectives, so that they are optimal for this class of problems. Recently, Neumaier in [42] proposed a subgradient algorithm called OSGA, which attains both the optimal complexity $O(\varepsilon^{-1/2})$ for smooth problems with Lipschitz continuous gradients and the optimal complexity $O(\varepsilon^{-2})$ for Lipschitz continuous nonsmooth problems. Notably, OSGA requires no global information about the objective function, such as Lipschitz constants, and behaves well for problems arising in applications, see Ahookhosh and Neumaier [1,4–6].
A multidimensional subspace search scheme is a generalization of line search techniques, which are one-dimensional search schemes for finding a step-size along a specific direction. Hence, in multidimensional subspace search, one searches for a vector of step-sizes giving the best combination of several search directions for optimizing an objective function. Generally, subspace search techniques form a class of descent methods that can be used independently or employed as an accelerator inside iterative schemes to attain faster convergence. The pioneering work on subspace optimization was proposed in 1969 for smooth problems by Miele and Cantrell [33] and Cragg and Levy [21], who defined a memory gradient technique based on a subspace of the form $S = \mathrm{span}\{-g_k, d_{k-1}\}$, where $g_k$ denotes the gradient of the function at $x_k$ and $d_{k-1}$ is the last available direction. Since then, many subspace search schemes have been proposed by selecting various search directions, see, e.g., [18,20,51] and references therein. Depending on the search directions used for constructing a subspace, two classes of subspace methods are distinguished, namely, gradient-type techniques [21,23,34] and Newton-type schemes [28,32,53,54].
Content In this paper we propose an accelerated version of OSGA (called OSGA-S) for solving convex optimization problems involving costly linear operators and cheap nonlinear terms. Our new method is a two-stage method that solves unconstrained nonsmooth convex optimization problems of the form
$$\min_{x \in V}\ f(x) := \sum_{i=1}^{p} f_i(A_i x), \tag{1}$$
where, for i = 1, . . . , p ($p \ll n$), $f_i : U_i \to \mathbb{R}$ is a (non)smooth, proper, and convex function, and $A_i : V \to U_i$ is a linear operator, for real finite-dimensional vector spaces V and $U_i$. Solving (1) with OSGA involves two key steps, namely, providing the first-order information and solving an auxiliary high-dimensional subproblem (Eq. (5) below).
Since the problem (1) is unconstrained, the exact solution of the corresponding auxiliary problem is given in a closed form, cf. [1,42]. In many applications involving overdetermined systems of equations and classification with support vector machines (see Sects. 4.1 and 4.2), the objective function has the form (1) involving costly affine but cheap nonlinear terms. Hence the most costly parts in computing function values and subgradients are related to applying forward and adjoint operators. We therefore try to improve the current best point in each iteration by an inner iteration solving a low-dimensional version of the original problem, using a subspace composed from the best point and the last few iterations. We emphasize that applying the subspace search involves no additional costly forward and adjoint operators. Therefore, the subspace search stage does not impose a significant cost on the outer scheme OSGA, while improving its performance considerably. As proved in [42], this does not affect the worst-case complexity of the algorithm if done in the right place. However, our numerical results show that it successfully reduces the number of iterations and the running time needed in practice.
Similar to OSGA, OSGA-S requires no global information except the strong convexity parameter μ (μ = 0 if it is not available), and it only requires first-order information; however, the main advantage of OSGA-S is its ability to handle problems with the complex structure of the form (1), involving compositions of several functions and linear operators. Such structured problems have received much attention due to the increasing interest in using mixed regularization terms, e.g., [8].
However, if the functions $f_i$, i = 1, . . . , p, are nonsmooth, then smooth solvers, Nesterov-type optimal methods [37,38,40], and proximal splitting methods [11,19] are not able to handle the problem. In this case subgradient methods [14] and Nesterov's universal gradient method with the level of smoothness parameter ν = 0 [41] can deal with the problem. Note that the mirror descent methods [12,36] can only handle the constrained version of (1). On the other hand, to the best of our knowledge there is little work involving subspace search techniques for nonsmooth optimization problems [23,35], and these methods are based on smoothing the objective functions, so they cannot be used in our numerical comparison.
For high-dimensional problems involving dense matrices, applying OSGA-S with a multidimensional subspace search results in a substantial gain in the running time, despite the extra effort needed for applying the subspace optimization. Indeed, the inner level runs OSGA-S only on a low-dimensional unconstrained auxiliary problem in an adaptive multidimensional subspace, and the associated solution is used to accelerate the outer level of OSGA-S iteration on the original problem. The multidimensional subspace uses some previously computed directions and results in a low-dimensional problem with typically at most 20 variables. Numerical experiments and comparison with subgradient methods and the universal gradient method show that the subspace search can significantly accelerate OSGA, especially when the objective involves costly linear operators.
The remainder of this paper is organized as follows. In the next section we briefly review the main idea of OSGA. Section 3 describes a combination of OSGA and a multidimensional subspace search. Numerical results are reported in Sect. 4, and some conclusions are given in Sect. 5.
Notations Let V be a real finite-dimensional vector space endowed with the norm $\|\cdot\|$, and let $V^*$ denote its dual space, which is formed by all linear functionals on V, where the bilinear pairing $\langle g, x\rangle$ denotes the value of the functional $g \in V^*$ at $x \in V$. If $V = \mathbb{R}^n$, then, for $1 \le p \le \infty$,
$$\|x\|_p := \Big(\sum_{i=1}^{n} |x_i|^p\Big)^{1/p},$$
with the usual convention $\|x\|_\infty := \max_i |x_i|$ for $p = \infty$. A vector $g \in V^*$ is called a subgradient of a convex function f at x if $f(z) \ge f(x) + \langle g, z - x\rangle$ for all $z \in V$. The set of all subgradients is called the subdifferential of f at x, denoted by $\partial f(x)$.
We denote by $f_x$ and $g_x$ the function value f(x) and a subgradient of f at x ∈ C, respectively.

A Review of OSGA
In this section we briefly review the main idea of the optimal subgradient algorithm (see Algorithm 1) proposed by Neumaier in [42] for solving the convex constrained minimization problem
$$\min\ f(x) \quad \text{s.t.} \quad x \in C, \tag{2}$$
where f : C → R is a proper and convex function defined on a nonempty, closed, and convex subset C of V. OSGA is a subgradient algorithm for problem (2) that uses first-order information, i.e., function values and subgradients, to construct a sequence of iterates $\{x_k\} \subset C$ whose sequence of function values $\{f(x_k)\}$ converges to the minimum $\hat f = f(\hat x)$ with the optimal complexity. OSGA requires no information regarding global parameters such as Lipschitz constants of function values and gradients. In the unconstrained version relevant for the present work, we have C = V, and we work with a quadratic prox-function
$$Q(z) := Q_0 + \tfrac{1}{2}\|z - x_0\|_2^2,$$
where $x_0 \in V$ is a given starting point and $Q_0$ an appropriate positive constant. Let us denote by $g_Q(x)$ the gradient of Q at x. At each iteration, OSGA satisfies the bound
$$0 \le f(x_b) - \hat f \le \eta\, Q(\hat x) \tag{3}$$
on the currently best function value $f(x_b)$ with a monotonically decreasing error factor η that is guaranteed to converge to zero by an appropriate steplength selection strategy (see Procedure PUS). Note that $\hat x$ is not known, thus the error bound is not fully constructive, but it is enough to guarantee the convergence of $f(x_b)$ to $\hat f$ with a predictable worst-case complexity. To maintain (3), OSGA considers linear relaxations of f at z,
$$f(z) \ge \gamma + \langle h, z\rangle \quad \text{for all } z \in C, \tag{4}$$
where γ ∈ R and $h \in V^*$, updated using linear underestimators available from the subgradients evaluated (see Algorithm 1). For each such linear relaxation, OSGA solves a maximization problem of the form
$$E(\gamma, h) := \sup_{x \in C} E_{\gamma,h}(x), \tag{5}$$
where
$$E_{\gamma,h}(x) := -\frac{\gamma + \langle h, x\rangle}{Q(x)}. \tag{6}$$
Let $\gamma_b := \gamma - f(x_b)$, and let $u := U(\gamma_b, h) \in C$ be a solution of (5). From (4) and (6), we obtain
$$E(\gamma_b, h) \ge -\frac{\gamma - f(x_b) + \langle h, \hat x\rangle}{Q(\hat x)} \ge \frac{f(x_b) - \hat f}{Q(\hat x)} \ge 0. \tag{7}$$
Setting $\eta := E(\gamma_b, h)$ in (7) implies that (3) is valid. If $x_b$ is not optimal then the right inequality in (7) is strict, and since $Q(z) \ge Q_0 > 0$, we conclude that the maximum η is positive.
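For the unconstrained case, the maximization problem (5) can be solved in closed form, cf. [1,42]. The following is a minimal sketch of that computation, assuming the quadratic prox-function $Q(z) = Q_0 + \tfrac12\|z - x_0\|_2^2$ and $E_{\gamma,h}(z) = -(\gamma + \langle h, z\rangle)/Q(z)$; the function names are illustrative and are not taken from the authors' software. Setting the gradient of $-(\gamma + \langle h, z\rangle) - e\,Q(z)$ to zero gives $z = x_0 - h/e$, and the optimal value e is the positive root of $Q_0 e^2 + \beta e - \tfrac12\|h\|_2^2 = 0$ with $\beta = \gamma + \langle h, x_0\rangle$.

```python
import math


def prox(z, x0, q0):
    """Quadratic prox-function Q(z) = q0 + 0.5 * ||z - x0||_2^2 (assumed form)."""
    return q0 + 0.5 * sum((zi - xi) ** 2 for zi, xi in zip(z, x0))


def e_value(z, gamma, h, x0, q0):
    """E_{gamma,h}(z) = -(gamma + <h, z>) / Q(z)."""
    inner = sum(hi * zi for hi, zi in zip(h, z))
    return -(gamma + inner) / prox(z, x0, q0)


def solve_subproblem(gamma, h, x0, q0):
    """Return (e, u): the supremum of E_{gamma,h} over R^n and a maximizer u."""
    beta = gamma + sum(hi * xi for hi, xi in zip(h, x0))
    hh = sum(hi * hi for hi in h)
    # Positive root of q0 * e^2 + beta * e - 0.5 * ||h||^2 = 0.
    e = (-beta + math.sqrt(beta * beta + 2.0 * q0 * hh)) / (2.0 * q0)
    if e <= 0.0:
        # Only possible if h = 0 and beta >= 0: the supremum is 0 and need
        # not be attained, so x0 is returned as a fallback point.
        return 0.0, list(x0)
    u = [xi - hi / e for xi, hi in zip(x0, h)]
    return e, u


# Small example: gamma = -1, h = (1, 0), x0 = 0, Q0 = 1.
e, u = solve_subproblem(-1.0, [1.0, 0.0], [0.0, 0.0], 1.0)
```

In this example $\beta = -1$ and $\|h\|_2^2 = 1$, so $e = (1 + \sqrt{3})/2$ and $u = x_0 - h/e$; evaluating $E_{\gamma,h}$ at u reproduces e.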
In each step, OSGA uses the following scheme for updating the given parameters α, h, γ, η, and u; see [42] for more details.
Procedure PUS (parameters updating scheme): ... if $\bar\eta < \eta$ then $h \leftarrow \bar h$; $\gamma \leftarrow \bar\gamma$; $\eta \leftarrow \bar\eta$; $u \leftarrow \bar u$; end
Algorithm 1: OSGA (optimal subgradient algorithm): ... update the parameters α, h, γ, η and u using PUS;
In [42], it is shown that the number of iterations to achieve an ε-optimum is of the optimal order $O(\varepsilon^{-1/2})$ for a smooth f with Lipschitz continuous gradients and of the order $O(\varepsilon^{-2})$ for a Lipschitz continuous nonsmooth f. The algorithm has low memory requirements so that, if the subproblem (5) can be solved efficiently, OSGA is appropriate for solving large-scale problems. Numerical results reported by Ahookhosh in [1,3] for unconstrained problems, and by Ahookhosh and Neumaier in [4–6] for simply constrained problems, show the good behavior of OSGA for solving practical problems.
Note that there is some flexibility in choosing the point $x_b$ in Line 8 of OSGA. In the next section we give a two-stage scheme (called OSGA-S) in which the outer stage is OSGA and the inner stage is a multidimensional subspace search used to produce a suitable point without imposing a significant computational cost on the outer stage OSGA for solving the problem (1).

Structured Convex Optimization Problems
In this paper we consider the convex optimization problem (1), which appears in many applications such as signal and image processing, machine learning, statistics, data fitting, and inverse problems; see, e.g., [15,50].
In many applications, the objective function of (1) involves expensive linear mappings (equivalently, matrix-vector products with dense matrices). To apply a first-order method for minimizing such problems, the first-order oracle (function values and subgradients) should be available, i.e.,
$$f(x) = \sum_{i=1}^{p} f_i(A_i x), \qquad \partial f(x) = \sum_{i=1}^{p} A_i^* \,\partial f_i(A_i x).$$
Hence, in each call of the first-order oracle, p forward operators $A_i$, i = 1, . . . , p, and p adjoint operators $A_i^*$, i = 1, . . . , p, must be applied, requiring $O(n^2)$ operations. This leads to expensive function and subgradient evaluations, such that the total cost of using a first-order method is dominated by the cost of applying forward and adjoint linear operators. This motivates the quest for developing an acceleration of OSGA using a multidimensional subspace search for solving such problems.
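The structure of the first-order oracle described above can be sketched as follows for an objective of the form $f(x) = \sum_i f_i(A_i x)$ with dense matrices $A_i$. This is only an illustration: the helper names (`matvec`, `rmatvec`, `oracle`) and the two-term example objective $\|Ax - y\|_1 + \tfrac12\|x\|_2^2$ are assumptions made here, not part of the paper's software. Each term costs one forward product $A_i x$ and one adjoint product $A_i^T g_i$, which is where the dominant cost arises.

```python
def matvec(A, x):
    """Forward operator: A x for a dense matrix stored as a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def rmatvec(A, g):
    """Adjoint operator: A^T g."""
    return [sum(A[i][j] * g[i] for i in range(len(A))) for j in range(len(A[0]))]


def oracle(x, terms):
    """Return f(x) and one subgradient for f(x) = sum_i f_i(A_i x).

    terms is a list of triples (A_i, f_i, subgrad_i), where subgrad_i(v)
    returns one subgradient of f_i at v.
    """
    fx = 0.0
    gx = [0.0] * len(x)
    for A, f, subgrad in terms:
        v = matvec(A, x)              # costly forward product
        fx += f(v)
        w = rmatvec(A, subgrad(v))    # costly adjoint product
        gx = [a + b for a, b in zip(gx, w)]
    return fx, gx


def sign(t):
    return (t > 0) - (t < 0)


# Illustrative objective: ||A x - y||_1 + 0.5 * ||x||_2^2.
A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
y = [1.0, 1.0, 1.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
terms = [
    (A, lambda v: sum(abs(vi - yi) for vi, yi in zip(v, y)),
        lambda v: [sign(vi - yi) for vi, yi in zip(v, y)]),
    (identity, lambda v: 0.5 * sum(vi * vi for vi in v), lambda v: list(v)),
]
fx, gx = oracle([1.0, -1.0], terms)
```

At x = (1, −1) the residual Ax − y equals (−2, −2, −2), so the first term contributes the value 6 and the subgradient $A^T(-1,-1,-1) = (-9,-12)$, while the second contributes 1 and (1, −1).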
The primary idea of multidimensional subspace methods is to restrict the next iteration to a low-dimensional subspace by constructing a subproblem with a reduced dimension. Let us fix $M \ll n$, where n is the number of variables. Let the sequence $\{x_k\}_{k \ge 0}$ be generated by an iterative scheme, and let $d_1, \dots, d_M$ be search directions spanning the subspace
$$S := \mathrm{span}\{d_1, \dots, d_M\}. \tag{8}$$
In this case a direction d belongs to the subspace S if and only if there exist constants $t_1, \dots, t_M$ such that
$$d = \sum_{j=1}^{M} t_j d_j = U t, \tag{9}$$
where $U := [d_1, \dots, d_M]$ is a matrix constructed from the directions considered and $t = (t_1, t_2, \dots, t_M)^T$ is a vector of coefficients. Afterwards, the M-dimensional minimization problem
$$\min_{t \in \mathbb{R}^M}\ f(x + U t) = \sum_{i=1}^{p} f_i(v_i + V_i t) \tag{10}$$
is considered to determine the best possible vector of coefficients t, where $v_i := A_i x$ and $V_i := A_i U$. The minimization problem (10) shows that the procedure of searching for the best possible direction of the form (9) in the subspace (8) generalizes the idea of exact line search, see, e.g., [44], but it provides an approximate minimization. One can also construct the subspace from M points $x^{(1)}, \dots, x^{(M)}$,
$$S := \mathrm{span}\{x^{(1)}, \dots, x^{(M)}\}, \tag{11}$$
with $U := [x^{(1)}, \dots, x^{(M)}]$. Then the subspace minimization is defined by
$$\min_{t \in \mathbb{R}^M}\ f(U t) = \sum_{i=1}^{p} f_i(V_i t). \tag{12}$$
Since $M \ll n$, the minimization subproblems (10) and (12) are low-dimensional and can be solved efficiently by classical optimization methods. Hence subspace search techniques can be implemented extremely fast. This leads to suitable schemes for large-scale optimization as the number of variables in practical problems grows. Moreover, using a multidimensional subspace search as an inner step of iterative schemes needs little memory and may be considerably cheaper than performing one step of the algorithm in the full dimension. Further, many common ideas in nonlinear optimization can be considered as multidimensional subspace search techniques, namely conjugate gradient, limited memory quasi-Newton, and memory gradient methods; see, e.g., [18,23,54].
Motivated by the above discussion, the multidimensional subspace search scheme can be outlined as follows:
Algorithm 2: MDSS (multidimensional subspace search): ... solve (10) or (12) inexactly to find $t^*$; ...
To implement Algorithm 2 successfully, some factors are crucial: (i) the number of directions M controlling the computational cost of the scheme; (ii) choosing suitable directions to construct the subspaces; (iii) solving the minimization problem (10) or (12) efficiently. Indeed, for choosing the number of directions M, there is a trade-off between the total computational cost per iteration and the amount of possible decrease in function values.
We here use MDSS as an accelerator of OSGA for solving problems involving costly linear operators. More precisely, we save some previously computed points, construct a subspace of the form (8) and apply MDSS to find a point x b in Line 8 of OSGA. This typically gives us a better point x b in Line 9 of OSGA. In the next subsection, we will show how the subspace S is constructed and how the subproblem (12) can be solved efficiently at a reasonable cost.
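The reason the subspace search needs no additional products with the costly operators can be illustrated with a small sketch. It assumes a single term $f(x) = f_1(Ax)$, stores the subspace basis U column by column, and keeps the products $v = Ax$ and $V = AU$ computed once (in the outer scheme these are saved rather than recomputed). All names are hypothetical. The reduced oracle then evaluates $t \mapsto f_1(v + Vt)$ and its subgradient $V^T g$ using only vectors of length m and M.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(A, x):
    return [dot(row, x) for row in A]


def combine(cols, t, base):
    """Return base + sum_j t_j * cols[j]."""
    out = list(base)
    for tj, col in zip(t, cols):
        out = [o + tj * c for o, c in zip(out, col)]
    return out


def reduced_oracle(t, v, V_cols, f, subgrad):
    """Value and subgradient (with respect to t) of t -> f(v + V t).

    No product with the full operator A is needed here.
    """
    z = combine(V_cols, t, v)
    g = subgrad(z)
    return f(z), [dot(col, g) for col in V_cols]


def sign(s):
    return (s > 0) - (s < 0)


# Illustrative data: f_1 is the l1-norm, U has two columns.
A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
x = [1.0, -1.0]
U_cols = [[1.0, 0.0], [1.0, 1.0]]
f1 = lambda z: sum(abs(zi) for zi in z)
g1 = lambda z: [sign(zi) for zi in z]

v = matvec(A, x)                        # computed once in the outer scheme
V_cols = [matvec(A, d) for d in U_cols]  # columns of V = A U, also saved

t = [2.0, -1.0]
f_red, g_red = reduced_oracle(t, v, V_cols, f1, g1)

# Full-dimensional check (uses A again; only for verification).
x_new = combine(U_cols, t, x)
f_full = f1(matvec(A, x_new))
```

Here $x + Ut = (2, -2)$ and $A(x + Ut) = (-2, -2, -2)$, so both evaluations give the value 6, and the reduced subgradient $(-9, -21)$ coincides with $U^T A^T g$ for $g = (-1, -1, -1)$.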

Solving the Auxiliary Problem (12) by OSGA
In this section we show how one can construct a suitable subspace of the form (11) and how to solve the auxiliary problem (12) with OSGA. Without loss of generality, we here assume $V = \mathbb{R}^n$ and $U_i = \mathbb{R}^m$ for i = 1, . . . , p; $U_{:j}$ and $(V_i)_{:j}$ denote the jth column of the matrices U and $V_i$, respectively. Let us consider a variant of OSGA using the multidimensional subspace search technique as follows:
Algorithm 3: OSGA-S (optimal subgradient algorithm with subspace search) Input: global parameters: δ, $\alpha_{\max} \in\ ]0, 1[$, $0 < \kappa' \le \kappa$; local parameters: ...
In OSGA-S, if the number of iterations is less than M, we save the points $x$ and $x'$ and the related vectors $v_i = A_i x$ and $v'_i = A_i x'$, i = 1, . . . , p. These points and the best point found so far ($x_b$) are used to construct the subspace (13). If the number of iterations is larger than or equal to M, we use the subspace (13) and solve a subspace problem of the form (12) with $t \in \mathbb{R}^{2M+1}$, i.e., problem (14). This possibly leads us to a better point $x_b$ than that provided in Line 8 of OSGA. Note that if the number of iterations is greater than or equal to M, OSGA-S is a two-stage algorithm, where the outer stage is OSGA and the inner stage is a subspace search in Line 15. In the next result we show that the inner stage is cheap and that the point it returns is an admissible choice in Line 8 of OSGA.
Theorem 1 Let the points generated by the former iterations of OSGA-S be used to construct the subspace (13). Then each step of OSGA applied to (14) in MDSS needs 4pm(2M + 1) operations.
Proof In step k of OSGA-S, the columns of the matrix $U_k$ in (15) are the stored points, and $V_{ik} = A_i U_k$ for i = 1, . . . , p, so that the columns of $V_{ik}$ are the vectors $A_i x$ already computed for these points. This means that the construction of $V_{ik}$, i = 1, . . . , p, has no extra cost if these vectors have been saved in the outer scheme of OSGA-S. We now compute the first-order oracle at t by
$$f(U_k t) = \sum_{i=1}^{p} f_i(V_{ik} t), \qquad \partial_t f(U_k t) = \sum_{i=1}^{p} V_{ik}^* \,\partial f_i(V_{ik} t).$$
Computing each of $V_{ik} t$ and $V_{ik}^* \partial f_i(V_{ik} t)$, i = 1, . . . , p, needs m(2M + 1) operations. Therefore, apart from the cost of the nonlinear terms, we need 2pm(2M + 1) operations in each call of the first-order oracle for the problem (14). Since OSGA requires two calls of the first-order oracle in each iteration, we need 4pm(2M + 1) operations in each iteration of MDSS.
By (15), we have $x_k, x'_k, x_b \in S$, leading to
$$\min_{x \in S} f(x) \le \min\{f(x_k), f(x'_k), f(x_b)\}. \tag{18}$$
Let $t^* \in \mathbb{R}^{2M+1}$ be the minimizer of the subspace problem (14) associated with the subspace (13). By (18) and setting $x_b = U_k t^*$, we can write
$$f(U_k t^*) = \min_{x \in S} f(x) \le \min\{f(x_k), f(x'_k), f(x_b)\},$$
giving the result.
Note that if $U_k$ and $V_{ik}$, for i = 1, . . . , p, are collected in the outer stage of OSGA-S, then no extra effort for computing them is needed in applying the subspace search scheme MDSS (see Lines 5, 8, 10, and 17 of OSGA-S). Let us assume m ≈ n. Then Theorem 1 implies that each step of OSGA for solving (14) needs O(n) operations. Therefore, applying n steps of OSGA to (14) has the same complexity as one call of the oracle for the full-dimensional problem by the outer scheme. Since we suppose that n is a large number, the cost of applying $n_0$ ($n_0 \ll n$) steps of OSGA to (14) can be ignored in comparison to the cost of a single call of the first-order oracle in the full dimension. Hence MDSS can be applied efficiently to accelerate OSGA without imposing too much computational cost for large-scale objectives involving expensive linear operators and cheap nonlinear terms.
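As a rough worked example of this cost comparison, the sketch below evaluates the count 4pm(2M + 1) from Theorem 1 against the cost of one full-dimensional oracle call. The full-oracle count 2pmn is an assumption made here for dense m × n operators (p forward and p adjoint products of mn operations each), following the same counting convention as for $V_{ik} t$; the values p = 1 and M = 2 are chosen only for illustration, while m = 50000 and n = 5000 are the sizes used in the experiments on overdetermined systems.

```python
def inner_step_cost(p, m, M):
    """Operations per OSGA step on the subspace problem: 4 p m (2M + 1)."""
    return 4 * p * m * (2 * M + 1)


def full_oracle_cost(p, m, n):
    """Assumed count for one full oracle call with dense m-by-n operators:
    p forward and p adjoint products, m * n operations each."""
    return 2 * p * m * n


p, m, n, M = 1, 50000, 5000, 2
inner = inner_step_cost(p, m, M)     # 1,000,000 operations
full = full_oracle_cost(p, m, n)     # 500,000,000 operations
steps_per_oracle_call = full // inner
```

Under these assumptions, about 500 inner steps cost as much as a single oracle call in the full dimension.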
Theorem 1 implies that OSGA-S is a special case of OSGA obtained by specializing the choice of Line 8 in OSGA. Therefore, all theoretical features of OSGA remain valid. In particular, OSGA-S is optimal for smooth problems with Lipschitz continuous gradients, Lipschitz continuous nonsmooth problems, and strongly convex problems. We summarize this result in the next theorem that was proved in [42].
We compare OSGA-S with OSGA, SGA-1 (a non-summable diminishing steplength subgradient algorithm, cf. [14]), SGA-2 (a non-summable diminishing step-size subgradient algorithm, cf. [14]), and NESUN (Nesterov's universal gradient method, cf. [41]). In our implementation, SGA-1 and SGA-2 use the following step-sizes

Overdetermined Linear System of Equations
Consider the overdetermined linear system of equations
$$y = A x + \nu, \tag{19}$$
where $x \in \mathbb{R}^n$ is an unknown vector, $A \in \mathbb{R}^{m \times n}$ with m > n, $y \in \mathbb{R}^m$ is an observation vector, and $\nu \in \mathbb{R}^m$ is an unknown but small additive noise. The objective is to recover x from y by solving (19). Such problems appear in many applications, see, e.g., [9,10,16]. They are of particular interest for robust fitting of linear models to data. In practice, this problem is typically ill-posed, cf. [43]. Therefore, x is usually computed by a minimization problem of the form (1) with one of the objective functions of Table 1.
Here, we use m = 50000 and n = 5000. Since some of the problems given in Table 1 involve regularization terms for which the NESUN subproblem cannot be solved efficiently (e.g., $\|\cdot\|_\infty$), we will not consider NESUN in this comparison. We therefore use SGA-1, SGA-2, OSGA, and OSGA-S for solving this overdetermined system of equations. We set $\alpha_0 = 8 \times 10^{-1}$ for SGA-1, use $\alpha_0 = 10^{-4}$ for SGA-2 when it is applied to the problems L22R, L22L22R, L22L1R, L1R, L1L22R, and L1L1R, and use $\alpha_0 = 2 \times 10^{-2}$ for SGA-2 when it is applied to the problems L2R, L2L22R, L2L1R, LIR, LIL22R, and LIL1R. Note that SGA-2 is very sensitive to the parameter $\alpha_0$ for different problems, so we tuned $\alpha_0$ to attain the best performance of SGA-2 for the considered set of problems. For all problems of Table 1, we set λ = 1.
Table 1 List of minimization problems for solving overdetermined systems of equations, where λ denotes the regularization parameter (columns: Function, Name). The objective functions are convex and contain a linear mapping A, which is typically a dense matrix, and y is defined by (19).
We first conduct an experiment on the parameter M to find a suitable range for this parameter. To this end, we consider the problems of Table 1, solve each problem by OSGA in 100 iterations, save the best function value $f_s$ in each case, and run OSGA-S with M = 1, 2, . . . , 20 to achieve $f_s$. The results of our experiment are summarized in Table 2 and Figs. 1 and 2. In Table 2, the best parameter $M_{best}$ for each problem regarding the best number of iterations and the best running time, along with the results for $M_{best}$, M = 2, and M = 20, is reported. Figures 1 and 2 display the number of iterations N(M) and the running time T(M), respectively, for M = 1, 2, . . . , 20. From the results of Table 2 and Figs. 1 and 2, it can be seen that $M_{best}$ varies over the considered problems; however, the interval [1, 5] seems to be a reasonable choice for the parameter M. In addition, it is clear that the performance of OSGA-S depends on the parameter M, but if we set M ∈ [1, 5], OSGA-S outperforms OSGA except for L1R, L1L22R, and L1L1R (see subfigures (g), (h), and (i) of Fig. 2).
We now solve the problems reported in Table 1 by SGA-1, SGA-2, OSGA, and OSGA-S, where we first solve these problems by OSGA in 100 iterations, save the best function value $f_s$, and stop the other algorithms whenever they attain a function value less than or equal to $f_s$ or the number of iterations reaches the maximum number of iterations, which is 500 here. We set M = 2 for OSGA-S. The results are summarized in Table 3 and Fig. 3.
In Table 3, N and T denote the number of iterations and the running time, respectively. The results of Table 3 show that OSGA and OSGA-S outperform SGA-1 and SGA-2 significantly regarding both the number of iterations and the running time; moreover, OSGA-S needs fewer iterations and less running time than OSGA. In Fig. 3, we illustrate the relative error of function values versus iterations, i.e.,
$$\delta_k := \frac{f_k - \hat f}{f_0 - \hat f}, \tag{20}$$
where $f_0$, $f_k$, and $\hat f$ denote the function values at a starting point $x_0$, the current point $x_k$, and the minimizer $\hat x$, respectively. The results of Fig. 3 show that in many cases OSGA-S attains the same accuracy in fewer iterations and less running time; however, for some cases such as L1R, L22L1R, and L1L1R the difference between the number of iterations of OSGA-S and OSGA is not significant and worse running times are attained by OSGA-S. Moreover, the results of OSGA-S are much better than those of SGA-1, SGA-2, and OSGA for LIR and LIL1R, which might be because of the poor sparse subgradients of the infinity norm $\|\cdot\|_\infty$ used by SGA-1, SGA-2, and OSGA, while the subspace minimization step of OSGA-S involves a combination of several former points, resulting in better directions.
Fig. 1 ... Table 1, where N(M) denotes the total number of iterations
Fig. 2 ... Table 1, where T(M) denotes the running time
Fig. 3 The relative error of function values $\delta_k$ against iterations for SGA-1, SGA-2, OSGA, and OSGA-S for solving overdetermined systems of equations using the minimization problems presented in Table 1

Support Vector Machines
Learning with support vector machines (SVM) leads to several expensive convex optimization problems with large dense data sets. Some of these problems have the form considered in this paper. Let us consider a binary classification, where a set of training data $(x_1, y_1), \dots, (x_q, y_q)$ with $x_i \in \mathbb{R}^n$ and $y_i \in \{-1, 1\}$ for i = 1, . . . , q is given. The aim is to find a classification rule using the training data, so that for a new point x one can assign a class y ∈ {−1, 1} to x by the derived classification rule. The classification rule for SVM is given by the sign of $\langle x, w\rangle + w_0$, where w and $w_0$ may be determined by solving a penalized problem
$$\min_{w,\, w_0}\ \sum_{i=1}^{q} \big[1 - y_i(\langle x_i, w\rangle + w_0)\big]_+ + \lambda\, \psi(w), \tag{21}$$
where $[z]_+ = \max\{z, 0\}$, and ψ can be $\|\cdot\|_1$ (SVML1R), $\|\cdot\|_2^2$ (SVML22R), and $\tfrac12\|\cdot\|_2^2 + \|\cdot\|_1$ (SVML22L1R) (see, e.g., [13,48,55] and references therein). For $\langle x, w\rangle = w^T x$, let us define the vector $\tilde w := (w^T, w_0)^T$ and the matrix A with rows $y_i (x_i^T, 1)$, i = 1, . . . , q. Then the problem (21) can be rewritten in the form
$$\min_{\tilde w}\ \big\langle \mathbf{1}, [\mathbf{1} - A \tilde w]_+ \big\rangle + \lambda\, \psi(w), \tag{22}$$
where $[\mathbf{1} - A\tilde w]_+ = \max\{\mathbf{1} - A\tilde w, 0\}$ is taken componentwise and $\mathbf{1} \in \mathbb{R}^q$ is the vector of all ones. Typically A is a dense matrix constructed from the data points $x_i$ and $y_i$ for i = 1, . . . , q. It is clear that (22) is of the form (1), where an associated subgradient g is given by
$$g = -A^T s + \lambda\, (g_\psi^T, 0)^T, \qquad s_i = \begin{cases} 1 & \text{if } (A\tilde w)_i < 1, \\ 0 & \text{otherwise}, \end{cases} \qquad g_\psi \in \partial \psi(w).$$
In order to show the benefit of our subspace technique for this kind of problem, we apply SVML1R, SVML22R, and SVML22L1R to the leukemia data given by Golub et al. in [24], available in [25]. This dataset comes from a study of gene expression in two types of acute leukemias (acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL)), and it consists of 38 training data points and 34 test data points. We apply SVML1R, SVML22R, and SVML22L1R to the training data points (q = 38 and n = 7129) with six levels of regularization parameters. We first solve the problems by OSGA in 1000 iterations and save the best function value $f_s$ in each case. We then run SGA-1, SGA-2, NESUN, OSGA, and OSGA-S, which are stopped after 5000 iterations or after achieving a function value at least as good as $f_s$.
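To make the penalized hinge-loss objective concrete, the following sketch evaluates the function value and one subgradient for the $\ell_1$-regularized case, written as $\sum_i [1 - y_i(\langle x_i, w\rangle + w_0)]_+ + \lambda\|w\|_1$, together with the sign-based classification rule. The exact scaling of the loss term, the treatment of the offset $w_0$, and all function names are assumptions made for illustration; they are not taken from the paper's software.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sign(t):
    return (t > 0) - (t < 0)


def hinge_l1_oracle(w, w0, X, y, lam):
    """Value and one subgradient (in w and w0) of the assumed objective
    sum_i [1 - y_i (<x_i, w> + w0)]_+ + lam * ||w||_1."""
    val = 0.0
    gw = [0.0] * len(w)
    gw0 = 0.0
    for xi, yi in zip(X, y):
        margin = 1.0 - yi * (dot(xi, w) + w0)
        if margin > 0.0:
            # Active hinge term contributes -y_i * (x_i, 1) to the subgradient.
            val += margin
            gw = [g - yi * xij for g, xij in zip(gw, xi)]
            gw0 -= yi
    val += lam * sum(abs(wj) for wj in w)
    gw = [g + lam * sign(wj) for g, wj in zip(gw, w)]
    return val, gw, gw0


def predict(x, w, w0):
    """Classification rule: the sign of <x, w> + w0 (ties assigned to +1)."""
    return 1 if dot(x, w) + w0 >= 0 else -1


# Tiny illustrative data set (not the leukemia data).
X = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
y = [1, 1, -1]
val, gw, gw0 = hinge_l1_oracle([0.5, 0.0], 0.0, X, y, 1.0)
labels = [predict(xi, [0.5, 0.0], 0.0) for xi in X]
```

For this data the three margins are 0.5, 1, and 0.5, so the hinge part equals 2 and the penalty adds 0.5; every hinge term is active and contributes to the subgradient.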
The associated results are summarized in Table 4 and Figs. 4 and 5.
The results of Table 4 show that OSGA and NESUN are comparable but better than SGA-1 and SGA-2, and OSGA-S outperforms all others significantly with respect to the number of iterations (N) and the running time (T) (the best average is given by OSGA-S). In Figs. 4 and 5, we illustrate the function values versus iterations, indicating that OSGA-S needs few iterations (typically fewer than 35) to reach the accuracy that OSGA attains in 1000 iterations and that SGA-1 and SGA-2 reach in a few thousand iterations; for NESUN, this number of iterations varies from about 100 to a few thousand for different problems. This shows a good potential of OSGA-S to be applied to machine learning problems.
We now consider the accuracy (the ratio of the number of correctly predicted data labels to the total number of data, multiplied by 100) of OSGA-S for solving (22) after various numbers of iterations and for several levels of the regularization parameter. The results are summarized in Table 5. From the results of this table, it is clear that in many cases the accuracy of OSGA-S is increased by considering a larger number of iterations; however, it produces acceptable results after 50 iterations. In addition, it can be seen that the regularization parameter plays a crucial role in the accuracy of OSGA-S. Among state-of-the-art SVM solvers, we here compare the accuracy of OSGA-S with LIBSVM [17], FITCSVM (MATLAB internal function), PEGASOS [47], and SVMperf [29] with their default parameters to solve (22). In our implementation, LIBSVM, FITCSVM, SVMperf, and PEGASOS attain the accuracies 65.27, 79.17, 79.17, and 69.44, respectively. A comparison of these accuracies with those reported in Table 5 shows that the accuracy of OSGA-S is comparable to or even better than that of LIBSVM, FITCSVM, SVMperf, and PEGASOS for the considered data set. Since the number of training and testing data for the leukemia data set [24,25] is small, we consider a comparison among OSGA-S and LIBSVM, FITCSVM, SVMperf, and PEGASOS for the w1a-w8a data sets [45]. After the training procedure, we consider a concatenation of the training and testing data and apply the derived classification functions, where the obtained accuracy for each solver is reported in Table 6. In this table, OSGA-S-1, OSGA-S-2, and OSGA-S-3 stand for OSGA-S for the problems SVML1R, SVML22R, and SVML22L1R, where we tune the regularization parameter to get the best performance of OSGA-S (see the numbers in parentheses in the last three columns of Table 6) and stop OSGA-S after 50 iterations.
Table 4 ... N, T, and $f_b$ denote the number of the function values, the running time, and the best function value achieved by the associated algorithm. In each problem, the best iteration, time (in seconds), and function value are displayed in bold
The results of Table 6 show that the accuracy of OSGA-S is almost comparable with those of state-of-the-art solvers LIBSVM, FITCSVM, SVMperf, and PEGASOS.

Table 5 The accuracy of OSGA-S for several levels of the regularization parameters after various numbers of iterations for solving SVML1R, SVML22R, and SVML22L1R
Table 6 The accuracy of LIBSVM, FITCSVM, SVMperf, PEGASOS, OSGA-S-1, OSGA-S-2, OSGA-S-3 for solving the problem (22). The number of features for all data is 300, and TrD and TeD denote the number of training and testing data, respectively.

Conclusions
In this paper we give an iterative scheme for solving convex optimization problems involving costly linear operators with cheap nonlinear terms. More precisely, we combine OSGA with a multidimensional subspace search, which leads to a sequence of low-dimensional subproblems that can be solved efficiently by OSGA. Numerical results for overdetermined systems of equations and support vector machines show the efficiency of the proposed scheme.