Empirical risk minimization: probabilistic complexity and stepsize strategy

Empirical risk minimization is a special class of problems in standard convex optimization. When a first order method is used, the Lipschitz constant of the empirical risk plays a crucial role in the convergence analysis and stepsize strategies for these problems. We derive probabilistic bounds for such Lipschitz constants using random matrix theory. We show that, on average, the Lipschitz constant is bounded by the ratio of the dimension of the problem to the amount of training data. We use our results to develop a new stepsize strategy for first order methods. The proposed algorithm, the Probabilistic Upper-bound Guided stepsize strategy (PUG), outperforms regular stepsize strategies and comes with a strong theoretical guarantee on its performance.


Introduction
Empirical risk minimization (ERM) is one of the most powerful tools in applied statistics, and is regarded as the canonical approach to regression analysis. In the context of machine learning and big data analytics, various important problems such as support vector machines, (regularized) linear regression, and logistic regression can be cast as ERM problems; see e.g. [17]. In an ERM problem, a training set with m instances, {(a_i, b_i)}_{i=1}^m, is given, where a_i ∈ R^n is an input and b_i ∈ R is the corresponding output, for i = 1, 2, . . . , m. The ERM problem is then defined as

min_{x ∈ R^n} (1/m) Σ_{i=1}^m φ_i(a_i^T x) + g(x),   (1)

where each loss function φ_i is convex with a Lipschitz continuous gradient, and the regularizer g : R^n → R is a continuous convex function which is possibly nonsmooth. Two common loss functions are
- Quadratic loss function: φ_i(x) = (x − b_i)^2 / 2;
- Logistic loss function: φ_i(x) = log(1 + exp(−x b_i)).
(The work of the authors was supported, in part, by EPSRC Grants EP/M028240/1 and EP/K040723.)
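As a minimal numerical illustration (a sketch, not the authors' code; the function name and interface are ours, and the quadratic loss is taken as φ_i(z) = (z − b_i)^2/2 consistent with γ = 1 later in the text), the empirical risk (1/m) Σ_i φ_i(a_i^T x) for both losses can be evaluated as:

```python
import numpy as np

def empirical_risk(A, b, x, loss="quadratic"):
    """Empirical risk f(x) = (1/m) * sum_i phi_i(a_i^T x) for two common losses.

    quadratic: phi_i(z) = (z - b_i)^2 / 2   (an assumption of this sketch)
    logistic:  phi_i(z) = log(1 + exp(-z * b_i))
    """
    z = A @ x                      # z_i = a_i^T x
    m = A.shape[0]
    if loss == "quadratic":
        return np.sum((z - b) ** 2) / (2 * m)
    elif loss == "logistic":
        # np.logaddexp(0, t) = log(1 + exp(t)), numerically stable
        return np.sum(np.logaddexp(0.0, -z * b)) / m
    raise ValueError(loss)
```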
One important example of g is the scaled ℓ1-norm ω‖x‖_1 with a scaling factor ω ∈ R_+. This particular case is known as ℓ1 regularization, and it has various applications in statistics [3], machine learning [18], signal processing [6], etc. The regularizer g acts as an extra penalty function to regularize the solution of (1). ℓ1 regularization encourages sparse solutions, i.e. it favors solutions x with few non-zero elements. This phenomenon can be explained by the fact that the ℓ1 norm is the tightest convex relaxation of the ℓ0 norm, i.e. the cardinality of the non-zero elements of x [5].
In general, if the regularizer g is nonsmooth, subgradient methods are used to solve (1). However, subgradient methods are not advisable if g is simple, and one can achieve higher efficiency by generalizing existing algorithms for unconstrained differentiable convex programs. Much research has been undertaken to efficiently solve ERM problems with simple g's. Instead of assuming the objective function is smooth and continuously differentiable, these works aim to solve problems of the following form

min_{x ∈ R^n} f(x) + g(x),   (2)

where f : R^n → R is a convex function with an L-Lipschitz continuous gradient, and g : R^n → R is a continuous convex function which is nonsmooth but simple. By simple we mean that a proximal projection step can be performed either in closed form or at least computationally inexpensively. Norms, and the ℓ1 norm in particular, satisfy this property. A function f is said to have an L-Lipschitz continuous gradient if

‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖, for all x, y ∈ R^n.   (3)

For the purpose of this paper, we denote by A ∈ R^{m×n} the dataset matrix whose i-th row is a_i^T, and so in the case of ERM problems,

f(x) = (1/m) Σ_{i=1}^m φ_i(e_i^T A x),   (4)

where e_i ∈ R^m has 1 in its i-th component and 0's elsewhere, so that e_i^T A = a_i^T. f is called the empirical risk in ERM. We assume that each φ_i has a γ_i-Lipschitz continuous gradient, and we let γ = max{γ_1, γ_2, . . . , γ_m}.
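For g = ω‖x‖_1, the proximal projection step mentioned above has a well-known closed form, soft-thresholding; a minimal sketch (the function name is ours):

```python
import numpy as np

def prox_l1(v, t):
    """Proximal operator of t * ||.||_1 at v (soft-thresholding):
    argmin_x { t * ||x||_1 + (1/2) * ||x - v||^2 }."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
```

Applied after a gradient step, this shrinks small components exactly to zero, which is the mechanism behind the sparsity-inducing behavior discussed above.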
Many algorithms [1,4,9,13,20] have been developed to solve (1) and (2). One famous example is the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) [1], which is a generalization of the optimal method proposed by Nesterov [10] for unconstrained differentiable convex programs. FISTA, with the backtracking stepsize strategy, is known to converge at the following rate,

F(x_k) − F(x*) ≤ 2ηL‖x_0 − x*‖^2 / (k + 1)^2,   (5)

where F ≜ f + g, x* is a solution of (2), and η is the parameter used in the backtracking stepsize strategy. The convergence result in (5) contains three key components: the distance between the initial guess and the solution ‖x_0 − x*‖, the number of iterations k, and the Lipschitz constant L. While it is clear that the first two components are important in explaining the convergence behavior, the Lipschitz constant, L, is relatively mysterious.
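For concreteness, the following is a minimal sketch of FISTA with backtracking in the spirit of [1] (not the authors' implementation; the names and defaults are ours): in each iteration the trial constant L̃ is increased by the factor η until the local Lipschitz condition holds.

```python
import numpy as np

def fista(f, grad_f, prox_g, x0, Ltil0=1.0, eta=2.0, iters=100):
    """Minimal FISTA with backtracking. prox_g(v, t) is the proximal
    operator of t * g; f and grad_f are the smooth part and its gradient."""
    x, y, t, Ltil = x0.copy(), x0.copy(), 1.0, Ltil0
    for _ in range(iters):
        while True:
            p = prox_g(y - grad_f(y) / Ltil, 1.0 / Ltil)
            # local condition: f(p) <= f(y) + <grad f(y), p - y> + Ltil/2 ||p - y||^2
            # (the g terms on both sides of the full condition cancel)
            if f(p) <= f(y) + grad_f(y) @ (p - y) + 0.5 * Ltil * np.sum((p - y) ** 2) + 1e-12:
                break
            Ltil *= eta          # backtrack: enlarge the trial Lipschitz estimate
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = p + (t - 1.0) / t_new * (p - x)   # momentum step
        x, t = p, t_new
    return x
```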
The appearance of L in (5) is due to algorithm design. In each iteration, one has to choose a constant L̃ to compute a stepsize that is proportional to 1/L̃, and L̃ has to be large enough to satisfy the properties of the Lipschitz constant locally [1,13]. Since the global Lipschitz constant condition (3) is a more restrictive condition, the Lipschitz constant L always satisfies the requirement on L̃, and so L is used in convergence analysis. We emphasize that the above requirement on L̃ is not unique to FISTA. For most first order methods that solve (2), L appears in their convergence rates for the same reason.
Despite L being an important quantity in both convergence analysis and stepsize strategies, it is usually unknown, and its magnitude can be arbitrary for a general nonlinear function: one could artificially construct a low-dimensional function with a large Lipschitz constant, and a high-dimensional function with a small Lipschitz constant. Therefore, L is often treated as a constant [10,11] that is independent of the dimensions of the problem, and so the convergence result shown in (5) is considered to be "dimension-free" because both ‖x_0 − x*‖ and k are independent of the dimension of the problem. Dimension-free convergence means that, for certain types of optimization algorithms, the number of iterations required to achieve a certain accuracy is independent of the dimension of the model. For large scale optimization models that appear in machine learning and big data applications, algorithms with dimension-free convergence are extremely attractive [1,2,16].
On the other hand, since L is considered to be an arbitrary constant, stepsize strategies for first order methods were developed independently of the knowledge of L. As we will show later, adaptive strategies that try to use a small L̃ (large stepsize) need extra function evaluations. If one tries to eliminate the extra function evaluations, then L̃ has to be sufficiently large, and thus the stepsize would be small. This trade-off is due to the fact that L is unknown.
In this paper, we take the first steps to show that knowledge of L can be obtained in the case of ERM because of its statistical properties. For the ERM problem, it is known that the Lipschitz constant is highly related to ‖A‖ [1,14], and so understanding the properties of ‖A‖ is the goal of this paper. If A were arbitrary, then ‖A‖ would also be arbitrary and analyzing ‖A‖ would be impossible. However, for ERM problems that appear in practice, A is structured. Since A is typically constructed from a dataset, it is natural to assume that the rows of A are independent samples of some random variables. This particular structure allows us to consider A as a non-arbitrary but random matrix. We are therefore justified in applying techniques from random matrix theory to derive statistical bounds for the Lipschitz constant.
The contributions of this paper are twofold: (a) We obtain average/probabilistic complexity bounds which provide a better understanding of how the dimension, the size of the training set, and the correlation affect the computational complexity. In particular, we show that in the case of ERM, the complexity is not "dimension-free". (b) The derived statistical bounds can be computed/estimated at almost no cost, which is an attractive benefit for algorithms. We develop a novel stepsize strategy called the Probabilistic Upper-bound Guided stepsize strategy (PUG). We show that PUG may save the unnecessary cost of function evaluations by choosing L̃ adaptively and intelligently. Promising numerical results are provided at the end of this paper.
Much research on bounding extreme singular values using random matrix theory has been undertaken in recent years; see e.g. [8,15,19]. However, we would like to emphasize that developments in random matrix theory are not our objective. Instead, we consider this topic as a new and important application of random matrix theory. To the best of our knowledge, no similar work has been done on understanding how the statistics of the training set affect the Lipschitz constant, computational complexity, and stepsize.

Preliminaries
This paper studies the Lipschitz constant L of the empirical risk f given in (4). In order to satisfy condition (3), one could select an arbitrarily large L; however, this would create a looser bound on the complexity (see e.g. (5)). Moreover, L also plays a big role in stepsize strategies for first order algorithms. In many cases, such as FISTA, algorithms use a stepsize that is proportional to 1/L. Therefore, a smaller L is always preferable: it not only implies lower computational complexity, but also allows a larger stepsize. While the smallest L that satisfies (3) is generally very difficult to compute, in this section we estimate upper and lower bounds of L using the dataset A.
Notice that the Lipschitz constant condition (3) is equivalent to the following condition:

f(y) ≤ f(x) + ∇f(x)^T (y − x) + (L/2)‖y − x‖^2, for all x, y ∈ R^n.   (6)
Proposition 1 Suppose f is of the form (4). Then L satisfies the Lipschitz constant condition (6) with

L = (γ/m) ‖A‖^2 = (γ/m) λ_max(A^T A).

Proof See Proposition 2.1 in [14].
Proposition 1 provides an upper bound for L, where γ is the maximum Lipschitz constant of the loss functions, and it is usually known or easy to compute. For example, it is known that γ = 1 for quadratic loss functions, and γ = max_i b_i^2/4 for logistic loss functions.
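The bound of Proposition 1 can be computed directly with a spectral norm; a small sketch (the helper name is ours):

```python
import numpy as np

def lipschitz_upper_bound(A, gamma):
    """Proposition 1: L <= (gamma / m) * ||A||^2, where ||A||^2 = lambda_max(A^T A)
    is the squared spectral norm of the data matrix."""
    m = A.shape[0]
    return gamma / m * np.linalg.norm(A, 2) ** 2
```

For the quadratic loss (γ = 1) this is exactly the Lipschitz constant of ∇f, since the Hessian of the least-squares empirical risk is (1/m) A^T A.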
The upper bound of L is tight for the class of ERM problems. We can see this by considering the example of least squares, where ∇f(x) − ∇f(y) = (1/m) A^T A (x − y), and so the smallest valid Lipschitz constant is exactly (1/m)‖A‖^2. In order to derive the lower bound of L, we need the following assumption.

Assumption 2
There exists a positive constant τ > 0 such that

(φ_i'(x) − φ_i'(y))(x − y) ≥ τ (x − y)^2, for all x, y ∈ R.

The above assumption requires the loss functions φ_i to be strongly convex, which is not restrictive in practical settings. In particular, the quadratic loss function satisfies Assumption 2, and the logistic loss function satisfies Assumption 2 within a bounded box [−b, b] for any positive b ∈ R_+. With the above assumption, we derive the lower bound of L using A.
Proposition 3 Suppose f is of the form (4) with φ_i satisfying Assumption 2 for i = 1, 2, . . . , m. Then any L satisfying the Lipschitz constant condition (6) satisfies

L ≥ (τ/m) λ_min(A^T A).

Proof By Assumption 2, for i = 1, 2, . . . , m,

(φ_i'(a_i^T x) − φ_i'(a_i^T y))(a_i^T x − a_i^T y) ≥ τ (a_i^T (x − y))^2.

Therefore,

(∇f(x) − ∇f(y))^T (x − y) = (1/m) Σ_{i=1}^m (φ_i'(a_i^T x) − φ_i'(a_i^T y)) a_i^T (x − y) ≥ (τ/m) ‖A(x − y)‖^2 ≥ (τ/m) λ_min(A^T A) ‖x − y‖^2,

and combining this with the Cauchy–Schwarz inequality applied to (3) gives the stated bound.

From Propositions 1 and 3, we bound L using the largest and smallest eigenvalues of A^T A. Even though A can be completely different for different datasets, the statistical properties of A can be obtained via random matrix theory.

Complexity analysis using random matrix theory
In this section, we will study the statistical properties of ‖A‖^2 = λ_max(A^T A). Recall that A is an m × n matrix containing m observations, and each observation contains n measurements; i.e. we assume the rows of the matrix A are independent samples of a vector of n random variables ξ^T = (ξ_1, ξ_2, . . . , ξ_n) with covariance matrix Σ = E[ξξ^T]. To simplify the analysis, we assume, without loss of generality, that the observations are normalized, so that all the random variables have mean zero and unit variance. Therefore, E[ξ_i] = 0 for i = 1, 2, . . . , n, and the diagonal elements of Σ are all 1's. This assumption simplifies the arguments and the analysis of this section, but it is not necessary. The results of this section could be generalized without it, but doing so gives no further insight for our purposes. In particular, this assumption will be dropped for the proposed stepsize strategy PUG, and so PUG is valid for all datasets used in practice.

Statistical bounds
We will derive both upper and lower bounds for the average ‖A‖^2, and show that the average ‖A‖^2 increases nearly linearly in both m and n. The main tools for the proofs below can be found in [19].

Lower bounds
The following lemma follows from Jensen's inequality and plays a fundamental role in what follows.

Lemma 4 For a sequence {Q_k}_{k=1}^m of independent, random, positive semi-definite matrices,

E[λ_max(Σ_{k=1}^m Q_k)] ≥ λ_max(Σ_{k=1}^m E[Q_k]).
Proof For details, see [19].
With Lemma 4, we can derive a lower bound on the expected ‖A^T A‖. We will start by proving the lower bound in the general setting, where the random variables are correlated with a general covariance matrix Σ; then, we will add assumptions on Σ to derive lower bounds in different cases.

Theorem 5 Let A be an m × n random matrix whose rows are independent samples of some random variables ξ^T = (ξ_1, ξ_2, . . . , ξ_n) with covariance matrix Σ = E[ξξ^T], and denote μ_max = λ_max(Σ). Then

E‖A^T A‖ ≥ m μ_max.   (7)

In particular, if ξ_1, ξ_2, . . . , ξ_n are random variables with zero mean and unit variance, then

E‖A^T A‖ ≥ max{m μ_max, n}.   (8)

Proof We first prove (7). Denote by a_i^T the i-th row of A. We can rewrite A^T A as

A^T A = Σ_{k=1}^m a_k a_k^T,

where the a_k a_k^T's are independent random matrices with E[a_k a_k^T] = Σ. Therefore, by Lemma 4,

E‖A^T A‖ = E[λ_max(Σ_{k=1}^m a_k a_k^T)] ≥ λ_max(Σ_{k=1}^m E[a_k a_k^T]) = λ_max(mΣ) = m μ_max.

In order to prove (8), we use the fact that

‖A^T A‖ = ‖A A^T‖ and E[λ_max(A A^T)] ≥ λ_max(E[A A^T]),

where the last inequality is obtained by applying Jensen's inequality. The (i, j)-th entry of A A^T is a_i^T a_j. By the assumption that each entry of A is a random variable with zero mean and unit variance, we obtain E[a_i^T a_i] = n for i = 1, 2, . . . , m, and for i ≠ j, E[a_i^T a_j] = E[a_i]^T E[a_j] = 0, since the rows are independent. Therefore, E[A A^T] = n I_m and E‖A^T A‖ ≥ λ_max(n I_m) = n.

Theorem 5 provides a lower bound on the expected ‖A^T A‖. The inequality in (7) is a general result and makes minimal assumptions on the covariance Σ. Note that this lower bound is independent of n. The reason is that the general setting covers cases where Σ is not full rank: some ξ_i's could be fixed 0's instead of having unit variance. In fact, when all ξ_i's are 0's for i = 1, 2, . . . , n, which implies Σ = 0_{n×n}, the bound (7) is tight because A = 0_{m×n}. For the setting that we consider in this paper, Eq. (8) is a tighter bound than (7) and depends on both m and n. In the case where all variables are independent, we can simplify the results above into the following.
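The lower bound in this setting can be checked by a quick simulation; a sketch with i.i.d. standard normal entries (so Σ = I_n and μ_max = 1; the setup is ours):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, trials = 50, 20, 200
# Monte Carlo estimate of E||A^T A|| = E[lambda_max(A^T A)] = E[||A||_2^2]
avg = np.mean([np.linalg.norm(rng.standard_normal((m, n)), 2) ** 2
               for _ in range(trials)])
# Theorem 5: E||A^T A|| >= max(m * mu_max, n) = max(m, n) here
assert avg >= max(m, n)
```

In fact for Gaussian matrices the average is markedly above this bound (roughly (√m + √n)^2), consistent with the bound being a lower bound only.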

Corollary 6 Let A be an m × n random matrix whose rows are independent samples of some independent random variables ξ_1, ξ_2, . . . , ξ_n with zero mean and unit variance. Then

E‖A^T A‖ ≥ max{m, n}.

Proof Since all the random variables are independent, Σ = I_n and so μ_max = λ_max(Σ) = 1.

Upper bounds
In order to compute an upper bound of the expected ‖A^T A‖, we first compute its tail bounds. The idea of the proof is to rewrite A^T A as a sum of independent random matrices, and then use existing results in random matrix theory to derive the tail bounds of ‖A^T A‖. We then compute the upper bound of the expected value. Notice that our approach for computing the tail bounds is, in principle, the same as in [19]. However, we present a tail bound that is easier to integrate into the upper bound of the expected ‖A^T A‖. That is, the derived bound can be directly used to bound E‖A^T A‖ without any numerical constant.
In order to compute the tail bounds, the following two Lemmas will be used.
Lemma 7 [19] Suppose that Q is a random positive semi-definite matrix that satisfies λ_max(Q) ≤ 1. Then, for any θ ∈ R,

E[exp(θQ)] ⪯ I + (e^θ − 1) E[Q],

where I is the identity matrix of the appropriate dimension.

Lemma 8 [19] Let {Q_k}_{k=1}^m be independent, random, positive semi-definite n × n matrices. Then, for any θ, t ∈ R_+,

P(λ_max(Σ_{k=1}^m Q_k) ≥ t) ≤ e^{−θt} tr exp(Σ_{k=1}^m log E[exp(θ Q_k)]).
Combining the two results from random matrix theory, we can derive the following theorem for the tail bound of ‖A^T A‖.

Theorem 9
Let A be an m × n random matrix whose rows are independent samples of some random variables ξ^T = (ξ_1, ξ_2, . . . , ξ_n) with covariance matrix Σ = E[ξξ^T]. Denote μ_max = λ_max(Σ) and suppose that, almost surely,

λ_max(a_i a_i^T) = ‖a_i‖^2 ≤ R, for i = 1, 2, . . . , m.   (11)

Then, for any θ, t ∈ R_+,

P(λ_max(A^T A) ≥ t) ≤ n exp(−θt + m μ_max (e^{θR} − 1)/R).   (12)

In particular,

P(λ_max(A^T A) ≥ t) ≤ n exp((t − m μ_max)/R) (m μ_max / t)^{t/R}.   (13)

Proof Denote by a_i^T the i-th row of A. We can rewrite A^T A as A^T A = Σ_{k=1}^m a_k a_k^T. Notice that the a_k a_k^T's are independent, random, positive semi-definite matrices, and E[a_k a_k^T] = Σ for k = 1, 2, . . . , m. Also, λ_max(a_k a_k^T) ≤ R by (11). Using Lemma 8, for any θ > 0, we have

P(λ_max(A^T A) ≥ t) ≤ e^{−θt} tr exp(Σ_{k=1}^m log E[exp(θ a_k a_k^T)]).

Notice that λ_max(a_k a_k^T / R) ≤ 1, so by rescaling in Lemma 7 we have

E[exp(θ a_k a_k^T)] ⪯ I + ((e^{θR} − 1)/R) Σ ⪯ exp(((e^{θR} − 1)/R) Σ),

and thus

P(λ_max(A^T A) ≥ t) ≤ e^{−θt} tr exp(m ((e^{θR} − 1)/R) Σ) ≤ n exp(−θt + m μ_max (e^{θR} − 1)/R).

Using standard calculus, the upper bound is minimized when

θ = (1/R) log(t/(m μ_max)),

and substituting this θ into (12) gives (13).

For matrices which contain samples from unbounded random variables, assumption (11) in Theorem 9 might seem restrictive; however, in practice, assumption (11) is mild, due to the fact that the datasets used in problem (1) are usually normalized and bounded. Therefore, it is reasonable to assume that an observation will be discarded if its magnitude is larger than some constant.
The tail bound (13) is the tightest bound over all possible θ 's in (12), but it is difficult to interpret the relationships between the variables. The following corollary takes a less optimal θ in (12), but yields a bound that is easier to understand.

Corollary 10
In the same setting as Theorem 9, we have

P(λ_max(A^T A) ≥ t) ≤ n exp((2m μ_max − t)/R).   (14)

In particular, for ε ∈ R_+, we have

P(λ_max(A^T A) ≥ 2m μ_max + R log(n) − R log(ε)) ≤ ε.   (15)

Proof Using Eq. (12) with the suboptimal choice θ = 1/R, and the fact that e − 1 ≤ 2, we have

P(λ_max(A^T A) ≥ t) ≤ n exp(−t/R + m μ_max (e − 1)/R) ≤ n exp((2m μ_max − t)/R).

The bound in (15) follows directly from (14) by setting the right-hand side equal to ε, and shows that with high probability 1 − ε (for small ε), λ_max(A^T A) is less than 2m μ_max + R log(n) − R log(ε). Applying the result of Corollary 10 provides the upper bound of the expected ‖A^T A‖.

Corollary 11
In the same setting as Theorem 9, we have

E‖A^T A‖ ≤ 2m μ_max + R log(n) + R.   (16)

Proof Using Eq. (14), and the fact that

E[λ_max(A^T A)] = ∫_0^∞ P(λ_max(A^T A) ≥ t) dt,   (17)

we have, for any t_0 ≥ 0,

E[λ_max(A^T A)] ≤ t_0 + ∫_{t_0}^∞ n exp((2m μ_max − t)/R) dt = t_0 + nR exp((2m μ_max − t_0)/R).

Choosing t_0 = 2m μ_max + R log(n) yields (16). Therefore, for a matrix A which is constructed from a set of normalized data, we obtain the bound

E‖A‖^2 = E[λ_max(A^T A)] ≤ 2m μ_max + R log(n) + R.   (18)

The result in (18) might look confusing because, for small m and large n, the lower bound is of the order of n while the upper bound is of the order of log(n). The reason is that we have to take into account the factor of dimensionality in the constant R. To illustrate this, we prove the following corollary.

Corollary 12 Let A be an m × n random matrix whose rows are independent samples of a random vector ξ^T = (ξ_1, ξ_2, . . . , ξ_n) with zero mean, unit variance, and bounded entries |ξ_i| ≤ c. Then

R ≤ c^2 n,   (19)

and so

E‖A‖^2 ≤ 2m μ_max + c^2 n log(n) + c^2 n.   (20)

Proof Since ξξ^T is a symmetric rank 1 matrix, we have

λ_max(ξξ^T) = ‖ξ‖^2 = Σ_{i=1}^n ξ_i^2 ≤ c^2 n.

Therefore, R increases linearly in n for bounded ξ. Recall that the lower bound of the expected ‖A‖^2 is linear in both m and n, and the upper bound in (20) is almost linear in both m and n. Therefore, our results on the bounds for the expected Lipschitz constant are nearly optimal up to some constant.
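The upper and lower bounds together can be checked numerically; a sketch with Rademacher (±1) entries, for which Σ = I_n, μ_max = 1, and every row satisfies ‖a_i‖^2 = n, so R = n (the setup is ours):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, trials = 100, 30, 200
mu_max, R = 1.0, float(n)   # Rademacher rows: Sigma = I_n and ||a_i||^2 = n
avg = np.mean([np.linalg.norm(rng.choice([-1.0, 1.0], size=(m, n)), 2) ** 2
               for _ in range(trials)])
upper = 2 * m * mu_max + R * np.log(n) + R   # Corollary 11 upper bound
lower = max(m, n)                            # Corollary 6 lower bound
assert lower <= avg <= upper
```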
On the other hand, in order to obtain the lower bound of L, we also need a tail bound on λ_min(A^T A), which is provided in the following theorem.

Theorem 13 Let A be an m × n random matrix whose rows are independent samples of a random vector ξ^T = (ξ_1, ξ_2, . . . , ξ_n) with covariance matrix Σ and bounded entries |ξ_i| ≤ c, and denote μ_min = λ_min(Σ). Then, if μ_min ≠ 0, for any ε ∈ (n exp(−m μ_min/(2nc^2)), n],

P(λ_min(A^T A) ≤ m μ_min − √(2 c^2 n m μ_min log(n/ε))) ≤ ε.

Proof Using Theorem 1.1 from [19], for any θ ∈ (0, 1) we have

P(λ_min(A^T A) ≤ θ m μ_min) ≤ n exp(−(θ − 1)^2 m μ_min/(2R)).

Notice that θ > 0 and R ≤ c^2 n by Corollary 12. For μ_min ≠ 0, we let

ε = n exp(−(θ − 1)^2 m μ_min/(2R)),

and so

θ = 1 − √(2R log(n/ε)/(m μ_min)).

In particular, suppose ε ∈ (n exp(−m μ_min/(2R)), n]; then θ ∈ (0, 1). Therefore,

P(λ_min(A^T A) ≤ m μ_min − √(2R m μ_min log(n/ε))) ≤ ε,

and substituting R ≤ c^2 n gives the stated bound.

For the tail bound in Theorem 13 to be meaningful, m has to be sufficiently large compared to n. In such cases, the smallest eigenvalue λ_min(A^T A) is at least O(m − √(nm log n)) with high probability.

Complexity analysis
In this section, we will use the probabilistic bounds of L to study the complexity of solving ERM. We focus only on FISTA, for illustrative purposes and a clear presentation of the idea of the proposed approach; the approach developed in this section can be applied to other algorithms as well. By the assumption that A is a random matrix, the solution x* is also a random vector. Notice that the study of the randomness of x* is not covered in this paper. In particular, if the statistical properties of x* could be obtained, existing optimization algorithms might not be needed to solve the ERM problem. Therefore, in this paper, we remove this consideration by denoting a constant M such that ‖x_0 − x*‖^2 ≤ M. In that case, we have the FISTA convergence rate

F(x_k) − F(x*) ≤ 2ηLM/(k + 1)^2,   (21)

where x* is the solution of (1), and η > 1 is the parameter used in the backtracking stepsize strategy. Using Proposition 1 and Corollary 12, we know L = (γ/m)‖A‖^2 and E‖A‖^2 ≤ 2m μ_max + c^2 n log(n) + c^2 n. Thus, on average,

γ max{μ_max, n/m} ≤ (γ/m) E‖A‖^2   (22)

and

(γ/m) E‖A‖^2 ≤ 2γ μ_max + γ c^2 (n/m)(log(n) + 1).   (23)

In (22)–(23), the lower bound of (γ/m)E‖A‖^2 is linear in n/m, and the upper bound is nearly linear in n/m. This suggests that the average complexity of ERM is bounded by the ratio of the dimension to the amount of data. In particular, problems with overdetermined systems (m >> n) can be solved more efficiently than problems with underdetermined systems (m < n).
Another critical factor in the complexity is μ_max = λ_max(Σ), where Σ is the covariance matrix of the rows of A. In the ideal situation of regression analysis, all inputs should be statistically linearly independent. In such cases, since we assume the diagonal elements of Σ are 1's, μ_max = 1. It is, however, almost impossible to ensure this situation in practical applications. In practice, since Σ ∈ R^{n×n}, μ_max = λ_max(Σ) is likely to increase as n increases.
Similarly, we can compute the probabilistic lower bound of L in the case that m is sufficiently larger than n. Using Theorem 13 and Proposition 3, we can show that, with high probability, L is bounded below by the order of τ(μ_min − √((n log n)/m)).
We emphasize that the lower bound of L is not equivalent to the lower bound of the complexity. However, since the stepsize of first order methods is proportional to 1/L, this result indicates an upper bound on the stepsize that could potentially guarantee convergence.

PUG: Probabilistic Upper-bound Guided stepsize strategy
The tail bounds in Sect. 3, as a by-product of the upper bound in Sect. 3.1.2, can also be used in algorithms. As mentioned in the introduction, L is an important quantity in stepsize strategies since the stepsize is usually inversely proportional to L. However, in large scale optimization, the computational cost of evaluating ‖A‖^2 is very expensive. One could use backtracking techniques to avoid the evaluation of the Lipschitz constant; in each iteration, we find a large enough constant L̃ such that it satisfies the properties of the Lipschitz constant locally. In the case of FISTA [1], for the k-th iteration with incumbent x_k one has to find an L̃ such that

F(p_L̃(x_k)) ≤ Q_L̃(p_L̃(x_k), x_k),   (24)

where

Q_L̃(x, y) ≜ f(y) + (x − y)^T ∇f(y) + (L̃/2)‖x − y‖^2 + g(x),

and p_L̃(y) ≜ argmin_x {Q_L̃(x, y) : x ∈ R^n}. Equation (24) is the Lipschitz constant condition (6) restricted to the specific pair y = p_L̃(x_k) and x = x_k. Therefore, (24) is a less restrictive condition than the Lipschitz constant condition (6). This indicates that L̃ could be much smaller than L, and so it yields a larger stepsize. On the other hand, for L̃ ≥ L, it is guaranteed that the local Lipschitz constant condition will be satisfied. In both cases, when computing L is intractable, we are not able to distinguish the two cases by just having an L̃ that satisfies (24). As we can see, finding a good L̃ involves a series of function evaluations. In the next section, we review the commonly used stepsize strategies.

Current stepsize strategies
To the best of our knowledge, current strategies fall into four categories: (i) a fixed stepsize based on an estimate of ‖A‖^2; (ii) the backtracking method; (iii) Nesterov's adaptive method; and (iv) stepsize strategies designed for specific algorithms.

Theorem 14
Suppose L̃ is used as an initial guess for the k-th iteration, and we select the smallest q ∈ N such that L̃_k = η^q L̃ satisfies the local condition, for η ≥ 1. To guarantee convergence, it requires

q = max{⌈log_η(L/L̃)⌉, 0},

which is also the number of extra function evaluations required. We have L̃_k ≤ max{ηL, L̃}.

Proof To guarantee convergence, it requires q such that L̃_k = η^q L̃ ≥ L. If L̃ ≤ L, q should be selected such that η^q L̃ ≤ ηL; otherwise q − 1 would already be large enough to be selected, i.e. L̃_k = η^{q−1} L̃ ≥ L. If L̃ > L, then q = 0 and L̃_k = L̃.
Theorem 14 covers the settings of choices (i)–(iii), also referred to as the fixed stepsize strategy, the backtracking method, and Nesterov's adaptive method, respectively. For fixed stepsize strategies, L̃ ≥ L is selected for all iterations, which yields q = 0, and thus checking the local condition is not required [1]. For the backtracking method, L̃ = L̃_{k−1} and η > 1 is a parameter of the strategy. Since L̃_k is monotonically increasing in k, q is monotonically decreasing. Therefore, q at the k-th iteration equals the total number of (extra) function evaluations for the rest of the iterations.
On the other hand, for Nesterov's adaptive method, L̃ = L̃_{k−1}/2 and η = 2. L̃_k is not monotonically increasing in k, and in each iteration, q is the number of function evaluations in the worst case. Notice that once the worst case occurs (with q function evaluations) in the k-th iteration, q will be smaller afterwards, since L̃_k is then sufficiently large. For his universal gradient methods [12], Nesterov proved that over k iterations the number of function evaluations is bounded by O(2k).
Theorem 14 illustrates the trade-off between three aspects: the aggressiveness of the initial guess L̃, the recovering rate η, and the convergence rate. Methods with a small (aggressive) initial guess L̃ can result in larger stepsizes. However, this yields a larger q, the number of function evaluations in the worst case. One could reduce q by setting a larger η, so that L̃ scales quickly towards L, but this results in a slower rate of convergence (ηL). If one wants to preserve a good convergence rate (small η) with a small number of function evaluations (small q), then L̃ cannot be too small; in that case one gives up the opportunity of having large stepsizes. The fixed stepsize strategy is the extreme case of minimizing q by giving up the opportunity of having larger stepsizes.
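The trade-off can be made concrete with a small sketch (the function name is ours) computing the worst-case number of extra evaluations, i.e. the smallest q with η^q L̃ ≥ L:

```python
import math

def worst_case_evals(L, Ltil, eta):
    """Smallest q such that eta**q * Ltil >= L (0 if Ltil already exceeds L)."""
    if Ltil >= L:
        return 0
    return math.ceil(math.log(L / Ltil) / math.log(eta))
```

For example, with L = 100 and the aggressive guess L̃ = 1, recovery takes 7 evaluations at η = 2 but 12 at η = 1.5, while the accepted constant (and hence the rate) is better for the smaller η.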
The proposed stepsize strategy PUG tries to reduce L̃ as in (iii), but guides L̃ to increase reasonably quickly when it fails to satisfy the local condition. In particular, by replacing L with its probabilistic upper bound, an aggressive L̃ and a fast recovering rate are allowed without slowing the convergence. This feature does not obey the trade-off that constrains choices (i)–(iii). Also, PUG is flexible compared to (iv). It can be applied to all algorithms that require L, as well as to mini-batch and block-coordinate-type algorithms which use a submatrix of A.

PUG
In this section, we will use the tail bounds to develop PUG. Using Eq. (15) and Proposition 1, we first define the upper bound at confidence level ε,

U(ε) ≜ (γ/m)(2m μ_max + R log(n) − R log(ε)),   (25)

so that L ≤ U(ε) with probability at least 1 − ε. We point out that the probabilistic upper bound (15) does not rely on the assumption that the dataset is normalized with mean zero and unit variance, and so it is applicable to all types of datasets. The basic idea of PUG is to use the result in the following theorem.

Theorem 15 Suppose L̃ is used as an initial guess for the k-th iteration, and we denote

η_PUG,N ≜ (U(ε)/L̃)^{1/N},

where U(ε) is defined as in (25). If we select the smallest q ∈ N such that L̃_k = η_PUG,N^q L̃ satisfies the local condition, then with probability at least 1 − ε, it requires at most q = N to guarantee convergence. In particular, we have L̃_k ≤ U(ε) with probability at least 1 − ε.

Proof
To guarantee convergence, it requires q such that L̃_k = η_PUG,N^q L̃ ≥ L. When q = N, L̃_k = U(ε) ≥ L with probability at least 1 − ε.

Theorem 15 shows the potential advantage of PUG. With any initial guess L̃, PUG is able to scale L̃ quickly towards L without interfering with the probabilistic convergence rate. This unique feature allows an aggressive initial guess L̃, as in Nesterov's adaptive strategy, without a low recovering rate or a slow convergence rate. Algorithm 1 provides the details of PUG with N = 2. In Algorithm 1, the Lipschitz constant estimate from the last iteration, L̃_{k−1}, is first divided by 2 to form the initial guess L̃, which is the same as in Nesterov's adaptive method. We point out that evaluating U(ε) requires the quantities μ_max and R.
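The mechanics of Theorem 15 can be sketched as follows (our own minimal reading, with U(ε) passed in as a precomputed constant): the trial values start at L̃_{k−1}/2 and are scaled by η_PUG,N = (U(ε)/L̃)^{1/N}, so after at most N rejections the trial value reaches U(ε).

```python
def pug_trials(Ltil_prev, U_eps, N=2):
    """Candidate Lipschitz estimates tried by PUG in one iteration (a sketch).

    Starts from the aggressive guess Ltil_prev / 2 and scales by
    eta = (U_eps / Ltil) ** (1 / N), so the (N+1)-th candidate equals U_eps,
    which upper-bounds L with probability at least 1 - eps.
    """
    Ltil = Ltil_prev / 2.0
    if Ltil >= U_eps:            # already above the probabilistic bound
        return [Ltil]
    eta = (U_eps / Ltil) ** (1.0 / N)
    return [Ltil * eta ** q for q in range(N + 1)]
```

For instance, pug_trials(8.0, 100.0) gives the candidates 4, 20, 100: an aggressive first guess, yet at most two extra evaluations before reaching a value that exceeds L with high probability.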
In the case where computing μ_max is impractical, it could be bounded by

μ_max = λ_max(Σ) ≤ trace(Σ) = Σ_{i=1}^n E[ξ_i^2].   (26)

With the assumption that the ξ_i's have zero mean and unit variance, μ_max ≤ n. For an A that does not satisfy these assumptions due to a different normalization process of the data, (26) could be used to bound μ_max. For the R in (25), one could use c^2 n as in Corollary 12, or a tighter estimate would be

R = max_{1≤i≤m} ‖a_i‖^2.   (27)
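Putting (25)–(27) together, U(ε) can be estimated directly from the data at essentially the cost of one pass over A; a sketch under our reading of (25)–(27) (the function name is ours):

```python
import numpy as np

def pug_bound(A, gamma, eps):
    """Estimate U(eps) = (gamma/m) * (2*m*mu_max + R*log(n) - R*log(eps)),
    with R = max_i ||a_i||^2 as in (27) and mu_max bounded via the trace as
    in (26), estimated here by trace(A^T A) / m."""
    m, n = A.shape
    row_norms_sq = np.sum(A ** 2, axis=1)
    R = row_norms_sq.max()            # (27)
    mu_max = row_norms_sq.sum() / m   # trace-based bound (26)
    return gamma / m * (2 * m * mu_max + R * (np.log(n) - np.log(eps)))
```

Since λ_max(A^T A) ≤ trace(A^T A), this trace-based U(ε) always dominates (γ/m)‖A‖^2 for ε ≤ 1; the probabilistic content of (25) is that a much smaller bound based on the true μ_max already suffices with probability 1 − ε.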

Convergence bounds: regular strategies versus PUG
Different stepsize strategies lead to different convergence rates even for the same algorithm. Since PUG is based on the probabilistic upper bound U(ε) in (25), it leads to a probabilistic convergence rate for FISTA. In particular,

F(x_k) − F(x*) ≤ 2U(ε)M/(k + 1)^2   (28)

with probability at least 1 − ε. Equation (28) holds with probability at least 1 − ε because of Eq. (25), which holds for all iterations of FISTA. In particular, once the instance (the matrix A) is fixed, we know L ≤ U(ε) with probability at least 1 − ε. If the probabilistic upper bound holds, then U(ε) is the worst Lipschitz constant estimate computed by PUG over all iterations, and so (28) holds. Therefore, the above result can be obtained using the same argument as in the proof of convergence in [1]. When using regular stepsize strategies, FISTA attains convergence rates of the form (21) with different η's (η > 1). For the backtracking strategy, η is a user-specified parameter. It is clear from (21) that convergence is better when η is close to 1. However, it would then take more iterations and more function evaluations to find a satisfactory stepsize, and these costs are not captured in (21). In the case of Nesterov's adaptive strategy [12], η = 2. Using the same analysis as in Sect. 3.2, L should be replaced with the upper bound in (22)–(23) for the average case, or with U(ε) in (25) for the probabilistic case. For the probabilistic case, those convergence rates are of the same order as when using PUG, as shown in (28).
Therefore, PUG is competitive compared to other stepsize strategies in the probabilistic case. The strength of PUG comes from the fact that it is adaptive with strong theoretical guarantee that with high probability,L k will quickly be accepted at each iteration.

Mini-batch algorithms and block-coordinate algorithms
For mini-batch algorithms, each iteration is performed using only a subset of the whole training set. Therefore, in each iteration, we consider the matrix that contains the corresponding subset. This matrix is a submatrix of A with the same structure, and therefore it is also a random matrix, of smaller size m̃ × n, where m̃ < m. Using the existing results, we can conclude that the associated U(ε) in each iteration would be larger than in full-batch algorithms. As a result, the guaranteed stepsize for mini-batch algorithms tends to be smaller than for full-batch algorithms.
On the other hand, block-coordinate algorithms do not update all dimensions at once in each iteration. Rather, a subset of dimensions is selected to perform the update. In this setting, we only consider the variables (columns of A) that are associated with the selected coordinates, i.e. the submatrix formed by those columns of A. This submatrix is itself also a random matrix, of smaller size m × ñ, where ñ < n. Using the existing results, the guaranteed stepsize for block-coordinate algorithms tends to be larger.
Thus, with minor modifications, PUG can be applied to mini-batch and block-coordinate algorithms.

Numerical experiments
In the first part of this section, we apply the bounds from Sect. 3 to illustrate the relationship between the different parameters and L. Then, we apply PUG to two regression examples. The datasets used for the two regression examples can be found at https://www.csie.ntu.edu.tw/~cjlin/libsvm-tools/datasets.

Numerical simulations for average L
We consider three cases, and in each case we simulate A with different dimensions m and n. Each configuration is simulated with 1000 instances, and we study the sample-average behavior of L.
In the first case, we consider the most complicated situation and create a random vector whose entries are neither identical nor independent. We use a mixture of three types of random variables (exponential, uniform, and multivariate normal) to construct the matrix A ∈ R^{m×n}. The rows of A are independent samples of ξ^T = (ξ_1, ξ_2, . . . , ξ_n). We divide A into three parts with n_1, n_2, and n_3 columns, where n_1 = n_2 = n_3 = n/3 up to rounding errors. The first n_1 elements of ξ are (scaled and shifted) exponential random variables, the next n_2 elements are (scaled and shifted) uniform random variables, and (ξ_{n_1+n_2+1}, ξ_{n_1+n_2+2}, . . . , ξ_n) ∼ N(0_{n_3×1}, Σ̂), where Σ̂ is an n_3 × n_3 matrix with 1 on the diagonal and 0.5 otherwise. ξ_1, ξ_2, . . . , ξ_{n_1+n_2} are independent. The scaling factors of the uniform and exponential distributions are used to normalize the random variables ξ_j such that E[ξ_j] = 0 and E[ξ_j^2] = 1. Some entries of A are normally distributed or exponentially distributed, and we approximate the upper bound of the entries with c = 3. From statistics, we know that with very high probability this approximation is valid.
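This construction can be reproduced as follows (a sketch; the exact scalings are elided in the text, so we use Exp(1) − 1 and Uniform(−√3, √3), both of which have mean 0 and variance 1):

```python
import numpy as np

def mixture_matrix(m, n, rng):
    """Random matrix with three column blocks: normalized exponential,
    normalized uniform, and correlated standard normal entries."""
    n1 = n // 3
    n2 = n // 3
    n3 = n - n1 - n2
    expo = rng.exponential(1.0, size=(m, n1)) - 1.0                 # mean 0, variance 1
    unif = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(m, n2))   # mean 0, variance 1
    Sig = np.full((n3, n3), 0.5)
    np.fill_diagonal(Sig, 1.0)           # 1 on the diagonal, 0.5 otherwise
    norm = rng.multivariate_normal(np.zeros(n3), Sig, size=m)
    return np.hstack([expo, unif, norm])
```

The sample-average Lipschitz constant is then (γ/m) times the average of ‖A‖_2^2 over the simulated instances.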
In Fig. 1, we plot the sample-average Lipschitz constant over 1000 instances. As expected, the expected Lipschitz constant is "trapped" between its lower and upper bounds. We can see that the expected L increases as m and n increase with the ratio n/m fixed. This phenomenon is due to the fact that μ_max = λ_max(Σ), the largest eigenvalue of the covariance of ξ, increases as n increases.
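The growth of μ_max with n is easy to verify for the normal block: Σ̂ = 0.5 I + 0.5 J (J the all-ones matrix), so its largest eigenvalue is 1 + 0.5(n_3 − 1), which grows linearly in n_3. A quick numerical check:

```python
import numpy as np

# lambda_max of Sigma_hat (1 on the diagonal, 0.5 elsewhere) grows
# linearly with the block dimension n3, since Sigma_hat = 0.5*I + 0.5*J
# and the all-ones matrix J has largest eigenvalue n3.
for n3 in (4, 16, 64):
    Sigma_hat = 0.5 * np.ones((n3, n3)) + 0.5 * np.eye(n3)
    mu_max = np.linalg.eigvalsh(Sigma_hat).max()
    assert np.isclose(mu_max, 1.0 + 0.5 * (n3 - 1))
    print(n3, mu_max)
```

This explains why the expected L grows with n in the first case but stays nearly flat once Σ̂ is replaced by the identity.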
To further illustrate this, we consider the second case. The setting is the same as in the first case, except that we replace Σ̂ with the identity matrix, so all the variables are independent. In this case, μ_max = 1 regardless of the size of A. The ratio n/m = 2 in this example. From Fig. 2, the sample-average L does not increase rapidly as the size of A increases. These results match the bound (22).
In the last case, we investigate the effect of the ratio n/m. The setting is the same as in the first case, but we fix n = 1024 and test different values of m. From Fig. 3, the sample-average L decreases as m increases. This result suggests that a large dataset is favorable in terms of complexity, especially for large-scale (large n) ERM problems.

Regularized logistic regression
We implement FISTA with three different stepsize strategies: (i) the regular backtracking stepsize strategy, (ii) Nesterov's adaptive stepsize strategy, and (iii) the proposed adaptive stepsize strategy PUG. We compare the three strategies on an ℓ1-regularized logistic regression problem, in which we solve the convex optimization problem  We use the dataset gisette for A and b. Gisette is a handwritten-digit dataset from the NIPS 2003 feature selection challenge. The matrix A is a 6000 × 5000 dense matrix, so we have n = 5000 and m = 6000. The parameter ω is chosen to be the same as in [9,21]. We chose L̂_0 = 1 for all three stepsize strategies. For the backtracking stepsize strategy, we chose η = 1.5.
We compare our proposed probabilistic bound with the deterministic upper bound L̂ obtained from ‖A‖² ≤ trace(AᵀA). We estimate μ_max = 1289.415 and R = 4955 using Eqs. (26) and (27), respectively. We thus obtain our probabilistic bound U(0.1) = 646.941, which is less than the deterministic upper bound L̂ = 1163.345.
Table 1 shows the performance of the three stepsize strategies: T is the scaled computational time, nIter is the scaled number of iterations, nFunEva is the scaled number of function evaluations, and Avg. L̂ is the average of the L̂ values used. These results favor the two adaptive stepsize strategies, as the number of iterations needed and the computational time are significantly smaller than for the regular backtracking algorithm. This is due to the fact that L̂ can be much smaller than the Lipschitz constant L in this example, so the two adaptive strategies provide more efficient updates for FISTA. As shown in Table 1, even though Nesterov's strategy yields a smaller L̂ on average, it leads to a higher number of function evaluations and so takes more time than PUG.
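For concreteness, a minimal sketch of FISTA with the regular backtracking strategy (strategy (i)) on an ℓ1-regularized logistic regression problem is given below. This is an illustration on a small synthetic stand-in for gisette, not the paper's implementation; the PUG cap U(0.1) and Nesterov's adaptive variant are not reproduced here, and helper names are our own.

```python
import numpy as np

def soft_threshold(v, t):
    # proximal operator of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def logistic_loss_grad(x, A, b):
    # f(x) = (1/m) * sum_i log(1 + exp(-b_i * a_i^T x)), b_i in {-1, +1}
    m = A.shape[0]
    z = -b * (A @ x)
    loss = np.mean(np.logaddexp(0.0, z))
    grad = -(A.T @ (b / (1.0 + np.exp(-z)))) / m  # -(1/m) A^T (b * sigmoid(z))
    return loss, grad

def fista_backtracking(A, b, omega, L0=1.0, eta=1.5, iters=200):
    n = A.shape[1]
    x = np.zeros(n)
    y = x.copy()
    t, L = 1.0, L0
    for _ in range(iters):
        fy, gy = logistic_loss_grad(y, A, b)
        while True:  # backtracking: inflate L until the quadratic model holds
            x_new = soft_threshold(y - gy / L, omega / L)
            d = x_new - y
            fx, _ = logistic_loss_grad(x_new, A, b)
            if fx <= fy + gy @ d + 0.5 * L * (d @ d) + 1e-12:
                break
            L *= eta
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return x

# Small synthetic instance (stand-in for gisette).
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 50))
b = np.sign(A @ rng.standard_normal(50))
x_hat = fista_backtracking(A, b, omega=1e-3)
obj = logistic_loss_grad(x_hat, A, b)[0] + 1e-3 * np.abs(x_hat).sum()
print(obj)
```

An adaptive strategy in the spirit of PUG would cap the working estimate L̂ at the probabilistic bound U(0.1) instead of letting backtracking inflate it freely.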

Regularized linear regression
We also compare the three strategies on an ℓ1-regularized linear regression problem, a.k.a. LASSO, in which we solve the convex optimization problem We use the dataset YearPredictionMSDt (testing dataset) for A and b. In YearPredictionMSDt, the matrix A is a 51630 × 90 dense matrix, so we have n = 90 and m = 51630. The parameter ω is chosen to be 10^{-6}. We chose L̂_0 = 1 for all three stepsize strategies. For the backtracking stepsize strategy, we chose η = 1.5.
We compare our proposed probabilistic bound with the deterministic upper bound L̂ obtained from ‖A‖² ≤ trace(AᵀA). We estimate μ_max = 9.603 × 10^6 and R = 4.644 × 10^9 using Eqs. (26) and (27), respectively. We thus obtain our probabilistic bound U(0.1) = 1.982 × 10^7, which is less than the deterministic upper bound L̂ = 2.495 × 10^7. Table 2 shows the performance of the three stepsize strategies, with the same structure as Table 1; the best performance in each aspect is shown in bold. Unlike for gisette, the adaptive strategies failed to provide a small L̂ compared to L. Also, since n is very small, function evaluations are very cheap compared to gisette. Therefore, the two adaptive strategies do not outperform the backtracking strategy in this example. However, both adaptive strategies yielded a reduction in the number of function evaluations, so one could expect them to outperform the backtracking strategy on larger or more difficult instances.
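The deterministic bound used in both experiments rests on the fact that the trace of AᵀA (the sum of its eigenvalues) dominates its largest eigenvalue ‖A‖². A quick numerical check, on synthetic data, of why this bound is valid but often loose:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 30))

spec_sq = np.linalg.norm(A, 2) ** 2   # ||A||_2^2 = lambda_max(A^T A)
trace_bound = np.trace(A.T @ A)       # sum of all 30 eigenvalues

# The sum of nonnegative eigenvalues dominates the largest one, so
# trace(A^T A) is a valid (often loose) deterministic upper bound.
assert spec_sq <= trace_bound
print(spec_sq, trace_bound)
```

When the eigenvalues of AᵀA are spread out, as for YearPredictionMSDt with n = 90, the trace can exceed the spectral norm by a large factor, which is why the probabilistic bound U(0.1) is tighter in both examples.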

Conclusions and perspectives
The analytical results in this paper show the relationship between the Lipschitz constant and the training set of an ERM problem. These results provide insightful information about the complexity of ERM problems, as well as opening up opportunities for new stepsize strategies for optimization problems.
One interesting extension could be to apply the same approach to different machine learning models, such as neural networks, deep learning, etc.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.