A unified approach for a 1D generalized total variation problem

We study a 1-dimensional discrete signal denoising problem that consists of minimizing a sum of separable convex fidelity terms and convex regularization terms; the latter penalize the differences of adjacent signal values. This problem generalizes the total variation regularization problem. We provide a unified approach to solving the problem for general convex fidelity and regularization functions, based on the Karush–Kuhn–Tucker (KKT) optimality conditions. This approach is shown to lead to a fast algorithm for the problem with general convex fidelity and regularization functions, and a faster algorithm if, in addition, the fidelity functions are differentiable and the regularization functions are strictly convex. Both algorithms achieve the best theoretical worst-case complexity among existing algorithms for the classes of objective functions studied here. In practice, too, our C++ implementation of the method is considerably faster than popular C++ nonlinear optimization solvers for the problem.


Introduction
The problem addressed here is the 1-dimensional discrete signal denoising problem, formulated as

(1D-Denoise)   min_{x_1,...,x_n} \sum_{i=1}^{n} f_i(x_i) + \sum_{i=1}^{n-1} h_i(x_i - x_{i+1}),   (1)

where the f_i are convex fidelity functions and the h_i are convex regularization functions penalizing the differences of adjacent signal values. The fastest algorithm for MRF with general convex fidelity functions f_i and convex regularization functions h_{i,j} has complexity O(mn log(n²/m) log(nU)) (U = max_i {u_i − ℓ_i}) [1]. The complexity model used here is the oracle unit-cost complexity model, which assumes O(1) time to evaluate the function value on any argument input of polynomial length. This complexity model places no restriction on the structure of the convex functions. Since for 1D-Denoise m = n − 1, applying the algorithm of [1] to 1D-Denoise yields a complexity of O(n² log n log(nU)). Other than the O(log U) factor which our algorithm adds, it improves the complexity for the dsc-class of objective functions over the best existing algorithm by a factor of O(n log n), and for the cc-class of objective functions, our algorithm improves on the complexity of the best existing algorithm by an O(log n) factor.
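To make the formulation concrete, the objective of (1) can be evaluated generically from user-supplied callables. The sketch below is illustrative and not from the paper (the function name is ours); it treats the fidelity terms f_i and regularization terms h_i as black-box oracles, matching the unit-cost model:

```python
def denoise_objective(x, f, h):
    """Evaluate the 1D-Denoise objective sum_i f_i(x_i) + sum_i h_i(x_i - x_{i+1}).

    x: list of n signal values; f: list of n fidelity callables;
    h: list of n-1 regularization callables on adjacent differences."""
    n = len(x)
    fidelity = sum(f[i](x[i]) for i in range(n))
    regularization = sum(h[i](x[i] - x[i + 1]) for i in range(n - 1))
    return fidelity + regularization
```

For instance, ℓ2-TV-unweighted corresponds to f_i(v) = (v − a_i)² and h_i(d) = λ|d|.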
The paper is organized as follows. Section 1.1 reviews the literature on special cases of 1D-Denoise and the associated ad-hoc algorithms. Notation and preliminaries are introduced in Sect. 2. We present an overview of our unified KKT approach in Sect. 3, followed by a discussion of the dsc-algorithm in Sect. 4 and the cc-algorithm in Sect. 5. In Sect. 6, we present a numerically improved implementation of the KKT approach that combines the advantages of both the dsc-algorithm and the cc-algorithm. Numerical comparison results are reported in Sect. 7. Conclusions and future research directions are provided in Sect. 8.

Literature review
One of the most studied special cases of the 1D-Denoise problem has ℓ2 fidelity functions and ℓ1 regularization functions, a.k.a. the total variation (TV) regularization functions:

(ℓ2-TV-unweighted)   min_{x_1,...,x_n} \sum_{i=1}^{n} (x_i - a_i)^2 + \lambda \sum_{i=1}^{n-1} |x_i - x_{i+1}|.   (3)

The suffix "-unweighted" refers to the identical coefficient λ of all the regularization terms. TV-regularized terms have been shown to lead to piecewise-constant recovered signals with sharp jumps/edges. The ℓ2-TV-unweighted model is widely applied in signal processing, statistics, machine learning, and many fields of science and engineering [7,17–19]. Rudin et al. [19] first advocated the use of ℓ1 regularization in image signal processing, and thereby popularized total variation (TV) denoising models. In statistics contexts, ℓ2-TV-unweighted is known as a least squares penalized regression model with total variation penalties [18]. The efficient taut string algorithm for the problem is well known in the statistics community [3,6–8,18]. The classic taut string algorithm solves the ℓ2-TV-unweighted problem in O(n) time.
A number of other algorithms have been proposed in the signal processing community to solve ℓ2-TV-unweighted. Condat [4] proposed an algorithm with theoretical complexity O(n²) that is fast in practice, with observed performance of O(n). Johnson [14] proposed an efficient dynamic programming algorithm with complexity O(n). Kolmogorov et al. [16] proposed a message passing algorithm framework which, when applied to the ℓ2-TV-unweighted problem, has complexity O(n). Barbero and Sra [2] proposed three algorithms to solve the problem. The first is a Projected Newton method, an iterative algorithm where, for this case, the authors reduce the cost of each iteration from the standard O(n³) to O(n) by exploiting the structure of the problem. The second is a linearized taut string algorithm, which they showed to be equivalent to Condat's algorithm [4] and hence to have worst-case complexity O(n²). The third is a hybrid taut string algorithm that combines the advantages of the classic taut string algorithm and the proposed linearized taut string algorithm, with complexity O(n^S) for some S ∈ (1, 2).
Some of the aforementioned algorithms can be applied to the weighted variant:

(ℓ2-TV-weighted)   min \sum_{i=1}^{n} (x_i - a_i)^2 + \sum_{i=1}^{n-1} c_{i,i+1} |x_i - x_{i+1}|.   (4)

The difference between the weighted and unweighted problems is that the regularization coefficients c_{i,i+1} can differ across regularization terms. The classic taut string algorithm can easily be adapted to this case while maintaining the O(n) complexity [2,6]. The Projected Newton method and the linearized taut string algorithm are adapted in [2] to solve this case as well, with no change in the respective complexities. Kolmogorov et al.'s message passing algorithm [16] has O(n) complexity for the weighted case as well.
The use of absolute value (ℓ1) loss functions in the data fidelity terms is known to be more robust to heavy-tailed noise and to the presence of outliers than quadratic (ℓ2) loss functions [11,20]. As such, the following ℓ1-TV problem has been investigated in the literature:

(ℓ1-TV)   min \sum_{i=1}^{n} |x_i - a_i| + \sum_{i=1}^{n-1} c_{i,i+1} |x_i - x_{i+1}|.   (5)

Storath et al. [20] studied an unweighted version of ℓ1-TV, where c_{i,i+1} = λ for all i, for real-valued and circle-valued data, and provided an O(nK) algorithm, for K denoting the number of different values in {a_i}_{i=1,...,n}. If the data is quantized to finitely many levels, the algorithm's complexity is O(n); in the worst case it can be O(n²). Kolmogorov et al. [16] studied a generalized version of ℓ1-TV where each fidelity function is a convex piecewise linear function with O(1) breakpoints. They provided two efficient algorithms of complexities O(n log n) and O(n log log n) respectively. Hochbaum and Lu [11] studied a further generalized version where each fidelity function is a convex piecewise linear function with an arbitrary number of breakpoints. They provided an O(q log n) algorithm for q the total number of breakpoints of the n convex piecewise linear fidelity functions. Applying Hochbaum and Lu's algorithm to ℓ1-TV gives an O(n log n) algorithm since q = n.

Tikhonov-regularized terms impose global smoothing over the denoised signals, rather than preserving sharp jumps/edges as the TV-regularized terms do. With ℓ2 fidelity functions the problem formulation is:

(ℓ2-Tikhonov)   min \sum_{i=1}^{n} (x_i - a_i)^2 + \sum_{i=1}^{n-1} c_{i,i+1} (x_i - x_{i+1})^2.   (6)

The optimal solution to ℓ2-Tikhonov can be obtained by solving a system of linear equations. It is easy to verify that this system is tridiagonal (i.e., the matrix A is a tridiagonal matrix in the equations Ax = b). It is well known that the Thomas algorithm [5] solves this special form of linear system in O(n) time.
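To illustrate the ℓ2-Tikhonov case, the tridiagonal stationarity system can be solved directly. The sketch below (ours, not the paper's implementation) differentiates (6) with unit weight on the fidelity terms, which yields (1 + c_{i−1} + c_i)x_i − c_{i−1}x_{i−1} − c_i x_{i+1} = a_i, and solves this system with the Thomas algorithm in O(n) time:

```python
def solve_l2_tikhonov(a, c):
    """Minimize sum_i (x_i - a_i)^2 + sum_i c_i (x_i - x_{i+1})^2 by solving
    the tridiagonal optimality system with the Thomas algorithm."""
    n = len(a)
    # Stationarity: (1 + c_{i-1} + c_i) x_i - c_{i-1} x_{i-1} - c_i x_{i+1} = a_i
    lower = [-c[i - 1] for i in range(1, n)]            # sub-diagonal
    diag = [1 + (c[i - 1] if i > 0 else 0) + (c[i] if i < n - 1 else 0)
            for i in range(n)]
    upper = [-c[i] for i in range(n - 1)]               # super-diagonal
    b = list(a)
    # Forward elimination
    for i in range(1, n):
        w = lower[i - 1] / diag[i - 1]
        diag[i] -= w * upper[i - 1]
        b[i] -= w * b[i - 1]
    # Back substitution
    x = [0.0] * n
    x[n - 1] = b[n - 1] / diag[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = (b[i] - upper[i] * x[i + 1]) / diag[i]
    return x
```

For example, with a = [0, 2] and c = [1], the stationarity system gives x = (2/3, 4/3).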
All the aforementioned special cases can be cast under the umbrella of a more general ℓp-ℓq problem, for any p, q ≥ 1:

(ℓp-ℓq)   min \sum_{i=1}^{n} |x_i - a_i|^p + \sum_{i=1}^{n-1} c_{i,i+1} |x_i - x_{i+1}|^q.   (7)

Weinmann et al. [21] studied the unweighted version of ℓp-ℓq, where all c_{i,i+1} = λ for a fixed λ value:

min \sum_{i=1}^{n} |x_i - a_i|^p + \lambda \sum_{i=1}^{n-1} |x_i - x_{i+1}|^q.   (8)

Weinmann et al. solved problem (8) for manifold-valued data, which is more general than the real-valued data considered here, with an O(nkT) algorithm, where T is the number of iterations needed to reach a certain stopping criterion and k is the cost of the computation at each iteration. For the cases q = 1 (TV) and q = 2 (Tikhonov), they proved that k = O(1). The convergence of their algorithm is proved in [21], but no bound is provided on the value of T. In addition, [21] also studied models where the fidelity terms and/or the regularization terms are replaced by the Huber function [13]. The Huber function is known to be more robust to outliers than quadratic (ℓ2) functions [13]. Weinmann et al. showed that the computational complexity of every iteration of their algorithm for Huber functions is k = O(1). In this paper, we conduct an experimental study on two problems with Huber objective functions.
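For concreteness, a standard definition of the Huber function with threshold M (the common convention; the paper's exact scaling of (9) may differ) is quadratic near zero and linear in the tails:

```python
def huber(x, M):
    """Huber loss: x^2/2 for |x| <= M, and M(|x| - M/2) otherwise.
    The two branches agree at |x| = M, so the function is C^1."""
    ax = abs(x)
    if ax <= M:
        return 0.5 * x * x
    return M * (ax - 0.5 * M)
```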

Notation and preliminaries
In this paper, unless specified otherwise, we do not assume any restriction on the functions f_i and h_i other than convexity, and we adopt the same unit-cost complexity model as in [1] for MRF (2). Therefore, each function can be evaluated in O(1) time for a given input. In addition, our algorithms provide solutions to ε-accuracy of the optimal solution; in other words, our solution agrees with the optimal solution in the first log(1/ε) digits. When we consider the case where a function f is differentiable, we assume that its derivative can also be evaluated in O(1) time for any input. Otherwise, we compute the function's subgradients at the solution-accuracy granularity, where the left and right subgradients of a function f at input x are

f'_L(x) = (f(x) − f(x − ε))/ε,   f'_R(x) = (f(x + ε) − f(x))/ε.

Thus both the left and right subgradients can be computed in O(1) time. The range from the left subgradient to the right subgradient of f at x forms the subdifferential ∂f(x) = [f'_L(x), f'_R(x)]. By the convexity of f, for any x_1 < x_2 we have

f'_L(x_1) ≤ f'_R(x_1) ≤ f'_L(x_2) ≤ f'_R(x_2).   (11)

A function f(x) is said to be strictly convex if for any x_1 < x_2 and λ ∈ (0, 1):

f(λx_1 + (1 − λ)x_2) < λf(x_1) + (1 − λ)f(x_2).

Given a strictly convex function f(x), for any x_1 < x_2, the second inequality of (11) holds strictly:

f'_R(x_1) < f'_L(x_2).   (12)

We introduce the following subgradient-to-argument inverse operations. For a convex function h and a given subgradient value g, we would like to compute the maximal interval [z_L, z_R] of arguments z such that g ∈ ∂h(z). The maximality of the interval implies that z_L satisfies g ∈ ∂h(z_L) and h'_R(z) < g for all z < z_L; similarly, z_R satisfies g ∈ ∂h(z_R) and h'_L(z) > g for all z > z_R. We denote the two inverse operations to compute z_L and z_R by z_L := (∂h)^{-1}_L(g) and z_R := (∂h)^{-1}_R(g), respectively. Our algorithms will use these two inverse operations.
By the convexity of function h, for any two subgradient values g_1 < g_2, we have (∂h)^{-1}_L(g_1) ≤ (∂h)^{-1}_R(g_1) ≤ (∂h)^{-1}_L(g_2) ≤ (∂h)^{-1}_R(g_2). In addition, if h is strictly convex, then since the subgradients are strictly increasing (from inequality (12)), we have (∂h)^{-1}_L(g) = (∂h)^{-1}_R(g), and we denote this common value by (∂h)^{-1}(g). As for the complexity of the inverse operations, the values of (∂h)^{-1}_L(g) or (∂h)^{-1}_R(g) for a given g can be found by binary search on an interval of length O(U): since h is convex, finding these values reduces to finding the zeros of monotone subgradient functions. The complexity of computing a subgradient-to-argument inverse is thus O(log U) to ε-accuracy. Note that if h has special structure, the complexity of the inverse can be improved to O(1); examples include h quadratic (ℓ2) or piecewise linear with few pieces, including absolute value (ℓ1) functions.
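The binary search behind the inverse operation can be sketched as follows, assuming for simplicity that h is differentiable with a nondecreasing derivative dh supplied as a callable (so the left and right inverses coincide); the function name and signature are ours:

```python
def subgradient_inverse(dh, g, U, eps=1e-9):
    """Find z with dh(z) = g by bisection on [-U, U], where dh is the
    nondecreasing derivative of a convex h.  O(log(U/eps)) evaluations."""
    lo, hi = -U, U
    while hi - lo > eps:
        mid = 0.5 * (lo + hi)
        if dh(mid) < g:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For h(z) = z², dh(z) = 2z, the inverse of g = 3 is 1.5.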

An equivalent formulation of 1D-Denoise
We now present an equivalent formulation of 1D-Denoise (1). Consider each regularization function h_i(x_i − x_{i+1}), and let c_i be a minimizer of h_i; without loss of generality, 0 ∈ ∂h_i(c_i). We split h_i into two functions defined on the nonnegative axis,

h_{i,i+1}(z) := h_i(c_i + z) − h_i(c_i)  and  h_{i+1,i}(z) := h_i(c_i − z) − h_i(c_i),  z ≥ 0.

The split functions h_{i,i+1} and h_{i+1,i} have the following properties:

Proposition 1  h_{i,i+1} and h_{i+1,i} are convex, nonnegative, and nondecreasing on the nonnegative axis, with h_{i,i+1}(0) = h_{i+1,i}(0) = 0.

By splitting the regularization functions, we introduce the following equivalent formulation of 1D-Denoise (1):

min \sum_{i=1}^{n} f_i(x_i) + \sum_{i=1}^{n-1} (h_{i,i+1}(z_{i,i+1}) + h_{i+1,i}(z_{i+1,i}))
s.t.  x_i − x_{i+1} ≤ z_{i,i+1} + c_i,   i = 1, ..., n − 1,
      x_{i+1} − x_i ≤ z_{i+1,i} − c_i,   i = 1, ..., n − 1,
      z_{i,i+1}, z_{i+1,i} ≥ 0,          i = 1, ..., n − 1.   (13)
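One natural way to realize the split in code (assuming the normalization h_{i,i+1}(0) = h_{i+1,i}(0) = 0; an additive constant in the split does not change the optimal solution) is:

```python
def split_regularizer(h, c):
    """Split a convex h into h_fwd (= h_{i,i+1}) and h_bwd (= h_{i+1,i}),
    both defined on z >= 0 and anchored at a minimizer c of h."""
    h_fwd = lambda z: h(c + z) - h(c)
    h_bwd = lambda z: h(c - z) - h(c)
    return h_fwd, h_bwd
```

With this choice the identity h(d) = h(c) + h_fwd(max(d − c, 0)) + h_bwd(max(c − d, 0)) holds for all d, since at most one of the two split arguments is positive.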

Lemma 2  Formulations (1) and (13) are equivalent: they share the same optimal solution.
Proof  Let F({x_i}_{i=1,...,n}) be the objective value of 1D-Denoise (1) for any given values of {x_i}_{i=1,...,n}, and let G({x_i}_{i=1,...,n}) be the objective value of problem (13) such that, given the values of {x_i}_{i=1,...,n}, the values of the z variables are selected to minimize the objective function. Note that, given x_i and x_{i+1}, either x_i − x_{i+1} − c_i or x_{i+1} − x_i + c_i must be non-positive. Without loss of generality, assume x_i − x_{i+1} − c_i ≥ 0 and thus x_{i+1} − x_i + c_i ≤ 0. Due to the monotonicity of h_{i,i+1} and h_{i+1,i} on the nonnegative axis (Proposition 1), z_{i,i+1} = x_i − x_{i+1} − c_i and z_{i+1,i} = 0 are the optimal values for this given pair (x_i, x_{i+1}) that minimize the objective. Plugging in these two values of z_{i,i+1} and z_{i+1,i}, and applying the same analysis to all pairs (x_i, x_{i+1}), we find that F and G differ only by a constant independent of the x variables. Therefore the lemma holds.

The KKT optimality conditions and overview of our approach
We now state the KKT optimality conditions of formulation (13) of 1D-Denoise. Let λ_{i,i+1} be the dual variable of each constraint x_i − x_{i+1} ≤ z_{i,i+1} + c_i; let λ_{i+1,i} be the dual variable of each constraint x_{i+1} − x_i ≤ z_{i+1,i} − c_i; let μ_{i,i+1} be the dual variable of each constraint z_{i,i+1} ≥ 0 and μ_{i+1,i} the dual variable of each constraint z_{i+1,i} ≥ 0. The KKT optimality conditions state that:

(Stationarity for x)  f'_i(x_i) + λ_{i,i+1} − λ_{i+1,i} − λ_{i−1,i} + λ_{i,i−1} = 0,  i = 1, ..., n (with the convention λ_{0,1} = λ_{1,0} = λ_{n,n+1} = λ_{n+1,n} = 0),   (14)
(Stationarity for z)  λ_{i,i+1} + μ_{i,i+1} ∈ ∂h_{i,i+1}(z_{i,i+1}),  λ_{i+1,i} + μ_{i+1,i} ∈ ∂h_{i+1,i}(z_{i+1,i}),  i = 1, ..., n − 1,   (15)
(Primal feasibility)  x_i − x_{i+1} ≤ z_{i,i+1} + c_i,  x_{i+1} − x_i ≤ z_{i+1,i} − c_i,  z_{i,i+1}, z_{i+1,i} ≥ 0,   (16)
(Dual feasibility)  λ_{i,i+1}, λ_{i+1,i}, μ_{i,i+1}, μ_{i+1,i} ≥ 0,   (17)
(Complementary slackness)  λ_{i,i+1}(x_i − x_{i+1} − z_{i,i+1} − c_i) = 0,  λ_{i+1,i}(x_{i+1} − x_i − z_{i+1,i} + c_i) = 0,  μ_{i,i+1} z_{i,i+1} = 0,  μ_{i+1,i} z_{i+1,i} = 0.   (18)

Equations (14) and (15) are stationarity conditions for the x and z variables respectively, inequalities (16) and (17) are primal and dual feasibility conditions for the primal x, z variables and the dual λ, μ variables respectively, and equations (18) are complementary slackness (C-S) conditions. Our approach directly solves the above KKT optimality conditions. We derive two key results from the KKT conditions. The first is a propagation lemma: for the differentiable-strictly-convex class, we show that, for any value of x_1, one can follow the KKT conditions to uniquely determine the values of x_2, ..., x_n. The only leftover condition in the list of KKT conditions that the generated sequence (x_1, x_2, ..., x_n) may not satisfy, i.e., that prevents the generated sequence from being optimal, is an equation requiring that \sum_{i=1}^{n} f'_i(x_i) = 0 (obtained by summing up the equations in (14)). If this equation does not hold, then thanks to the convexity of f_i and h_{i,i+1}, a monotonicity lemma shows that if \sum_{i=1}^{n} f'_i(x_i) > 0, we should search for a smaller x_1; otherwise we should search for a larger x_1. Combining the propagation lemma and the monotonicity lemma naturally yields a binary search algorithm for the optimal x_1, which further leads to the optimal x_2 through x_n based on the propagation result. This is the dsc-algorithm we present next.
For the convex-convex class, the propagation result is extended from a unique value to a unique range: for any value of x_1, one can follow the KKT conditions to uniquely determine ranges for x_2, ..., x_n. These ranges in turn determine a lower bound and an upper bound on the sum \sum_{i=1}^{n} f'_i(x_i). If the value 0 lies within the bounds, then x_1 is optimal. Otherwise, an extended monotonicity result shows that if the lower bound is above 0, we should search for a smaller x_1; otherwise (the upper bound is below 0) we should search for a larger x_1. By this dichotomy, the optimal value of x_1 is found. The process is then repeated for x_2 through x_n to find their respective optimal values. This is the cc-algorithm we present in Sect. 5.

KKT approach for differentiable-strictly-convex class: dsc-algorithm
With convex differentiable fidelity functions and strictly convex regularization functions, we show that if we fix the value of x_1, then all the values of x_2 to x_n are uniquely determined by the KKT optimality conditions. This is proved in the following propagation lemma.

Lemma 3 (Propagation Lemma)  Given any value of x_1, the KKT optimality conditions (14)–(18) uniquely determine the values of x_2, ..., x_n.
Proof  We prove the lemma by induction on i from 1 to n. The case i = 1 is trivial. Suppose the values of x_1, ..., x_i are uniquely determined by x_1 for some i (1 ≤ i ≤ n − 1); we show that the value of x_{i+1} is also uniquely determined.
Adding up the equations for j = 1, ..., i in the stationarity conditions (14), the dual terms telescope and we obtain

\sum_{j=1}^{i} f'_j(x_j) + λ_{i,i+1} − λ_{i+1,i} = 0.   (19)

There are five cases for Eq. (19), depending on the value of \sum_{j=1}^{i} f'_j(x_j).

Case 1 (\sum_{j=1}^{i} f'_j(x_j) > 0 and the resulting z_{i+1,i} is positive): Suppose λ_{i,i+1} > 0; then λ_{i+1,i} = \sum_{j=1}^{i} f'_j(x_j) + λ_{i,i+1} > 0 as well, and by the complementary slackness conditions (18) both constraints on the pair (x_i, x_{i+1}) are tight; summing the two tight constraints forces z_{i,i+1} = z_{i+1,i} = 0, which is incompatible with the primal and dual feasibility conditions (16)–(17) and the stationarity conditions (15), a contradiction. Therefore λ_{i+1,i} = \sum_{j=1}^{i} f'_j(x_j) and λ_{i,i+1} = 0. By the stationarity condition on z_{i+1,i} in (15), z_{i+1,i} > 0, and this implies μ_{i+1,i} = 0 by the complementary slackness conditions (18). Since h_{i+1,i}(z_{i+1,i}) is a strictly convex function, z_{i+1,i} is uniquely determined as z_{i+1,i} = (∂h_{i+1,i})^{−1}(λ_{i+1,i}). Since λ_{i+1,i} > 0, the complementary slackness conditions (18) make the constraint x_{i+1} − x_i ≤ z_{i+1,i} − c_i tight, so x_{i+1} = x_i + z_{i+1,i} − c_i is uniquely determined.

Case 2 (\sum_{j=1}^{i} f'_j(x_j) > 0 and the resulting z_{i+1,i} is zero): If λ_{i,i+1} > 0, we derive the same contradiction as in Case 1; as a result, λ_{i+1,i} = \sum_{j=1}^{i} f'_j(x_j) and λ_{i,i+1} = 0. Considering the stationarity conditions (15), the strict convexity of h_{i+1,i} still implies z_{i+1,i} = 0. By the complementary slackness conditions (18) with λ_{i+1,i} > 0, x_{i+1} is uniquely determined by the tight constraint, x_{i+1} = x_i − c_i.

Case 3 (\sum_{j=1}^{i} f'_j(x_j) = 0): In this case λ_{i,i+1} = λ_{i+1,i}. If both λ_{i,i+1} and λ_{i+1,i} are positive, then by the complementary slackness conditions (18) on λ_{i,i+1} and λ_{i+1,i}, and the primal feasibility conditions (16), z_{i,i+1} = 0; if λ_{i,i+1} = 0, the complementary slackness conditions (18) also imply z_{i,i+1} = 0. Therefore we always have z_{i,i+1} = 0, and the same analysis shows z_{i+1,i} = 0. The primal feasibility conditions (16) on x_i and x_{i+1}, with both z variables equal to zero, then pin down x_{i+1} = x_i − c_i uniquely.

Case 4 (\sum_{j=1}^{i} f'_j(x_j) < 0 and the resulting z_{i,i+1} is zero): This case is symmetric to Case 2: λ_{i,i+1} = −\sum_{j=1}^{i} f'_j(x_j) and λ_{i+1,i} = 0, z_{i,i+1} = 0 by the strict convexity of h_{i,i+1}, and x_{i+1} = x_i − c_i.

Case 5 (\sum_{j=1}^{i} f'_j(x_j) < 0 and the resulting z_{i,i+1} is positive): This case is symmetric to Case 1. Following the same reasoning as in Case 1, λ_{i,i+1} = −\sum_{j=1}^{i} f'_j(x_j) > 0 and λ_{i+1,i} = 0; hence z_{i,i+1} > 0 and thus μ_{i,i+1} = 0 by the complementary slackness conditions (18), and z_{i,i+1} = (∂h_{i,i+1})^{−1}(λ_{i,i+1}). Since λ_{i,i+1} > 0, x_{i+1} is uniquely determined by the tight constraint x_{i+1} = x_i − z_{i,i+1} − c_i.

In all cases x_{i+1} is uniquely determined, which completes the proof for the case of i + 1.
Lemma 3 implies that, given a value of x_1, the values of x_2 to x_n are uniquely determined by the following iterative equations. Writing

S_i := \sum_{j=1}^{i} f'_j(x_j),   (20)

the case analysis of Lemma 3 gives

x_{i+1} = x_i − c_i + (∂h_{i+1,i})^{−1}(max{S_i, 0}) − (∂h_{i,i+1})^{−1}(max{−S_i, 0}).   (21)

Based on the convexity of the functions, we have the following monotonicity property for any two sequences x_1, ..., x_n generated by Eqs. (20) and (21).

Corollary 4 (Monotonicity Property)  Let x^{(1)}_1 < x^{(2)}_1, and let (x^{(1)}_1, ..., x^{(1)}_n) and (x^{(2)}_1, ..., x^{(2)}_n) be the respective sequences of x values determined by the value of x_1 and Eqs. (20) and (21). Then x^{(1)}_i < x^{(2)}_i for all i = 1, ..., n.

Proof  The proof is by induction on i (i = 1, ..., n). For i = 1, the values of x^{(1)}_1 and x^{(2)}_1 are given and satisfy x^{(1)}_1 < x^{(2)}_1. Suppose the claim holds up to index i. Due to the convexity of h_{i,i+1} and h_{i+1,i}, Eq. (21) implies that the increment x_{i+1} − x_i is a nondecreasing function of S_i = \sum_{j=1}^{i} f'_j(x_j). On the other hand, since all f_j (j = 1, ..., i) are convex, the induction hypothesis gives f'_j(x^{(1)}_j) ≤ f'_j(x^{(2)}_j) for j = 1, ..., i, and hence S^{(1)}_i ≤ S^{(2)}_i. As a result, x^{(1)}_{i+1} = x^{(1)}_i + (x^{(1)}_{i+1} − x^{(1)}_i) < x^{(2)}_i + (x^{(2)}_{i+1} − x^{(2)}_i) = x^{(2)}_{i+1}.
The only equation in the KKT optimality conditions that a sequence x_1, ..., x_n determined by Eqs. (20) and (21) may violate is the last stationarity condition in (14), the one for x_n. This is because x_n, λ_{n−1,n} and λ_{n,n−1} are determined in the step computing x_n from x_{n−1}, based on the equation \sum_{j=1}^{n−1} f'_j(x_j) = λ_{n,n−1} − λ_{n−1,n}; however, the generated values of x_n, λ_{n−1,n} and λ_{n,n−1} do not necessarily satisfy the last stationarity condition for x_n, f'_n(x_n) − λ_{n−1,n} + λ_{n,n−1} = 0. On the other hand, we observe that if we sum up all the stationarity conditions (14) for the x variables, we obtain:

\sum_{i=1}^{n} f'_i(x_i) = 0.   (22)

The equation for x_n in the stationarity conditions (14) can be equivalently replaced by Eq. (22). Hence a sequence x_1, ..., x_n determined by Eqs. (20) and (21) satisfies the KKT optimality conditions if and only if Eq. (22) also holds.

The above analysis implies a binary search algorithm to solve the KKT optimality conditions. In every iteration, we try a value of x_1 and compute the values of x_2 to x_n via Eqs. (20) and (21). Then we check whether Eq. (22) holds for the generated sequence x_1, ..., x_n. If yes, the generated sequence satisfies the KKT optimality conditions and is thus an optimal solution to 1D-Denoise. Otherwise, we determine the next value of x_1 to try based on the sign of \sum_{i=1}^{n} f'_i(x_i) for the currently generated sequence: if the sum is positive, we try a smaller value of x_1; otherwise we try a larger value of x_1. The binary search efficiently determines the optimal value of x_1, and hence the optimal values of x_2 to x_n by Eqs. (20) and (21). The complete dsc-algorithm is presented in Box 1. Note that if the regularization functions h_i are quadratic (ℓ2), then the complexity of the subgradient-to-argument inverse is O(1), so the dsc-algorithm speeds up to O(n log U) complexity.
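As an illustration of the binary search with propagation, the sketch below instantiates the dsc-algorithm for the special instance f_i(x) = (x − a_i)², h_i(d) = d² (so c_i = 0 and the inverse operation is simply g ↦ g/2, collapsing Eq. (21) to x_{i+1} = x_i + S_i/2). This is a minimal sketch under these assumptions, not the paper's Box 1 pseudocode:

```python
def dsc_l2_tikhonov(a, iters=200):
    """Binary search on x_1; propagate x_2..x_n via the running derivative
    sum S_i = sum_{j<=i} f'_j(x_j); optimal when sum_i f'_i(x_i) = 0."""
    lo, hi = min(a), max(a)        # the optimum's x_1 lies in the data range
    x = list(a)
    for _ in range(iters):
        x1 = 0.5 * (lo + hi)
        x, S = [x1], 0.0
        for i in range(len(a) - 1):
            S += 2.0 * (x[i] - a[i])   # running sum of f'_j
            x.append(x[i] + S / 2.0)   # propagation step, Eq. (21) specialized
        S += 2.0 * (x[-1] - a[-1])     # total sum, Eq. (22) residual
        if S > 0:
            hi = x1                    # monotonicity: search a smaller x_1
        else:
            lo = x1
    return x
```

For a = [0, 2] the iterates converge to the optimum (2/3, 4/3) of the corresponding ℓ2-Tikhonov instance.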

Remark 6
One may ask whether transforming the unconstrained optimization form of 1D-Denoise (1) into the more complicated constrained optimization form (13) is necessary. Indeed, the KKT conditions with respect to the original formulation (1), shown below, are much simpler:

0 ∈ ∂f_1(x_1) + ∂h_1(x_1 − x_2),
0 ∈ ∂f_i(x_i) + ∂h_i(x_i − x_{i+1}) − ∂h_{i−1}(x_{i−1} − x_i),  i = 2, ..., n − 1,
0 ∈ ∂f_n(x_n) − ∂h_{n−1}(x_{n−1} − x_n).

However, we argue that this simpler form of the KKT conditions is not sufficient to provide a rigorous proof of the unique value propagation result (Lemma 3) for the differentiable-strictly-convex class. For a fixed value of x_1, since h_1 is strictly convex, we can use the 0-inclusion formula 0 ∈ f'_1(x_1) + ∂h_1(x_1 − x_2) to uniquely determine the value of x_2 as x_2 = x_1 − (∂h_1)^{−1}(−f'_1(x_1)). However, even when x_1 and x_2 are fixed, the subdifferential of h_1 at x_1 − x_2, ∂h_1(x_1 − x_2), may be a set containing more than one subgradient. As a result, the value of x_3 cannot be uniquely determined by the second 0-inclusion formula, nor does it help to add the two 0-inclusion formulas involving f'_1(x_1) and f'_2(x_2), because the terms involving ∂h_1(x_1 − x_2) in the two formulas need not cancel each other.
For the general convex-convex class, however, as we show in the next section, we can follow these simplified KKT conditions to present our algorithm concisely.

KKT approach for convex-convex class: cc-algorithm
We extend the ideas developed from the dsc-algorithm to solve 1D-Denoise of arbitrary convex fidelity and regularization functions, leading to the cc-algorithm.
The impact of removing the differentiability and strict convexity assumptions is two-fold: the non-differentiability of f_i implies that a given x_i value corresponds to a non-singleton subdifferential instead of a unique derivative f'_i(x_i) as in the differentiable case; the non-strict convexity of h_{i,i+1} (and h_{i+1,i}) implies that a subgradient value g is inversely mapped to a non-singleton interval of arguments [(∂h_{i,i+1})^{−1}_L(g), (∂h_{i,i+1})^{−1}_R(g)], instead of the unique argument (∂h_{i,i+1})^{−1}(g) (and (∂h_{i+1,i})^{−1}(g)) of the strictly convex case. Both observations imply that, based on the KKT optimality conditions, a given value of x_1 does not uniquely determine the other variables x_2, ..., x_n as values, but as ranges.
For this general convex-convex class, instead of working with the KKT conditions of the reformulation (13), we directly solve the KKT conditions of the original unconstrained formulation (1) of the 1D-Denoise problem:

0 ∈ ∂f_1(x_1) + ∂h_1(x_1 − x_2),
0 ∈ ∂f_i(x_i) + ∂h_i(x_i − x_{i+1}) − ∂h_{i−1}(x_{i−1} − x_i),  i = 2, ..., n − 1,   (23)
0 ∈ ∂f_n(x_n) − ∂h_{n−1}(x_{n−1} − x_n).

We first have the following range propagation result, which can be viewed as an extension of the propagation lemma in the dsc-algorithm.

Lemma 7 (Range Propagation Lemma)  Given any value of x_1, the KKT conditions (23) determine, for each i = 2, ..., n, a range [l_{kkt,i}, u_{kkt,i}] of values of x_i, computed iteratively by Eq. (24).
Proof  We prove the lemma by induction on i. Together with Eq. (24), we prove the series of inequalities (25). Let us first show that Eq. (24) and inequality (25) hold for i = 1. In order for the first 0-inclusion formula, 0 ∈ ∂f_1(x_1) + ∂h_1(x_1 − x_2), in the KKT conditions (23) to hold, inequality (25) must hold for i = 1; further, due to the convexity of h_1, Eq. (24) holds for i = 1. The induction from i − 1 to i is straightforward by leveraging the ith 0-inclusion formula in the KKT conditions (23): by the induction hypothesis on inequality (25) for i − 1, and according to the ith 0-inclusion formula, inequality (25) holds for i; further, due to the convexity of h_i, Eq. (24) holds for i as well.
Similar to the analysis of the dsc-algorithm, the last 0-inclusion formula yet to be satisfied is the nth one: 0 ∈ ∂f_n(x_n) − ∂h_{n−1}(x_{n−1} − x_n). By the proof of Lemma 7, inequality (25) holds for n − 1. Therefore, checking whether the last 0-inclusion formula holds is equivalent to checking whether inequality (26) holds. If inequality (26) holds, then we conclude that the given value of x_1 is an optimal value of x_1. Otherwise, similar to Corollary 4 for the dsc-algorithm, thanks to convexity, we have the following extended monotonicity property.

Corollary 8 (Extended Monotonicity Property)  Let x^{(1)}_1 < x^{(2)}_1, and let [l^{(1)}_{kkt,1}, u^{(1)}_{kkt,1}], ..., [l^{(1)}_{kkt,n}, u^{(1)}_{kkt,n}] and [l^{(2)}_{kkt,1}, u^{(2)}_{kkt,1}], ..., [l^{(2)}_{kkt,n}, u^{(2)}_{kkt,n}] be the respective sequences of ranges of x_i values determined by x^{(1)}_1 and x^{(2)}_1 according to Eq. (24). Then we have, for all i = 1, ..., n:

l^{(1)}_{kkt,i} ≤ l^{(2)}_{kkt,i},   (27)
u^{(1)}_{kkt,i} ≤ u^{(2)}_{kkt,i}.   (28)

Proof  The proof is by a straightforward induction on i. We prove inequalities (27); inequalities (28) are proved in a similar way. The inequality is true for i = 1 because l^{(1)}_{kkt,1} = x^{(1)}_1 < x^{(2)}_1 = l^{(2)}_{kkt,1}. Suppose the result is true for all j = 1, ..., i for some i (1 ≤ i ≤ n − 1); we prove that it is also true for i + 1. By the induction hypothesis and the convexity of the functions f_j, and then by the convexity of the function h_i, the claimed inequality follows for i + 1.

Combining Lemma 7 and Corollary 8 gives a binary search algorithm to find an optimal value of x_1. In every step, we try a value of x_1, propagate the ranges, and use inequality (26) together with the extended monotonicity property to decide whether to stop or in which direction to continue the search. Once an optimal value x*_1 is found, we plug it into 1D-Denoise (1) to reduce the problem from n variables to n − 1 variables of the same form. In the reduced 1D-Denoise problem, f_1(x*_1) is removed since it is a constant, and the fidelity function of x_2 becomes f_2(x_2) + h_1(x*_1 − x_2). Thus we can repeat the above process to find an optimal value x*_2 of x_2. As a result, n iterations suffice to obtain an optimal solution (x*_1, ..., x*_n) of 1D-Denoise: in the ith iteration, x*_i is found and the problem (1) is reduced to a smaller problem, of the same form, with one fewer variable.
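The reduction step at the end of each iteration can be sketched as follows (illustrative; the function name is ours): once x*_1 is fixed, the first regularizer h_1 is folded into the fidelity term of x_2, leaving a 1D-Denoise instance with one fewer variable:

```python
def reduce_problem(f, h, x1_opt):
    """Fix x_1 = x1_opt and return the reduced instance: the new first
    fidelity term is f_2(.) + h_1(x1_opt - .), and h_1 is dropped."""
    f2_new = lambda v, f2=f[1], h1=h[0]: f2(v) + h1(x1_opt - v)
    return [f2_new] + list(f[2:]), list(h[1:])
```

The reduced instance has the same 1D-Denoise form, so the same binary search subroutine applies unchanged.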
We first present in Box 2 a subroutine that finds an optimal value of x_i for the reduced 1D-Denoise problem of n − i + 1 variables x_i, x_{i+1}, ..., x_n (AlgoBox 2: Subroutine SOLVE_REDUCED_1D-Denoise). With this subroutine, the complete cc-algorithm is presented in Box 3.
The complexity of the subroutine SOLVE_REDUCED_1D-Denoise is O(n log² U); hence the total complexity of the cc-algorithm is O(n² log² U). Therefore:

Theorem 9
The 1D-Denoise problem with arbitrary convex fidelity and regularization functions is solved in O(n² log² U) time.
Note that if the regularization functions h_i are quadratic (ℓ2) or piecewise linear with few pieces, including absolute value (ℓ1) functions, then the complexity of the subgradient-to-argument inverse is O(1). As a result, the cc-algorithm saves an O(log U) factor, with the complexity improved to O(n² log U).

A numerically improved implementation
In this section, we present a numerically improved implementation of the KKT approach that combines the advantages of both the dsc-algorithm and the cc-algorithm. In addition, for the special cases of ℓ2-TV-unweighted, ℓ2-TV-weighted and 1D-Laplacian, we specialize the numerically improved implementation to make it empirically faster.
While the dsc-algorithm has an attractive complexity of O(n log² U), in practice it suffers from numerical instability. This is because the algorithm sums the derivatives f'_1(x_1) through f'_n(x_n), so the numerical errors from each derivative evaluation accumulate. In contrast, the cc-algorithm, of complexity O(n² log² U), does not suffer from numerical instability because it repeats the binary search for every variable x_i, at the expense of an additional O(n) complexity factor.
On the other hand, in the cc-algorithm, every propagation computes two sequences of quantities, (l_{kkt,j})_j and (u_{kkt,j})_j, which doubles the computation relative to a propagation in the dsc-algorithm.
As a result, we propose a numerical implementation that combines the advantages of both the dsc-algorithm and the cc-algorithm: we repeat the binary search for every variable x_i, while in every propagation we compute only one sequence of quantities; for the convex-convex class, we compute only the upper bounds (u_{kkt,j})_j (one could equally compute only the lower bounds (l_{kkt,j})_j). This implies the following changes to SOLVE_REDUCED_1D-Denoise in Box 2: in Step 1, the computations of (z_{j;L})_j and (l_{kkt,j})_j are skipped; in Step 2, the stopping criterion is reduced to u − l < ε. It is easy to verify that these changes do not affect the correctness of the subroutine. The trade-off is that we save computation in Step 1, while potentially paying the cost of more iterations because one stopping criterion is removed from Step 2. These changes have one further benefit: a single succinct code path covers both the differentiable-strictly-convex class and the general convex-convex class.
In addition, to eliminate redundant computation, we conduct a "bound check" as follows. A lower bound and an upper bound on the optimal value of each variable x_i are maintained; the lower bounds are initialized to −U and the upper bounds to U for all variables x_i. The bounds are dynamically updated through the propagation computations. During each propagation, we check whether each propagated value x_i violates its latest bounds; if a violation occurs, the current propagation stops (saving the remaining propagation computation on the variables after x_i) and the bounds of some variables are updated. Meanwhile, the type of violation guides the next trial value to propagate.
The pseudo-code of the numerically improved implementation, named the KKT-algorithm, is presented in Box 4.

Experimental study
We implement the KKT-algorithm of Box 4 in C++ and compare our implementation with existing solvers for the special cases of the 1D-Denoise problem (1) discussed in Sect. 1.1. For the ℓ2-TV-unweighted problem (3), the ℓ2-TV-weighted problem (4), the ℓ1-TV problem (5) and the ℓ2-Tikhonov problem (6), we compared our implementation with efficient specialized C++ solvers [2,4,5,14,16]; the experimental results showed that our algorithm is not advantageous there. For the ℓp-ℓq problem (7) and problems with Huber objective functions (9), the work of [21] does not provide a C++ implementation of their algorithm, nor are we aware of other publicly available specialized C++ solvers for these classes of problems. As a result, we compare our implementation with popular C++ nonlinear optimization solvers on the ℓp-ℓq problem (7) and two problems with Huber objective functions. These solvers solve the three problems using first-order methods, where we feed the objective values and gradient values of the problems into the solvers. We find that our solver is much faster than the nonlinear optimization solvers on all three problems.
The two problems with Huber objective functions compared in the experiments are:

(Huber-TV)   min \sum_{i=1}^{n} huber_M(x_i − a_i) + \sum_{i=1}^{n-1} c_{i,i+1} |x_i − x_{i+1}|,   (30)

(ℓ2-Huber)   min \sum_{i=1}^{n} (x_i − a_i)^2 + \sum_{i=1}^{n-1} c_{i,i+1} huber_M(x_i − x_{i+1}).   (31)

The Huber-TV problem (30) has Huber functions in the fidelity terms, and the ℓ2-Huber problem (31) has Huber functions in the regularization terms. We compare the KKT-algorithm with the following three popular C++ nonlinear optimization solvers:
1. Ceres solver: provided by Google and used in production at Google since 2010, it is a popular optimization solver for robotics and other areas in industry. It provides an API for modeling and solving general unconstrained minimization problems, which suits our setting here.
2. NLopt solver: an open source library for nonlinear optimization that provides many different algorithms and low-level code optimization.
3. dlib solver: an open source library of machine learning algorithms and tools. It contains general purpose unconstrained nonlinear optimization algorithms that are suitable for our setting here.
The three solvers provide different first-order algorithms to solve unconstrained nonlinear optimization problems, which require the objective value and gradient value of a problem to be fed in. For each of the three solvers, we ran multiple different first-order methods they provide and recorded each method's running time and output objective value when the method stopped. These records provide a range of running times and objective values for each solver. We compare the running time and objective value output by the KKT-algorithm with those ranges for each solver. For the objective value comparison, rather than reporting the raw objective values, we report the relative objective value gap, defined as the objective value of the compared solver minus the objective value of the KKT-algorithm, divided by the objective value of the KKT-algorithm. Since the objective values of the three problems considered here are always positive, a positive relative objective value gap implies that the KKT-algorithm gives a better solution upon stopping, while a negative relative objective value gap implies that the compared nonlinear optimization solver gives a better solution.
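To illustrate what is fed to the first-order solvers, the following is a minimal sketch of an objective/gradient evaluation for the Huber-TV problem (30). It is not the paper's implementation: the Huber form used here (f_k(z) = z²/2 for |z| ≤ k, else k(|z| − k/2)) is one standard convention and may differ from definition (9) by a scaling constant, and we take 0 as the subgradient of |·| at 0 since the TV term is not differentiable there.

```cpp
#include <cmath>
#include <vector>

// One standard Huber function with parameter k (assumed form; the paper's
// definition (9) may differ by a scaling constant).
double huber(double z, double k) {
    double az = std::fabs(z);
    return az <= k ? 0.5 * z * z : k * (az - 0.5 * k);
}

double huber_grad(double z, double k) {
    if (std::fabs(z) <= k) return z;
    return z > 0 ? k : -k;
}

// Objective and (sub)gradient of the Huber-TV problem (30):
//   sum_i f_{k_i}(x_i - a_i) + sum_i c_{i,i+1} |x_i - x_{i+1}|.
// The TV term is nonsmooth; at x_i = x_{i+1} we use 0 as a subgradient of |.|.
double huber_tv_objective(const std::vector<double>& x,
                          const std::vector<double>& a,
                          const std::vector<double>& k,
                          const std::vector<double>& c,   // c[i] = c_{i,i+1}
                          std::vector<double>& grad) {
    std::size_t n = x.size();
    grad.assign(n, 0.0);
    double obj = 0.0;
    for (std::size_t i = 0; i < n; ++i) {         // fidelity terms
        obj += huber(x[i] - a[i], k[i]);
        grad[i] += huber_grad(x[i] - a[i], k[i]);
    }
    for (std::size_t i = 0; i + 1 < n; ++i) {     // regularization terms
        double d = x[i] - x[i + 1];
        obj += c[i] * std::fabs(d);
        double s = (d > 0) - (d < 0);             // sign(d), 0 when d == 0
        grad[i]     += c[i] * s;
        grad[i + 1] -= c[i] * s;
    }
    return obj;
}
```

A callback of this shape is what each solver's first-order methods invoke at every iterate.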
In nonlinear optimization, stopping criteria need to be specified for each numerical algorithm, including ours. The stopping criteria of the KKT-algorithm and the three solvers are set as follows: 1. KKT-algorithm: The stopping criterion is that the gap between the lower bound ℓ_i and upper bound u_i for each variable x_i is less than ε = 10^−6 (see the algorithm pseudo-code in Box 4). 2. Ceres solver: The stopping criteria are that either (i) the maximum difference between two consecutive solution vectors is less than 10^−6, or (ii) a running time limit of 5 minutes is hit, whichever is reached first. 3. NLopt solver: Same as the Ceres solver. 4. dlib solver: dlib does not provide the above two stopping criteria. Instead, it provides an objective value change criterion: if the objective value change between successive iterations is less than 10^−2, the algorithm stops and outputs its results.
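The step-based test used for Ceres and NLopt above can be sketched as follows; this is an illustrative helper, not code from the solvers themselves.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Stop when the maximum coordinate-wise difference between two consecutive
// solution vectors falls below tol (criterion (i) above, tol = 1e-6).
bool step_converged(const std::vector<double>& x_prev,
                    const std::vector<double>& x_curr,
                    double tol = 1e-6) {
    double max_diff = 0.0;
    for (std::size_t i = 0; i < x_prev.size(); ++i)
        max_diff = std::max(max_diff, std::fabs(x_curr[i] - x_prev[i]));
    return max_diff < tol;
}
```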
In all three problems studied, each value of c_i and c_{i,i+1} is sampled uniformly from (0, 1), and each value of a_i is sampled uniformly from (−1, 1). For the Huber-TV problem (30), each k_i value is set to k_i = b_i |a_i|, where the coefficient b_i is sampled uniformly from (0.5, 1.0); each k_{i,i+1} value for the ℓ2-Huber problem (31) is determined analogously. We run the experiment on a MacBook Pro laptop with an Intel Core i7 2.2 GHz processor. For each of the three problems, we run the KKT-algorithm and the three solvers with the number of variables, n, increasing from 10^2 to 10^7 by factors of 10. For each n, we randomly generate 5 problem instances following the above parameter generation scheme, and record the running times and objective values at termination for the KKT-algorithm and the three solvers. We find that in every run, the KKT-algorithm gives a smaller objective value in a much shorter time than all compared methods provided by the three solvers. In the following tables, we report averaged results, including the averaged running time of the KKT-algorithm, the averaged range of running times of the three solvers, and the averaged range of relative objective value gaps of the three solvers to the KKT-algorithm. We note that, for all the problems tested, the algorithm achieving the shortest running time and the algorithm achieving the best objective value within a given solver are often not the same.

[Table 1: averaged minimum and maximum running times over each solver's different algorithms for the ℓp-ℓq problem (7) of increasing number of variables, n. All nonzero numbers are rounded to two significant digits after the decimal point.]

[Table 2: averaged minimum and maximum relative objective value gaps from the three nonlinear optimization solvers to the KKT-algorithm for the ℓp-ℓq problem (7) (p = q = 4) of increasing number of variables, n. All numbers are rounded to two significant digits after the decimal point. A "-" in the KKT row denotes that the relative objective value gaps are computed against the objective values of the KKT-algorithm.]
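The instance generation scheme described above can be sketched as follows for the Huber-TV problem (30); the struct and function names are illustrative, not taken from the paper's implementation.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Random Huber-TV instance: c_{i,i+1} ~ U(0,1), a_i ~ U(-1,1),
// k_i = b_i * |a_i| with b_i ~ U(0.5, 1.0), as in the experimental setup.
struct HuberTVInstance {
    std::vector<double> a, k, c;  // c[i] = c_{i,i+1}
};

HuberTVInstance make_instance(std::size_t n, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> uc(0.0, 1.0);
    std::uniform_real_distribution<double> ua(-1.0, 1.0);
    std::uniform_real_distribution<double> ub(0.5, 1.0);
    HuberTVInstance inst;
    for (std::size_t i = 0; i < n; ++i) {
        inst.a.push_back(ua(gen));
        inst.k.push_back(ub(gen) * std::fabs(inst.a.back()));
    }
    for (std::size_t i = 0; i + 1 < n; ++i)
        inst.c.push_back(uc(gen));
    return inst;
}
```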
The experimental results for solving the ℓp-ℓq problem (7) with different solvers are shown in Table 1 for running times and in Table 2 for objective value comparison.
For the ℓp-ℓq problem, one can see that all solvers have similar running times for n = 10^2. For n increasing from 10^3 to 10^7, the KKT-algorithm is 3 to 7 times faster than Ceres and 10 to 25 times faster than NLopt. For dlib, except for the case n = 10^3, the KKT-algorithm is 1.4 to 8 times faster. The gaps between objective values are not significant.
The experimental results for solving the Huber-TV problem (30) with different solvers are shown in Table 3 for running times and in Table 4 for objective value comparison.

[Table 3: averaged minimum and maximum running times over each solver's different algorithms for the Huber-TV problem (30) of increasing number of variables, n. All nonzero numbers are rounded to two significant digits after the decimal point.]

[Table 4: averaged minimum and maximum relative objective value gaps from the three nonlinear optimization solvers to the KKT-algorithm for the Huber-TV problem (30) of increasing number of variables, n. All numbers are rounded to two significant digits after the decimal point. A "-" in the KKT row denotes that the relative objective value gaps are computed against the objective values of the KKT-algorithm.]

For the Huber-TV problem (30), the four solvers all have negligible running times for n = 10^2. For n = 10^3, the running time of the KKT-algorithm and the minimum running time of Ceres remain negligible, but the minimum running times of NLopt and dlib are several times slower than Ceres. From n = 10^4 to n = 10^7, the KKT-algorithm is 8 to 10 times faster than Ceres, 12 to 18 times faster than NLopt, and 160 to 324 times faster than dlib. The relative objective value gaps are overall much more significant for the Huber-TV problem than for the ℓp-ℓq problem. One reason might be that the TV regularization terms are not strictly convex and are not differentiable at 0. Ceres and NLopt both have their largest relative objective value gaps at the case n = 10^7, while dlib has a good relative objective value gap for this large-scale case. However, dlib has the largest relative objective value gap for the small-scale case of n = 10^2.

The experimental results for solving the ℓ2-Huber problem (31) with different solvers are shown in Table 5 for running times and in Table 6 for objective value comparison.

[Table 5: averaged minimum and maximum running times over each solver's different algorithms for the ℓ2-Huber problem (31) of increasing number of variables, n. All nonzero numbers are rounded to two significant digits after the decimal point.]

[Table 6: averaged minimum and maximum relative objective value gaps from the three nonlinear optimization solvers to the KKT-algorithm for the ℓ2-Huber problem (31) of increasing number of variables, n. All numbers are rounded to two significant digits after the decimal point. A "-" in the KKT row denotes that the relative objective value gaps are computed against the objective values of the KKT-algorithm.]
For the ℓ2-Huber problem (31), the four solvers all have negligible running times for n = 10^2 and n = 10^3. From n = 10^4 to n = 10^7, the KKT-algorithm is 2.5 to 5 times faster than Ceres, 16 to 30 times faster than NLopt, and 1.3 to 5.7 times faster than dlib. As for the ℓp-ℓq problem, the gaps between objective values are negligible.

Conclusion and future directions
In this paper we present two efficient algorithms to solve the 1D-Denoise problem (1) for different classes of objective functions. Both algorithms follow a unified approach that directly solves the KKT optimality conditions of the 1D-Denoise problem. For convex differentiable fidelity functions and strictly convex regularization functions, our dsc-algorithm has O(n log^2 U) time complexity; for arbitrary convex fidelity and regularization functions, our cc-algorithm has O(n^2 log^2 U) time complexity. The numerically improved algorithm, the KKT-algorithm, which combines the advantages of both the dsc-algorithm and the cc-algorithm, is presented and implemented in C++. For complicated objective functions, such as higher-order ℓp functions with p > 2 and Huber functions, the KKT-algorithm performs much better than existing popular nonlinear optimization solvers.
There are many directions that could make use of these results or push research further based on the ideas presented here. The 1D-Denoise problem considered here only penalizes the first-order differences x_i − x_{i+1}. It would be interesting to study whether a similar technique can be applied to solve problems that penalize higher-order differences, such as the second-order difference x_{i−1} − 2x_i + x_{i+1}, which appears in some trend filtering applications [15]. On the other hand, it would be interesting to study whether the algorithms here can be used as subroutines to solve higher-dimensional denoising problems, such as 2D denoising problems in imaging [2,16,21]. From a graph-theoretic perspective, the 1D-Denoise problem is defined on a path graph. It would be interesting to study whether the algorithms here can be adapted to yield faster algorithms for problems defined on more general graphs, such as the MRF problem (2). Inspired by Weinmann et al.'s work [21] on manifold-valued data, we are also very interested in investigating whether our methods can be generalized from Euclidean space to manifolds.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.