Abstract
Learning dictionaries suitable for sparse coding instead of using engineered bases has proven effective in a variety of image processing tasks. This paper studies the optimization of dictionaries on image data where the representation is enforced to be explicitly sparse with respect to a smooth, normalized sparseness measure. This involves the computation of Euclidean projections onto level sets of the sparseness measure. While previous algorithms for this optimization problem had at least quasi-linear time complexity, here the first algorithm with linear time complexity and constant space complexity is proposed. The key for this is the mathematically rigorous derivation of a characterization of the projection’s result based on a soft-shrinkage function. This theory is applied in an original algorithm called Easy Dictionary Learning (EZDL), which learns dictionaries with a simple and fast-to-compute Hebbian-like learning rule. The new algorithm is efficient, expressive and particularly simple to implement. It is demonstrated that despite its simplicity, the proposed learning algorithm is able to generate a rich variety of dictionaries, in particular a topographic organization of atoms or separable atoms. Further, the dictionaries are as expressive as those of benchmark learning algorithms in terms of the reproduction quality on entire images, and result in an equivalent denoising performance. EZDL learns approximately 30 % faster than the already very efficient Online Dictionary Learning algorithm, and is therefore eligible for rapid data set analysis and problems with vast quantities of learning samples.
References
Aharon, M., Elad, M., & Bruckstein, A. (2006). K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11), 4311–4322.
Bauer, F., & Memisevic, R. (2013). Feature grouping from spatially constrained multiplicative interaction. In Proceedings of the International Conference on Learning Representations. arXiv:1301.3391v3.
Bell, A. J., & Sejnowski, T. J. (1997). The "independent components" of natural scenes are edge filters. Vision Research, 37(23), 3327–3338.
Bertsekas, D. P. (1999). Nonlinear programming (2nd ed.). Belmont: Athena Scientific.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.
Blackford, L. S., et al. (2002). An updated set of basic linear algebra subprograms (BLAS). ACM Transactions on Mathematical Software, 28(2), 135–151.
Bottou, L., & LeCun, Y. (2004). Large scale online learning. In Advances in Neural Information Processing Systems (Vol. 16, pp. 217–224).
Bredies, K., & Lorenz, D. A. (2008). Linear convergence of iterative soft-thresholding. Journal of Fourier Analysis and Applications, 14(5–6), 813–837.
Coates, A., & Ng, A. Y. (2011). The importance of encoding versus training with sparse coding and vector quantization. In Proceedings of the International Conference on Machine Learning (pp. 921–928).
Deutsch, F. (2001). Best approximation in inner product spaces. New York: Springer.
Dong, W., Zhang, L., Shi, G., & Wu, X. (2011). Image deblurring and super-resolution by adaptive sparse domain selection and adaptive regularization. IEEE Transactions on Image Processing, 20(7), 1838–1857.
Donoho, D. L. (1995). De-noising by soft-thresholding. IEEE Transactions on Information Theory, 41(3), 613–627.
Donoho, D. L. (2006). For most large underdetermined systems of linear equations the minimal \(\ell _1\)-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59(6), 797–829.
Duarte-Carvajalino, J. M., & Sapiro, G. (2009). Learning to sense sparse signals: Simultaneous sensing matrix and sparsifying dictionary optimization. IEEE Transactions on Image Processing, 18(7), 1395–1408.
Eckart, C., & Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika, 1(3), 211–218.
Elad, M. (2006). Why simple shrinkage is still relevant for redundant representations? IEEE Transactions on Information Theory, 52(12), 5559–5569.
Foucart, S., & Rauhut, H. (2013). Mathematical introduction to compressive sensing. New York: Birkhäuser.
Galassi, M., Davies, J., Theiler, J., Gough, B., Jungman, G., Alken, P., et al. (2009). GNU scientific library reference manual (3rd ed.). Bristol: Network Theory Ltd.
Gharavi-Alkhansari, M., & Huang, T. S. (1998). A fast orthogonal matching pursuit algorithm. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (Vol. III, pp. 1389–1392).
Goldberg, D. (1991). What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys, 23(1), 5–48.
Hawe, S., Seibert, M., & Kleinsteuber, M. (2013). Separable dictionary learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 438–445).
Hoggar, S. G. (2006). Mathematics of digital images: Creation, compression, restoration, recognition. Cambridge: Cambridge University Press.
Horev, I., Bryt, O., & Rubinstein, R. (2012). Adaptive image compression using sparse dictionaries. In Proceedings of the International Conference on Systems, Signals and Image Processing (pp. 592–595).
Hoyer, P. O. (2004). Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5, 1457–1469.
Hoyer, P. O., & Hyvärinen, A. (2000). Independent component analysis applied to feature extraction from colour and stereo images. Network: Computation in Neural Systems, 11(3), 191–210.
Hubel, D. H., & Wiesel, T. N. (1959). Receptive fields of single neurones in the cat’s striate cortex. Journal of Physiology, 148(3), 574–591.
Hurley, N., & Rickard, S. (2009). Comparing measures of sparsity. IEEE Transactions on Information Theory, 55(10), 4723–4741.
Hyvärinen, A. (1999). Sparse code shrinkage: Denoising of nongaussian data by maximum likelihood estimation. Neural Computation, 11(7), 1739–1768.
Hyvärinen, A., & Hoyer, P. O. (2000). Emergence of phase- and shift-invariant features by decomposition of natural images into independent feature subspaces. Neural Computation, 12(7), 1705–1720.
Hyvärinen, A., Hoyer, P. O., & Inki, M. (2001). Topographic independent component analysis. Neural Computation, 13(7), 1527–1558.
Hyvärinen, A., Hurri, J., & Hoyer, P. O. (2009). Natural image statistics–A probabilistic approach to early computational vision. London: Springer.
Jones, J. P., & Palmer, L. A. (1987). An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58(6), 1233–1258.
Kavukcuoglu, K., Ranzato, M., Fergus, R., & LeCun, Y. (2009). Learning invariant features through topographic filter maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1605–1612).
Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE, 78(9), 1464–1480.
Kreutz-Delgado, K., Murray, J. F., Rao, B. D., Engan, K., Lee, T.-W., & Sejnowski, T. J. (2003). Dictionary learning algorithms for sparse representation. Neural Computation, 15(2), 349–396.
Laughlin, S. B., & Sejnowski, T. J. (2003). Communication in neuronal networks. Science, 301(5641), 1870–1874.
Liu, J., & Ye, J. (2009). Efficient Euclidean projections in linear time. In Proceedings of the International Conference on Machine Learning (pp. 657–664).
Lopes, M. E. (2013). Estimating unknown sparsity in compressed sensing. In Proceedings of the International Conference on Machine Learning (pp. 217–225).
Mairal, J., Bach, F., Ponce, J., & Sapiro, G. (2009a). Online dictionary learning for sparse coding. In Proceedings of the International Conference on Machine Learning (pp. 689–696).
Mairal, J., Bach, F., Ponce, J., Sapiro, G., & Zisserman, A. (2009b). Non-local sparse models for image restoration. In Proceedings of the International Conference on Computer Vision (pp. 2272–2279).
Nelder, J. A., & Mead, R. (1965). A simplex method for function minimization. The Computer Journal, 7(4), 308–313.
Neudecker, H. (1969). Some theorems on matrix differentiation with special reference to Kronecker matrix products. Journal of the American Statistical Association, 64(327), 953–963.
Olmos, A., & Kingdom, F. A. A. (2004). A biologically inspired algorithm for the recovery of shading and reflectance images. Perception, 33(12), 1463–1473.
Olshausen, B. A. (2003). Learning sparse, overcomplete representations of time-varying natural images. In Proceedings of the International Conference on Image Processing (Vol. I, pp. 41–44).
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583), 607–609.
Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23), 3311–3325.
Potluru, V. K., Plis, S. M., Le Roux, J., Pearlmutter, B. A., Calhoun, V. D., & Hayes, T. P. (2013). Block coordinate descent for sparse NMF. In Proceedings of the International Conference on Learning Representations. arXiv:1301.3527v2.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (2007). Numerical recipes: The art of scientific computing (3rd ed.). Cambridge: Cambridge University Press.
Rigamonti, R., Sironi, A., Lepetit, V., & Fua, P. (2013). Learning separable filters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2754–2761).
Ringach, D. L. (2002). Spatial structure and symmetry of simple-cell receptive fields in macaque primary visual cortex. Journal of Neurophysiology, 88(1), 455–463.
Rodgers, J. L., & Nicewander, W. A. (1988). Thirteen ways to look at the correlation coefficient. The American Statistician, 42(1), 59–66.
Rozell, C. J., Johnson, D. H., Baraniuk, R. G., & Olshausen, B. A. (2008). Sparse coding via thresholding and local competition in neural circuits. Neural Computation, 20(10), 2526–2563.
Skretting, K., & Engan, K. (2010). Recursive least squares dictionary learning algorithm. IEEE Transactions on Signal Processing, 58(4), 2121–2130.
Skretting, K., & Engan, K. (2011). Image compression using learned dictionaries by RLS-DLA and compared with K-SVD. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 1517–1520).
Society of Motion Picture and Television Engineers (SMPTE). (1993). Recommended practice RP 177-1993: Derivation of basic television color equations.
Theis, F. J., Stadlthanner, K., & Tanaka, T. (2005). First results on uniqueness of sparse non-negative matrix factorization. In Proceedings of the European Signal Processing Conference (Vol. 3, pp. 1672–1675).
Thom, M., & Palm, G. (2013). Sparse activity and sparse connectivity in supervised learning. Journal of Machine Learning Research, 14, 1091–1143.
Tošić, I., Olshausen, B. A., & Culpepper, B. J. (2011). Learning sparse representations of depth. IEEE Journal of Selected Topics in Signal Processing, 5(5), 941–952.
Traub, J. F. (1964). Iterative methods for the solution of equations. Englewood Cliffs: Prentice-Hall.
van Hateren, J. H., & Ruderman, D. L. (1998). Independent component analysis of natural image sequences yields spatio-temporal filters similar to simple cells in primary visual cortex. Proceedings of the Royal Society B, 265(1412), 2315–2320.
Wang, Z., & Bovik, A. C. (2009). Mean squared error: Love it or leave it? A new look at signal fidelity measures. IEEE Signal Processing Magazine, 26(1), 98–117.
Watson, A. B. (1994). Image compression using the discrete cosine transform. The Mathematica Journal, 4(1), 81–88.
Willmore, B., & Tolhurst, D. J. (2001). Characterizing the sparseness of neural codes. Network: Computation in Neural Systems, 12(3), 255–270.
Wilson, D. R., & Martinez, T. R. (2003). The general inefficiency of batch training for gradient descent learning. Neural Networks, 16(10), 1429–1451.
Yang, J., Wang, Z., Lin, Z., Cohen, S., & Huang, T. (2012). Coupled dictionary training for image super-resolution. IEEE Transactions on Image Processing, 21(8), 3467–3478.
Yang, J., Wright, J., Huang, T., & Ma, Y. (2010). Image super-resolution via sparse representation. IEEE Transactions on Image Processing, 19(11), 2861–2873.
Zelnik-Manor, L., Rosenblum, K., & Eldar, Y. C. (2012). Dictionary optimization for block-sparse representations. IEEE Transactions on Signal Processing, 60(5), 2386–2395.
Acknowledgments
The authors are grateful to Heiko Neumann, Florian Schüle, and Michael Gabb for helpful discussions. We would like to thank Julien Mairal and Karl Skretting for making implementations of their algorithms available. Parts of this work were performed on the computational resource bwUniCluster funded by the Ministry of Science, Research and Arts and the Universities of the State of Baden-Württemberg, Germany, within the framework program bwHPC. This work was supported by Daimler AG, Germany.
Additional information
Communicated by Julien Mairal, Francis Bach, Michael Elad.
Appendix: Technical Details and Proofs for Section 2
This appendix studies the algorithmic computation of Euclidean projections onto level sets of Hoyer’s \(\sigma \) in greater detail, and in particular proves the correctness of the algorithm proposed in Sect. 2.
For a non-empty subset \(M\subseteq \mathbb {R}^n\) of the Euclidean space and a point \(x\in \mathbb {R}^n\), we call \({{\mathrm{proj}}}_M(x) := \left\{ y\in M \;\vert \; \left\| x - y\right\| _2 = \inf \nolimits _{z\in M}\left\| x - z\right\| _2\right\} \) the set of Euclidean projections of \(x\) onto \(M\) (Deutsch 2001). Since we only consider situations in which \({{\mathrm{proj}}}_M(x) = \{y\}\) is a singleton, we may also write \(y = {{\mathrm{proj}}}_M(x)\).
Without loss of generality, we can compute \({{\mathrm{proj}}}_T(x)\) for a vector \(x\in \mathbb {R}_{\ge 0}^n\) within the non-negative orthant instead of \({{\mathrm{proj}}}_S(x)\) for an arbitrary point \(x\in \mathbb {R}^n\) to yield sparseness-enforcing projections, where \(T\) and \(S\) are as defined in Sect. 2. First, the actual scale is irrelevant as we can simply re-scale the result of the projection (Thom and Palm 2013, Remark 5). Second, the constraint that the projection lies in the non-negative orthant \(\mathbb {R}_{\ge 0}^n\) can easily be handled by flipping the signs of certain coordinates (Thom and Palm 2013, Lemma 11). Finally, all entries of \(x\) can be assumed non-negative with Corollary 19 from Thom and Palm (2013).
We note that \(T\) is non-convex because of the \(\left\| s\right\| _2 \!=\! \lambda _2\) constraint. Moreover, \(T\ne \emptyset \) for all target sparseness degrees \(\sigma ^*\in \left( 0,\ 1\right) \) which we show here by construction (see also Remark 18 in Thom and Palm (2013) for further details): Let \(\psi := \frac{1}{n}\left( \lambda _1 - \sqrt{\frac{n\lambda _2^2 - \lambda _1^2}{n - 1}}\right) > 0\) and \(\omega := \lambda _1 - (n - 1) \psi > 0\), then the point \(q := \sum _{i = 1}^{n - 1} \psi e_i + \omega e_n\in \mathbb {R}^n\) lies in \(T\), where \(e_i\in \mathbb {R}^n\) denotes the \(i\)-th canonical basis vector. Since all coordinates of \(q\) are positive, \(T\) always contains points with an \(L_0\) pseudo-norm of \(n\). If we had used the \(L_0\) pseudo-norm to measure sparseness, then \(q\) would have the same sparseness degree as, for example, the vector with all entries equal to unity. If, however, \(\sigma ^*\) is close to one, then there is only one large value \(\omega \) in \(q\) and all the other entries equaling \(\psi \) are very small but positive. This simple example demonstrates that in situations where the presence of noise cannot be eliminated, Hoyer’s \(\sigma \) is a much more robust sparseness measure than the \(L_0\) pseudo-norm.
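The construction can be checked numerically. The following minimal sketch (assuming NumPy, and using Hoyer's sparseness measure \(\sigma (x) = \frac{\sqrt{n} - \left\| x\right\| _1/\left\| x\right\| _2}{\sqrt{n} - 1}\); the variable names and parameter values are illustrative, not taken from the paper) builds such a point \(q\) for a given target sparseness degree and verifies that it attains the prescribed norms:

```python
import numpy as np

def hoyer_sigma(x):
    """Hoyer's smoothed sparseness measure, a value in [0, 1]."""
    n = x.size
    return (np.sqrt(n) - np.linalg.norm(x, 1) / np.linalg.norm(x, 2)) / (np.sqrt(n) - 1.0)

# Target set T = {s >= 0 : ||s||_1 = lam1, ||s||_2 = lam2} for sparseness degree sigma*.
n, sigma_star, lam2 = 16, 0.9, 1.0
lam1 = lam2 * (np.sqrt(n) - sigma_star * (np.sqrt(n) - 1.0))

# Point q with n - 1 small entries psi and one large entry omega, as in the text.
psi = (lam1 - np.sqrt((n * lam2**2 - lam1**2) / (n - 1.0))) / n
omega = lam1 - (n - 1) * psi
q = np.full(n, psi)
q[-1] = omega

assert psi > 0.0 and omega > 0.0
assert np.isclose(np.linalg.norm(q, 1), lam1) and np.isclose(np.linalg.norm(q, 2), lam2)
assert np.isclose(hoyer_sigma(q), sigma_star)
```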
1.1 Representation Theorem
Before proving a theorem on the characterization of the projection onto \(T\), we first fix some notation. As above, let \(e_i\in \mathbb {R}^n\) denote the \(i\)-th canonical basis vector and let furthermore \(e := \sum _{i=1}^ne_i\in \mathbb {R}^n\) be the vector where all entries are one. We note that if a point \(x\) resides in the non-negative orthant, then \(\left\| x\right\| _1 = e^Tx\). Subscripts to vectors denote the corresponding coordinate, except for \(e\) and \(e_i\). For example, we have that \(x_i = e_i^Tx\). We abbreviate \(\xi \in \mathbb {R}_{\ge 0}\) with \(\xi \ge 0\) when it is clear that \(\xi \) is a real number. When \(I\subseteq \{1,\ldots ,n\}\) is an index set with \(d := \left| I \right| \) elements, say \(i_1 < \cdots < i_d\), then the unique matrix \(V_I\in \{0, 1\}^{d\times n}\) with \(V_I x = \left( x_{i_1},\ \ldots ,\ x_{i_d}\right) ^T\in \mathbb {R}^d\) for all \(x\in \mathbb {R}^n\) is called the slicing operator. A useful relation between the \(L_0\) pseudo-norm, the Manhattan norm and the Euclidean norm is \(\left\| x\right\| _1 \le \sqrt{\left\| x\right\| _0}\cdot \left\| x\right\| _2\) for all points \(x\in \mathbb {R}^n\).
We are now in a position to formalize the representation theorem:
Theorem 1
Let \(x\in \mathbb {R}_{\ge 0}^n\backslash T\) and \(p := {{\mathrm{proj}}}_T(x)\) be unique. Then there is exactly one number \(\alpha ^*\in \mathbb {R}\) such that \(p = \beta ^*\cdot \max \left( x - \alpha ^*\cdot e,\ 0\right) \), where \(\beta ^* := \frac{\lambda _2}{\left\| \max \left( x - \alpha ^*\cdot e,\ 0\right) \right\| _2}\) is a scaling constant. Moreover, if \(I := \{i\in \{1,\ldots ,n\} \vert p_i > 0\} = \{i_1,\ldots ,i_d\}\), \(d := \left| I \right| \) and \(i_1 < \cdots < i_d\), denotes the set of the \(d\) coordinates in which \(p\) does not vanish, then \(\alpha ^* = \frac{1}{d}\left( \left\| V_I x\right\| _1 - \lambda _1\sqrt{\frac{d\left\| V_I x\right\| _2^2 - \left\| V_I x\right\| _1^2}{d\lambda _2^2 - \lambda _1^2}}\right) \) and \(\beta ^* = \sqrt{\frac{d\lambda _2^2 - \lambda _1^2}{d\left\| V_I x\right\| _2^2 - \left\| V_I x\right\| _1^2}}\).
Proof
It is possible to prove this claim either constructively or implicitly, where both variants differ in whether the set \(I\) of all positive coordinates in the projection can be computed from \(x\) or must be assumed to be known. We first present a constructive proof based on a geometric analysis conducted in Thom and Palm (2013), which contributes to deepening our insight into the involved computations. As an alternative, we also provide a rigorous proof using the method of Lagrange multipliers which greatly enhances the unproven analysis of Potluru et al. (2013, Section 3.1).
We first note that when there are \(\alpha ^*\in \mathbb {R}\) and \(\beta ^* > 0 \) so that we have \(p = \beta ^*\cdot \max \left( x - \alpha ^*\cdot e,\ 0\right) \), then \(\beta ^*\) is determined already through \(\alpha ^*\) because it holds that \(\lambda _2 = \left\| p\right\| _2 = \beta ^*\cdot \left\| \max \left( x - \alpha ^*\cdot e,\ 0\right) \right\| _2\). We now show that the claimed representation is unique, and then present the two different proofs for the existence of the representation.
Uniqueness: It is \(d\ge 2\), since \(d = 0\) would violate that \(\left\| p\right\| _1 > 0\) and \(d = 1\) is impossible because \(\sigma ^* \ne 1\). We first show that there are two distinct indices \(i,j\in I\) with \(p_i\ne p_j\). Assume this was not the case, then \(p_i =: \gamma \), say, for all \(i\in I\). Let \(j := \hbox {arg min}_{i\in I} x_i\) be the index of the smallest coordinate of \(x\) which has its index in \(I\). Let \(\psi := \frac{1}{d}\left( \lambda _1 - \sqrt{\frac{d\lambda _2^2 - \lambda _1^2}{d - 1}}\right) \) and \(\omega := \lambda _1 - (d - 1)\psi \in \mathbb {R}\) be numbers and define \(s := \sum _{i\in I\backslash \{j\}}\psi e_i + \omega e_j\in \mathbb {R}^n\). Then \(s\in T\) like in the argument where we have shown that \(T\) is non-empty. Because of \(\left\| p\right\| _1 = \left\| s\right\| _1\) and \(\left\| p\right\| _2 = \left\| s\right\| _2\), it follows that \(\left\| x - p\right\| _2^2 - \left\| x - s\right\| _2^2 = 2x^T\left( s - p\right) = 2\sum _{i\in I\backslash \{j\}} x_i(\psi - \gamma ) + 2x_j (\omega - \gamma ) \ge 2x_j((d-1)\psi + \omega - d\gamma ) = 2x_j\left( \left\| s\right\| _1 - \left\| p\right\| _1\right) = 0\). Hence \(s\) is at least as good an approximation to \(x\) as \(p\), violating the uniqueness of \(p\). Therefore, it is impossible that the set \(\{p_i \vert i\in I\}\) is a singleton.
Now let \(i,j\in I\) with \(p_i\!\ne \! p_j\) and \(\alpha _1^*,\alpha _2^*,\beta _1^*,\beta _2^*\!\in \!\mathbb {R}\) such that \(p\! =\! \beta _1^*\cdot \max \left( x \!-\! \alpha _1^*\cdot e,\ 0\right) \!=\! \beta _2^*\cdot \max \left( x \!- \!\alpha _2^*\cdot e,\ 0\right) \). Clearly \(\beta _1^*\ne 0\) and \(\beta _2^*\ne 0\) as \(d\ne 0\). It is \(0\ne p_i - p_j = \beta _1^*(x_i - \alpha _1^*) - \beta _1^*(x_j - \alpha _1^*) = \beta _1^*(x_i - x_j)\), thus \(x_i\ne x_j\) holds. Moreover, \(0 = p_i - p_j + p_j - p_i = (\beta _1^* - \beta _2^*)(x_i - x_j)\), hence \(\beta _1^* = \beta _2^*\). Finally, we have that \(0 = p_i - p_i = \beta _1^*(x_i - \alpha _1^*) - \beta _2^*(x_i - \alpha _2^*) = \beta _1^*(\alpha _2^* - \alpha _1^*)\), which yields \(\alpha _1^* = \alpha _2^*\), and hence the representation is unique.
Existence (constructive): Let \(H := \{a\in \mathbb {R}^n \vert e^Ta = \lambda _1\}\) be the hyperplane on which all points in the non-negative orthant have an \(L_1\) norm of \(\lambda _1\) and let \(C \!:=\! \mathbb {R}_{\ge 0}^n\!\cap \! H\) be a scaled canonical simplex. Further, let \(L \!:=\! \{q\!\in \! H \vert \left\| q\right\| _2 \!=\! \lambda _2\}\) be a circle on \(H\), and for an arbitrary index set \(I\subseteq \{1,\ldots ,n\}\) let \(L_I := \{a\in L \vert a_i = 0\text { for all }i\not \in I\}\) be a subset of \(L\) where all coordinates not indexed by \(I\) vanish. With Theorem 2 and Appendix D from Thom and Palm (2013) there exists a finite sequence of index sets \(I_1,\ldots ,I_h\subseteq \{1,\ldots ,n\}\) with \(I_j\supsetneq I_{j+1}\) for \(j\in \{1,\ldots ,h-1\}\) such that \({{\mathrm{proj}}}_T(x)\) is the result of the finite sequence \(r(0) := {{\mathrm{proj}}}_H(x)\), \(s(0) := {{\mathrm{proj}}}_L(r(0))\), \(r(j) := {{\mathrm{proj}}}_C(s(j - 1))\) and \(s(j) := {{\mathrm{proj}}}_{L_{I_j}}(r(j))\) for \(j\in \{1,\ldots ,h\}\), that is \(p = s(h)\).
All intermediate projections yield unique results because \(p\) was restricted to be unique. The index sets contain the indices of the entries that survive the projections onto \(C\), \(I_j := \{i\in \{1,\ldots ,n\} \vert r_i(j) \ne 0\}\) for \(j\in \{1,\ldots ,h\}\). In other words, \(p\) can be computed from \(x\) by alternating projections, where the sets \(L\) and \(L_{I_j}\) are non-convex for all \(j\in \{1,\ldots ,h\}\). The expressions for the individual projections are given in Lemma 13, Lemma 17, Proposition 24, and Lemma 30, respectively, in Thom and Palm (2013).
Let \(I_0 := \{1,\ldots ,n\}\) for completeness, then we can define the following constants for \(j\in \{0,\ldots ,h\}\). Let \(d_j := \left| I_j \right| \) be the number of relevant coordinates in each iteration, and let \(\alpha _j := \frac{1}{d_j}\left( \left\| V_{I_j} x\right\| _1 - \lambda _1\sqrt{\frac{d_j\left\| V_{I_j} x\right\| _2^2 - \left\| V_{I_j} x\right\| _1^2}{d_j\lambda _2^2 - \lambda _1^2}}\right) \) and \(\beta _j := \sqrt{\frac{d_j\lambda _2^2 - \lambda _1^2}{d_j\left\| V_{I_j} x\right\| _2^2 - \left\| V_{I_j} x\right\| _1^2}}\) be real numbers. We have that \(d_j\lambda _2^2 - \lambda _1^2 \ge d_h\lambda _2^2 - \lambda _1^2 \ge 0\) by construction which implies \(\beta _j > 0\) for all \(j\in \{0,\ldots ,h\}\). We now claim that the following holds:
(a) \(s_i(j) = \beta _j\cdot (x_i - \alpha _j)\) for all \(i\in I_j\) for all \(j\in \{0,\ldots ,h\}\).
(b) \(\alpha _0 \le \cdots \le \alpha _h\) and \(\beta _0 \le \cdots \le \beta _h\).
(c) \(x_i\le \alpha _j\) for all \(i\not \in I_j\) for all \(j\in \{0,\ldots ,h\}\).
(d) \(p = \beta _h\cdot \max \left( x - \alpha _h\cdot e,\ 0\right) \).
We start by showing (a) with induction. For \(j = 0\), we have using Lemma 13 from Thom and Palm (2013). With Lemma 17 stated in Thom and Palm (2013), we have \(s(0) = \delta r(0) + (1 - \delta ) m\) with and . We see that and therefore \(\delta = \beta _0\), and thus , so the claim holds for the base case.
Suppose that (a) holds for \(j\) and we want to show it also holds for \(j + 1\). It is \(r(j+1) = {{\mathrm{proj}}}_C(s(j))\) by definition, and Proposition 31 in Thom and Palm (2013) implies \(r(j+1) = \max \left( s(j) - \hat{t}\cdot e,\ 0\right) \) where \(\hat{t}\in \mathbb {R}\) can be expressed explicitly as , which is the mean value of the entries that survive the simplex projection up to an additive constant. We note that \(\hat{t}\) is here always non-negative, see Lemma 28(a) in Thom and Palm (2013), which we will need to show (b). Since \(I_{j+1}\subsetneq I_j\) we yield \(s_i(j) = \beta _j\cdot (x_i - \alpha _j)\) for all \(i\in I_{j+1}\) with the induction hypothesis, and therefore we have that . We find that \(r_i(j+1) > 0\) for \(i\in I_{j+1}\) by definition, and we can omit the rectifier so that \(r_i(j+1) = s_i(j) - \hat{t}\). Using the induction hypothesis and the expression for \(\hat{t}\) we have . For projecting onto \(L_{I_{j+1}}\), the distance between \(r(j+1)\) and is required for computation of , so that Lemma 30 from Thom and Palm (2013) can be applied. We have that , and further . Now let \(i\in I_{j+1}\) be an index, then we have using Lemma 30 from Thom and Palm (2013). Therefore (a) holds for all \(j\in \{0,\ldots ,h\}\).
Let us now turn to (b). From the last paragraph, we know that for all \(j\in \{0,\ldots ,h-1\}\) for the projections onto \(L_{I_{j+1}}\). On the other hand, we have that from the proof of Lemma 30(a) from Thom and Palm (2013), and \(\Vert r(j+1)\Vert _2^2 \le \lambda _2^2\) holds from the proof of Lemma 28(f) in Thom and Palm (2013), so \(\delta \ge 1\) which implies \(\beta _0 \le \cdots \le \beta _h\). As noted above, the separator for projecting onto \(C\) satisfies \(\hat{t}\ge 0\) for all \(j\in \{0,\ldots ,h-1\}\). By rearranging this inequality and using \(\beta _j\le \beta _{j+1}\), we conclude that , hence \(\alpha _0 \le \cdots \le \alpha _h\).
For (c) we want to show that the coordinates in the original vector \(x\) which will vanish in some iteration when projecting onto \(C\) are already small. The base case \(j = 0\) of an induction for \(j\) is trivial since the complement of \(I_0\) is empty. In the induction step, we note that the complement of \(I_{j+1}\) can be partitioned into \(I_{j+1}^C= I_j^C\cup \big (I_j\cap I_{j+1}^C\big )\) since \(I_{j+1}\subsetneq I_j\). For \(i\in I_j^C\) we already know that \(x_i\le \alpha _j\le \alpha _{j+1}\) by the induction hypothesis and (b). We have shown in (a) that \(s_i(j) = \beta _j\cdot (x_i - \alpha _j)\) for all \(i\in I_j\), and if \(i\in I_j\backslash I_{j+1}\) then \(s_i(j) \le \hat{t}\) since \(0 = r_i(j+1) = \max \left( s_i(j) - \hat{t},\ 0\right) \). By substituting the explicit expression for \(\hat{t}\) and solving for \(x_i\) we yield , and hence the claim holds for all \(i\not \in I_{j+1}\).
If we can now show that (d) holds, then the claim of the theorem follows by setting \(\alpha ^* := \alpha _h\) and \(\beta ^* \!:=\! \beta _h\). We note that by construction \(p \!=\! s(h)\) and \(s_i(h) \!\ge \! 0\) for all coordinates \(i\!\in \!\{1,\ldots ,n\}\). When \(i\!\in \! I_h\), then \(s_i(h) \!=\! \beta _h\cdot (x_i - \alpha _h)\) with (a), which is positive by requirement, so when the rectifier is applied nothing changes. If \(i\not \in I_h\) then \(x_i - \alpha _h \le 0\) by (c), and indeed \(\beta _h\cdot \max \left( x_i - \alpha _h,\ 0\right) = 0 = s_i(h)\). The expression therefore holds for all \(i\in \{1,\ldots ,n\}\), which completes the constructive proof of existence.
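To make the alternating-projection route concrete, the following rough sketch (assuming NumPy) illustrates the scheme: it projects onto \(H\) and \(L\) once and then alternates between the scaled simplex \(C\) and the circles \(L_I\) until a non-negative point, and hence a point of \(T\), is reached. It is an illustration of the idea, not the exact procedure from Thom and Palm (2013), and it ignores degenerate inputs.

```python
import numpy as np

def proj_simplex(v, z):
    """Euclidean projection onto {a >= 0 : sum(a) = z}, sort-based (O(n log n))."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, v.size + 1) > css - z)[0][-1]
    theta = (css[k] - z) / (k + 1.0)
    return np.maximum(v - theta, 0.0)

def proj_circle(v, lam1, lam2):
    """Projection onto {a : sum(a) = lam1, ||a||_2 = lam2}, assuming sum(v) = lam1."""
    d = v.size
    m = np.full(d, lam1 / d)                           # barycentre of the simplex face
    radius = np.sqrt(lam2**2 - lam1**2 / d)
    return m + radius * (v - m) / np.linalg.norm(v - m)

def alternating_projections(x, lam1, lam2, max_iter=1000):
    """Illustration of the alternating scheme: H, L, then C and L_I until s >= 0."""
    n = x.size
    s = proj_circle(x + (lam1 - x.sum()) / n, lam1, lam2)   # r(0) on H, then s(0) on L
    for _ in range(max_iter):
        if (s >= 0.0).all():
            return s                                        # s lies in T
        r = proj_simplex(s, lam1)                           # r(j) on C zeroes out entries
        support = r > 0.0
        s = np.zeros(n)
        s[support] = proj_circle(r[support], lam1, lam2)    # s(j) on the circle L_I
    return s
```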
Existence (implicit): Existence of the projection is guaranteed by the Weierstraß extreme value theorem since \(T\) is compact. Now let \(f:\mathbb {R}^n\rightarrow \mathbb {R}\), \(s\mapsto \Vert s - x\Vert _2^2\), be the objective function, and let the constraints be represented by the functions \(h_1:\mathbb {R}^n\rightarrow \mathbb {R}\), \(s\mapsto e^Ts - \lambda _1\), \(h_2:\mathbb {R}^n\rightarrow \mathbb {R}\), \(s\mapsto \Vert s\Vert _2^2 - \lambda _2^2\), and \(g_i:\mathbb {R}^n\rightarrow \mathbb {R}\), \(s\mapsto -s_i\), for all indices \(i\in \{1,\ldots ,n\}\). All these functions are continuously differentiable. If \(p = {{\mathrm{proj}}}_T(x)\), then \(p\) is a local minimum of \(f\) subject to \(h_1(p) = 0\), \(h_2(p) = 0\) and \(g_1(p) \le 0, \ldots , g_n(p) \le 0\).
For application of the method of Lagrange multipliers we first have to show that \(p\) is regular, which means that the gradients of \(h_1\), \(h_2\) and \(g_i\) for \(i\not \in I\) evaluated in \(p\) must be linearly independent (Bertsekas 1999, Section 3.3.1). Let \(J := I^C= \{j_1,\ldots ,j_{n - d}\}\), say, then \(\left| J \right| \le n - 2\) since \(d\ge 2\). Hence we have at most \(n\) vectors from \(\mathbb {R}^n\) for which we have to show linear independence. Clearly \(h_1'(s) = e\), \(h_2'(s) = 2s\) and \(g_i'(s) = -e_i\) for all \(i\in \{1,\ldots ,n\}\). Now let \(u_1,u_2\in \mathbb {R}\) and \(v\in \mathbb {R}^{n-d}\) with \(u_1 e + 2u_2p - \sum _{\mu =1}^{n - d} v_\mu e_{j_\mu } = 0\in \mathbb {R}^n\). Then, let \(\mu \in \{1,\ldots ,n-d\}\), then \(p_{j_\mu } = 0\) by definition of \(I\) and hence by pre-multiplication of the equation above with \(e_{j_\mu }^T\) we yield \(u_1 - v_\mu = 0\in \mathbb {R}\). Therefore \(u_1 = v_\mu \) for all \(\mu \in \{1,\ldots ,n-d\}\). On the other hand, if \(i\in I\) then \(p_i > 0\) and \(e_i^Te_{j_\mu } = 0\) for all \(\mu \in \{1,\ldots ,n-d\}\). Hence \(u_1 + 2u_2 p_i = 0\in \mathbb {R}\) for all \(i\in I\). In the first paragraph of the uniqueness proof we have shown there are two distinct indices \(i,j\in I\) with \(p_i\ne p_j\). Since \(u_1 + 2u_2 p_i = 0 = u_1 + 2u_2 p_j\) and thus \(0 = 2u_2(p_i - p_j)\) we can conclude that \(u_2 = 0\), which implies \(u_1 = 0\). Then \(v_1 = \dots = v_{n-d} = 0\) as well, which shows that \(p\) is regular.
The Lagrangian is \(\fancyscript{L}:\mathbb {R}^n\times \mathbb {R}\times \mathbb {R}\times \mathbb {R}^n\rightarrow \mathbb {R}\), \((s,\ \alpha ,\ \beta ,\ \gamma )\mapsto f(s) + \alpha h_1(s) + \beta h_2(s) + \sum _{i=1}^n \gamma _i g_i(s)\), and its derivative with respect to its first argument \(s\) is given by \(\fancyscript{L}'(s,\ \alpha ,\ \beta ,\ \gamma ) = 2(s - x) + \alpha \cdot e + 2\beta \cdot s - \gamma \). Now, Proposition 3.3.1 from Bertsekas (1999) guarantees the existence of Lagrange multipliers \(\tilde{\alpha },\tilde{\beta }\in \mathbb {R}\) and \(\tilde{\gamma }\in \mathbb {R}^n\) with \(\fancyscript{L}'(p,\ \tilde{\alpha },\ \tilde{\beta },\ \tilde{\gamma }) = 0\), \(\tilde{\gamma }_i \ge 0\) for all \(i\in \{1,\ldots ,n\}\) and \(\tilde{\gamma }_i = 0\) for \(i\in I\). Assume \(\tilde{\beta } = -1\), then \(2x = \tilde{\alpha }\cdot e - \tilde{\gamma }\) since the derivative of \(\fancyscript{L}\) must vanish. Hence \(x_i = \frac{\tilde{\alpha }}{2}\) for all \(i\in I\), and therefore \(\{p_i \vert i\in I\}\) is a singleton with Remark 10 from Thom and Palm (2013) as \(p\) was assumed unique and \(T\) is permutation-invariant. We have seen earlier that this is absurd, so \(\tilde{\beta } \ne -1\) must hold.
Write \(\alpha ^* := \frac{\tilde{\alpha }}{2}\), \(\beta ^* := \frac{1}{1 + \tilde{\beta }}\) and \(\gamma ^* := \frac{\tilde{\gamma }}{2}\) for notational convenience. We then obtain \(p = \beta ^*(x - \alpha ^*\cdot e + \gamma ^*)\) because \(\fancyscript{L}'\) vanishes. Then \(h_1(p) = 0\) implies that \(\lambda _1 = \sum _{i\in I}p_i = \sum _{i\in I}\beta ^*(x_i - \alpha ^*) = \beta ^*(\Vert V_I x\Vert _1 - d\alpha ^*)\), and with \(h_2(p) = 0\) follows that \(\lambda _2^2 = \sum _{i\in I}p_i^2 = (\beta ^*)^2\cdot (\Vert V_I x\Vert _2^2 - 2\alpha ^*\Vert V_I x\Vert _1 + d\cdot (\alpha ^*)^2)\). By taking the ratio and after elementary algebraic transformations we arrive at the quadratic equation \(a\cdot (\alpha ^*)^2 + b\cdot \alpha ^* + c = 0\), where \(a := d\left( \lambda _1^2 - d\lambda _2^2\right) \), \(b := 2\Vert V_I x\Vert _1\left( d\lambda _2^2 - \lambda _1^2\right) \) and \(c := \lambda _1^2\Vert V_I x\Vert _2^2 - \lambda _2^2\Vert V_I x\Vert _1^2\) are reals. The discriminant is \(\Delta = b^2 - 4ac = 4\lambda _1^2\left( d\lambda _2^2 - \lambda _1^2\right) \left( d\Vert V_I x\Vert _2^2 - \Vert V_I x\Vert _1^2\right) \). Since \(V_I x\in \mathbb {R}^d\) we have that \(d\Vert V_I x\Vert _2^2 - \Vert V_I x\Vert _1^2 \ge 0\). Moreover, the number \(d\) is not arbitrary. As \(p\) exists by the Weierstraß extreme value theorem with \(\Vert p\Vert _0 = d\), \(\Vert p\Vert _1 = \lambda _1\) and \(\Vert p\Vert _2 = \lambda _2\), we find that \(\lambda _1 \le \sqrt{d}\lambda _2\) and hence \(\Delta \ge 0\), so there must be a real solution to the equation above. Solving the equation shows that \(\alpha ^* = \frac{1}{d}\left( \Vert V_I x\Vert _1 \pm \lambda _1\sqrt{\frac{d\Vert V_I x\Vert _2^2 - \Vert V_I x\Vert _1^2}{d\lambda _2^2 - \lambda _1^2}}\right) \),
hence from \(h_1(p) = 0\) we obtain \(\beta ^* = \frac{\lambda _1}{\Vert V_I x\Vert _1 - d\alpha ^*} = \mp \sqrt{\frac{d\lambda _2^2 - \lambda _1^2}{d\Vert V_I x\Vert _2^2 - \Vert V_I x\Vert _1^2}}\).
Suppose \(\alpha ^*\) is the number that arises from the “\(+\)” before the square root, then \(\beta ^*\) is the number with the “\(-\)” sign, thus \(\beta ^* < 0\). We have seen earlier that there are two distinct indices \(i,j\in I\) with \(p_i\ne p_j\). We can assume \(p_i > p_j\), then \(0 < p_i - p_j = \beta ^*(x_i - x_j)\) which implies that \(x_i < x_j\). This is not possible as it violates the order-preservation property of projections onto permutation-invariant sets (Lemma 9(a) from Thom and Palm 2013). Thus our choice of \(\alpha ^*\) was not correct in the first place, and \(\alpha ^*\) must be as stated in the claim.
It remains to be shown that \(p\) is the result of a soft-shrinkage function. If \(i\in I\), then \(0 < p_i = \beta ^*(x_i - \alpha ^*)\), and \(\beta ^* > 0\) shows \(x_i > \alpha ^*\) such that \(p_i = \beta ^*\cdot \max \left( x_i - \alpha ^*,\ 0\right) \). When \(i\not \in I\), we have \(0 = p_i = \beta ^*(x_i - \alpha ^* + \gamma _i^*)\) where \(\gamma _i^* \ge 0\) and still \(\beta ^* > 0\), thus \(x_i \le \alpha ^*\) and \(p_i = \beta ^*\cdot \max \left( x_i - \alpha ^*,\ 0\right) \) holds. Therefore, the representation holds for all entries. \(\square \)
Finding the set \(I\) containing the indices of the positive coordinates of the projection result is the key for algorithmic computation of the projection. Based on the constructive proof this could, for example, be achieved by carrying out alternating projections whose run-time complexity is between quasi-linear and quadratic in the problem dimensionality \(n\) and whose space complexity is linear. An alternative is the method proposed by Potluru et al. (2013), where the input vector is sorted and then each possible candidate for \(I\) is checked. Due to the sorting, \(I\) must be of the form \(I = \{1,\ldots ,d\}\), where now only \(d\) is unknown (see also the proof of Theorem 3 from Thom and Palm 2013). Here, the run-time complexity is quasi-linear and the space complexity is linear in the problem dimensionality since also the sorting permutation has to be stored. When \(n\) gets large, algorithms with a smaller computational complexity are mandatory.
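For reference, the sorting-based strategy can be sketched as follows (assuming NumPy; this is a simplified illustration rather than the exact routine of Potluru et al. (2013), and degenerate inputs are ignored): sort the input, and for each candidate support size \(d\) evaluate the closed-form expressions of Theorem 1 until the candidate is consistent.

```python
import numpy as np

def project_sorting(x, lam1, lam2):
    """Quasi-linear sparseness projection: sort, then scan the candidate supports."""
    n = x.size
    u = np.sort(x)[::-1]                     # candidate supports are the d largest entries
    s1 = np.cumsum(u)                        # ||V_I x||_1 for d = 1, ..., n
    s2 = np.cumsum(u**2)                     # ||V_I x||_2^2 for d = 1, ..., n
    for d in range(2, n + 1):
        a = d * s2[d - 1] - s1[d - 1]**2
        b = d * lam2**2 - lam1**2
        if a <= 0.0 or b <= 0.0:
            continue
        alpha = (s1[d - 1] - lam1 * np.sqrt(a / b)) / d     # offset from Theorem 1
        # the candidate is consistent if exactly the d largest entries exceed alpha
        if u[d - 1] > alpha and (d == n or u[d] <= alpha):
            beta = np.sqrt(b / a)                           # scale from Theorem 1
            return beta * np.maximum(x - alpha, 0.0)
    raise RuntimeError("no consistent support found")
```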
1.2 Properties of the Auxiliary Function
We have already informally introduced the auxiliary function in Sect. 2.2. Here is a precise definition:
Definition 2
Let \(x\in \mathbb {R}_{\ge 0}^n\backslash T\) be a point such that \({{\mathrm{proj}}}_T(x)\) is unique and \(\sigma (x) < \sigma ^*\). Let \(x_{\max }\) denote the maximum entry of \(x\), then \(\varPsi :\left[ 0,\ x_{\max }\right) \rightarrow \mathbb {R}\), \(\alpha \mapsto \frac{\left\| \max \left( x - \alpha \cdot e,\ 0\right) \right\| _1}{\left\| \max \left( x - \alpha \cdot e,\ 0\right) \right\| _2} - \frac{\lambda _1}{\lambda _2}\),
is called auxiliary function for the projection onto \(T\).
We call \(\varPsi \) well-defined if all requirements from the definition are met. Note that the situation where \(\sigma (x) \ge \sigma ^*\) is trivial, because in this sparseness-decreasing setup all coordinates of the projection must be positive. Hence \(I = \{1,\ldots ,n\}\) in Theorem 1, and \(\alpha ^*\) can be computed with the formula provided there.
We need more notation to describe the properties of \(\varPsi \). Let \(x\in \mathbb {R}^n\) be a point. We write \(\fancyscript{X}:= \{x_i \vert i\in \{1,\ldots ,n\}\!\}\) for the set that contains the entries of \(x\). Let \(x_{\min } := \min \fancyscript{X}\) be short for the smallest entry of \(x\), and \(x_{\max } := \max \fancyscript{X}\) and \(x_{\mathrm{2nd-max }} := \max \fancyscript{X}\backslash \{x_{\max }\}\) denote the two largest entries of \(x\). Further, \(q:\mathbb {R}\rightarrow \mathbb {R}^n\), \(\alpha \mapsto \max \left( x - \alpha \cdot e,\ 0\right) \), denotes the curve that evolves from application of the soft-shrinkage function to \(x\). The Manhattan norm and Euclidean norm of points from \(q\) are given by \(\ell _1:\mathbb {R}\rightarrow \mathbb {R}\), \(\alpha \mapsto \left\| q(\alpha )\right\| _1\), and \(\ell _2:\mathbb {R}\rightarrow \mathbb {R}\), \(\alpha \mapsto \left\| q(\alpha )\right\| _2\), respectively. Therefore, \(\varPsi (\alpha ) = \frac{\ell _1(\alpha )}{\ell _2(\alpha )} - \frac{\lambda _1}{\lambda _2}\), and \(\tilde{\varPsi }\) from Sect. 4.1 can be written as a composition involving \(\varPsi \), such that its derivative can be expressed in terms of \(\varPsi '\) using the chain rule.
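In this notation, \(\varPsi \) and its piecewise derivative can be evaluated directly for \(\alpha \in \left[ 0,\ x_{\max }\right) \); a minimal sketch (assuming NumPy, with illustrative function names):

```python
import numpy as np

def q_curve(x, alpha):
    """The soft-shrinkage curve alpha -> max(x - alpha * e, 0)."""
    return np.maximum(x - alpha, 0.0)

def psi(x, alpha, lam1, lam2):
    """Auxiliary function: L1/L2 ratio of q(alpha) minus the target ratio lam1/lam2."""
    q = q_curve(x, alpha)
    return np.linalg.norm(q, 1) / np.linalg.norm(q, 2) - lam1 / lam2

def psi_prime(x, alpha):
    """Derivative of psi, valid between two adjacent entries of x."""
    q = q_curve(x, alpha)
    d = np.count_nonzero(q)
    l1, l2 = np.linalg.norm(q, 1), np.linalg.norm(q, 2)
    return (l1**2 - d * l2**2) / l2**3
```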
The next result provides statements on the auxiliary function’s analytical nature and links its zero with the projection onto \(T\):
Lemma 3
Let \(x\in \mathbb {R}_{\ge 0}^n\backslash T\) be given such that the auxiliary function \(\varPsi \) is well-defined. Then the following holds:
(a) \(\varPsi \) is continuous on \(\left[ 0,\ x_{\max }\right) \).
(b) \(\varPsi \) is differentiable on \(\left[ 0,\ x_{\max }\right) \backslash \fancyscript{X}\).
(c) \(\varPsi \) is strictly decreasing on \(\left[ 0,\ x_{\mathrm{2nd-max }}\right) \), and on \(\left[ x_{\mathrm{2nd-max }},\ x_{\max }\right) \) it is constant.
(d) There is exactly one \(\alpha ^*\in \left( 0,\ x_{\mathrm{2nd-max }}\right) \) with \(\varPsi (\alpha ^*) = 0\). It is then \({{\mathrm{proj}}}_T(x) = \frac{\lambda _2}{\ell _2(\alpha ^*)}\cdot q(\alpha ^*)\).
Proof
In addition to the original claim, we also give explicit expressions for the derivative of \(\varPsi \) and higher derivatives thereof in part (c). These are necessary to show that \(\varPsi \) is strictly decreasing and constant, respectively, on the claimed intervals and for the explicit implementation of Algorithm 2.
(a) The function \(q\) is continuous because so is the soft-shrinkage function. Hence \(\ell _1\), \(\ell _2\) and \(\varPsi \) are continuous as compositions of continuous functions.
(b) The soft-shrinkage function is differentiable everywhere except at its offset. Therefore, \(\varPsi \) is differentiable everywhere except for when its argument coincides with an entry of \(x\), that is on \(\left[ 0,\ x_{\max }\right) \backslash \fancyscript{X}\).
(c) We start with deducing the first and second derivative of \(\varPsi \). Let \(x_j,x_k\in \fancyscript{X}\cup \{0\}\), \(x_j < x_k\), such that there is no element from \(\fancyscript{X}\) between them. We here allow \(x_j = 0\) and \(x_k = x_{\min }\) when \(0\not \in \fancyscript{X}\) for completeness. Then the index set \(I := \{i\in \{1,\ldots ,n\} \vert x_i > \alpha \}\) of non-vanishing coordinates in \(q\) is constant for \(\alpha \in \left( x_j,\ x_k\right) \), and the derivative of \(\varPsi \) can be computed using a closed-form expression. For this, let \(d := \left| I \right| \) denote the number of non-vanishing coordinates in \(q\) on that interval. With \(\ell _1(\alpha ) = \sum _{i\in I}\left( x_i - \alpha \right) = \sum _{i\in I}x_i - d\alpha \) we obtain \(\ell _1'(\alpha ) = -d\). Analogously, it is \(\ell _2^2(\alpha ) = \sum _{i\in I}\left( x_i - \alpha \right) ^2\), and the chain rule yields \(\ell _2'(\alpha ) = -\frac{\ell _1(\alpha )}{\ell _2(\alpha )}\). Application of the quotient rule gives \(\varPsi '(\alpha ) = \frac{\ell _1'(\alpha )\ell _2(\alpha ) - \ell _1(\alpha )\ell _2'(\alpha )}{\ell _2^2(\alpha )} = \frac{\ell _1^2(\alpha ) - d\ell _2^2(\alpha )}{\ell _2^3(\alpha )}\). The second derivative is of similar form. We find \(\ell _1''(\alpha ) = 0\), and hence \(\ell _2''(\alpha ) = \frac{d\ell _2^2(\alpha ) - \ell _1^2(\alpha )}{\ell _2^3(\alpha )}\). We have \(\varPsi '(\alpha ) = \frac{\ell _1^2(\alpha )}{\ell _2^3(\alpha )} - \frac{d}{\ell _2(\alpha )}\) and with the product rule we see that \(\varPsi ''(\alpha ) = \frac{3\ell _1(\alpha )\left( \ell _1^2(\alpha ) - d\ell _2^2(\alpha )\right) }{\ell _2^5(\alpha )}\).
First let \(\alpha \in \left( x_{\mathrm{2nd-max }},\ x_{\max }\right) \). It is then \(d = 1\), that is \(q\) has exactly one non-vanishing coordinate. In this situation we find \(\ell _1(\alpha ) = \ell _2(\alpha )\) and \(\varPsi '\equiv 0\) on \(\left( x_{\mathrm{2nd-max }},\ x_{\max }\right) \), thus \(\varPsi \) is constant on \(\left( x_{\mathrm{2nd-max }},\ x_{\max }\right) \) as a consequence of the mean value theorem from real analysis. Because \(\varPsi \) is continuous, it is constant even on \(\left[ x_{\mathrm{2nd-max }},\ x_{\max }\right) \).
Next let \(\alpha \in \left[ 0,\ x_{\mathrm{2nd-max }}\right) \backslash \fancyscript{X}\), and let \(x_j\), \(x_k\), \(I\) and \(d\) as in the first paragraph. We have \(d\ge 2\) since \(\alpha < x_{\mathrm{2nd-max }}\). It is furthermore \(\ell _1(\alpha ) \le \sqrt{d}\ell _2(\alpha )\) as \(d = \Vert q(\alpha )\Vert _0\). This inequality is in fact strict, because \(q(\alpha )\) has at least two distinct nonzero entries. Hence \(\varPsi '\) is negative on the interval \(\left( x_j,\ x_k\right) \), and the mean value theorem guarantees that \(\varPsi \) is strictly decreasing on this interval. This property holds for the entire interval \(\left[ 0,\ x_{\mathrm{2nd-max }}\right) \) due to the continuity of \(\varPsi \).
(d) The requirement \(\sigma (x) < \sigma ^*\) implies \(\varPsi (0) > 0\). Let \(\alpha \in \left( x_{\mathrm{2nd-max }},\ x_{\max }\right) \) be arbitrary, then \(\ell _1(\alpha ) = \ell _2(\alpha )\) as in (c), and hence \(\varPsi (\alpha ) < 0\) since \(\lambda _2 < \lambda _1\) must hold. The existence of \(\alpha ^*\in \left[ 0,\ x_{\mathrm{2nd-max }}\right) \) with \(\varPsi (\alpha ^*) = 0\) then follows from the intermediate value theorem and (c). Uniqueness of \(\alpha ^*\) is guaranteed because \(\varPsi \) is strictly monotone on the relevant interval.
Define \(p := {{\mathrm{proj}}}_T(x)\), then with Theorem 1 there is an \(\tilde{\alpha }\in \mathbb {R}\) so that \(p = \frac{\lambda _2}{\ell _2(\tilde{\alpha })}\cdot q(\tilde{\alpha })\). Since \(p\in T\) we obtain \(\varPsi (\tilde{\alpha }) = 0\), and the uniqueness of the zero of \(\varPsi \) implies that \(\alpha ^* = \tilde{\alpha }\). \(\square \)
As described in Sect. 2.3, the exact value of the zero \(\alpha ^*\) of \(\varPsi \) can be found by inspecting the neighboring entries in \(x\) of a candidate offset \(\alpha \). Let \(x_j,x_k\in \fancyscript{X}\) be these neighbors with \(x_j\le \alpha < x_k\) such that there is no element from \(\fancyscript{X}\) between \(x_j\) and \(x_k\). When \(\varPsi \) changes its sign from \(x_j\) to \(x_k\), we know that its root must be located within this interval. Further, we then know that all coordinates with a value greater than \(x_j\) must survive the sparseness projection, which yields the set \(I\) from Theorem 1 and thus the explicit representation of the projection. The next result gathers these ideas and shows that it is easy to verify whether a change of sign in \(\varPsi \) is on hand.
Lemma 4
Let \(x\in \mathbb {R}_{\ge 0}^n\backslash T\) be given such that \(\varPsi \) is well-defined and let \(\alpha \in \left[ 0,\ x_{\max }\right) \) be arbitrary. If \(\alpha < x_{\min }\), let \(x_j := 0\) and \(x_k := x_{\min }\). Otherwise, let \(x_j := \max \{x_i \vert x_i\in \fancyscript{X}\text { and }x_i\le \alpha \}\) be the left neighbor and let \(x_k := \min \{x_i \vert x_i\in \fancyscript{X}\text { and }x_i > \alpha \}\) be the right neighbor of \(\alpha \). Both exist as the sets where the maximum and the minimum are taken are nonempty. Define \(I := \{i\in \{1,\ldots ,n\} \vert x_i > \alpha \}\) and \(d := \left| I \right| \). Then:
(a) When \(\varPsi (x_j)\ge 0\) and \(\varPsi (x_k) < 0\) then there is exactly one number \(\alpha ^*\in \left[ x_j,\ x_k\right) \) with \(\varPsi (\alpha ^*) = 0\).
(b) It is \(\ell _1(\xi ) = \left\| V_I x\right\| _1 - d\xi \) and \(\ell _2^2(\xi ) = \left\| V_I x\right\| _2^2 - 2\xi \left\| V_I x\right\| _1 + d\xi ^2\) for \(\xi \in \{x_j,\ \alpha ,\ x_k\}\).
(c) If the inequalities \(\lambda _2\ell _1(x_j) \ge \lambda _1\ell _2(x_j)\) and \(\lambda _2\ell _1(x_k) < \lambda _1\ell _2(x_k)\) are satisfied and \(p := {{\mathrm{proj}}}_T(x)\) denotes the projection of \(x\) onto \(T\), then \(I = \{i\in \{1,\ldots ,n\} \vert p_i > 0\}\) and hence \(p\) can be computed exactly with Theorem 1.
Proof
The claim in (a) is obvious with Lemma 3.
(b) We find that \(\ell _1(\alpha ) = \sum _{i\in I}(x_i - \alpha ) = \Vert V_I x\Vert _1 - d\alpha \) and \(\ell _2(\alpha )^2 = \sum _{i\in I}(x_i - \alpha )^2 = \Vert V_I x\Vert _2^2 - 2\alpha \Vert V_I x\Vert _1 + d\alpha ^2\).
We have \(K = I \backslash \tilde{K}\) with \(K := \{i\in \{1,\ldots ,n\} \vert x_i > x_k\}\) and \(\tilde{K} := \{i\in \{1,\ldots ,n\} \vert x_i = x_k\}\). One yields that \(\ell _1(x_k) = \sum _{i\in K}(x_i - x_k) = \sum _{i\in I}(x_i - x_k) - \sum _{i\in \tilde{K}}(x_i - x_k) = \Vert V_I x\Vert _1 - d x_k\). The claim for \(\ell _2(x_k)^2\) follows analogously.
Likewise \(I = J\backslash \tilde{J}\) with \(J := \{i\in \{1,\ldots ,n\} \vert x_i > x_j\}\) and \(\tilde{J} := \{i\in \{1,\ldots ,n\} \vert x_i = x_j\}\), hence follows \(\ell _1(x_j) = \sum _{i\in J}(x_i - x_j) = \sum _{i\in I}(x_i - x_j) + \sum _{i\in \tilde{J}}(x_i - x_j) = \Vert V_I x\Vert _1 - d x_j\). The value of \(\ell _2(x_j)^2\) can be computed in the same manner.
(c) The condition in the claim is equivalent to the case of \(\varPsi (x_j) \ge 0\) and \(\varPsi (x_k) < 0\), hence with (a) there is a number \(\alpha ^*\in \left[ x_j,\ x_k\right) \) with \(\varPsi (\alpha ^*) = 0\). Note that \(\alpha \ne \alpha ^*\) in general. Write \(p := {{\mathrm{proj}}}_T(x)\) and let \(J := \{i\in \{1,\ldots ,n\} \vert p_i > 0\}\). With Theorem 1 follows that \(i\in J\) if and only if \(x_i > \alpha ^*\). But this is equivalent to \(x_i > x_j\), which in turn is equivalent to \(x_i > \alpha \), therefore \(I = J\) must hold. Thus we already had the correct set of non-vanishing coordinates of the projection in the first place, and \(\alpha ^*\) and \(\beta ^*\) can be computed exactly using the formula from the claim of Theorem 1, which yields the projection \(p\). \(\square \)
1.3 Proof of Correctness of Projection Algorithm
In Sect. 2.3, we informally described our proposed algorithm for carrying out sparseness-enforcing projections, and provided a simplified flowchart in Fig. 3. After the previous theoretical considerations, we now propose and prove the correctness of a formalized algorithm for the projection problem. Here, the overall method is split into an algorithm that evaluates the auxiliary function \(\varPsi \) and, based on its derivative, returns additional information required for finding its zero (Algorithm 2). The other part, Algorithm 3, implements the root-finding procedure and carries out the necessary computations to yield the result of the projection. It furthermore returns information that will be required for computation of the projection’s gradient, which we will discuss below.
Theorem 5
Let \(x\in \mathbb {R}_{\ge 0}^n\) and \(p := {{\mathrm{proj}}}_T(x)\) be unique. Then Algorithm 3 computes \(p\) in a number of operations linear in the problem dimensionality \(n\) and with only constant additional space.
Proof
We start by analyzing Algorithm 2, which evaluates \(\varPsi \) at any given position \(\alpha \). The output includes the values of the auxiliary function, its first and second derivative, and the value of the transformed auxiliary function and its derivative. There is moreover a Boolean value which indicates whether the interval with the sign change of \(\varPsi \) has been found, and three additional numbers required to compute the zero \(\alpha ^*\) of \(\varPsi \) as soon as the correct interval has been found.
Let \(I := \{i\in \{1,\ldots ,n\} \vert x_i > \alpha \}\) denote the indices of all entries in \(x\) larger than \(\alpha \). In the blocks from Line 1 to Line 11, the algorithm scans through all the coordinates of \(x\). It identifies the elements of \(I\), and computes numbers \(\ell _1\), \(\ell _2^2\), \(d\), \(x_j\) and \(x_k\) on the fly. After Line 11, we clearly have that \(\ell _1 = \left\| V_I x\right\| _1\), \(\ell _2^2 = \left\| V_I x\right\| _2^2\) and \(d = \left| I \right| \). Additionally, \(x_j\) and \(x_k\) are the left and right neighbors, respectively, of \(\alpha \). Therefore, the requirements of Lemma 4 are satisfied.
The next two blocks spanning from Line 12 to Line 17 compute scalar numbers according to Lemma 4(b), the definition of \(\varPsi \), the first two derivatives thereof given explicitly in the proof of Lemma 3(c), the definition of \(\tilde{\varPsi }\) and its derivative given by the chain rule. In Line 18, it is checked whether the conditions from Lemma 4(c) hold using the statements from Lemma 4(b). The result is stored in the Boolean variable “\({{\mathrm{finished}}}\)”.
Finally all computed numbers are passed back for further processing. Algorithm 2 clearly needs time linear in \(n\) and only constant additional space.
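A simplified re-implementation of this scan (assuming NumPy; function and variable names are illustrative, and the sketch omits the derivative outputs of the actual Algorithm 2) can look as follows:

```python
import numpy as np

def auxiliary(x, alpha, lam1, lam2):
    """One O(n) pass: evaluate the auxiliary function at alpha and test Lemma 4(c)."""
    l1 = l2sq = 0.0
    d = 0
    x_j, x_k = 0.0, np.inf                  # left and right neighbours of alpha in x
    for xi in x:                            # single scan, constant additional space
        if xi > alpha:
            l1 += xi
            l2sq += xi * xi
            d += 1
            if xi < x_k:
                x_k = xi
        elif xi > x_j:
            x_j = xi

    def norms(offset):                      # ||q(offset)||_1 and ||q(offset)||_2, Lemma 4(b)
        return l1 - d * offset, np.sqrt(l2sq - 2.0 * offset * l1 + d * offset * offset)

    l1_a, l2_a = norms(alpha)
    l1_j, l2_j = norms(x_j)
    l1_k, l2_k = norms(x_k)
    psi_val = l1_a / l2_a - lam1 / lam2
    finished = (lam2 * l1_j >= lam1 * l2_j) and (lam2 * l1_k < lam1 * l2_k)
    return psi_val, finished, l1, l2sq, d
```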
Algorithm 3 performs the actual projection in-place, and outputs values needed for the gradient of the projection. It uses Algorithm 2 as sub-program by calls to the function “\({{\mathrm{auxiliary}}}\)”. The algorithm first checks whether \(\varPsi (0) \le 0\), which is fulfilled when \(\sigma (x) \ge \sigma ^*\). In this case, all coordinates survive the projection and computation of \(\alpha ^*\) is straightforward with Theorem 1 using \(I = \{1,\ldots ,n\}\).
Otherwise, Lemma 3(d) states that \(\alpha ^*\in \left( 0,\ x_{\mathrm{2nd-max }}\right) \). We can find \(\alpha ^*\) numerically with standard root-finding algorithms since \(\varPsi \) is continuous and strictly decreasing on the interval \(\left( 0,\ x_{\mathrm{2nd-max }}\right) \). The concrete variant is chosen by the parameter “\({{\mathrm{solver}}}\)” of Algorithm 3, implementation details can be found in Traub (1964) and Press et al. (2007).
Here, the root-finding loop starting at Line 5 is terminated once Algorithm 2 indicates that the correct interval for exact computation of the zero \(\alpha ^*\) has been identified. It is therefore not necessary to carry out root-finding until numerical convergence, it is enough to only come sufficiently close to \(\alpha ^*\). Line 19 computes \(\alpha ^*\) based on the projection representation given in Theorem 1. This line is either reached directly from Line 2 if \(\sigma (x) \ge \sigma ^*\), or when the statements from Lemma 4(c) hold. The block starting at Line 20 computes \(\max \left( x - \alpha ^*\cdot e,\ 0\right) \) and stores this point’s squared Euclidean norm in the variable \(\rho \). Line 24 computes the number \(\beta ^*\) from Theorem 1 and multiplies every entry of \(\max \left( x - \alpha ^*\cdot e,\ 0\right) \) with it, such that \(x\) finally contains the projection onto \(T\). It would also be possible to create a new vector for the projection result and leave the input vector untouched, at the expense of additional memory requirements which are linear in the problem dimensionality.
When \({{\mathrm{solver}}}= {{\mathrm{Bisection}}}\), the loop in Line 5 is repeated a constant number of times regardless of \(n\) (see Sect. 2.3), and since Algorithm 2 terminates in time linear in \(n\), Algorithm 3 needs time only linear in \(n\). Further, the amount of additional memory needed is independent of \(n\), as for Algorithm 2, such that the overall space requirements are constant. Therefore, Algorithm 3 is asymptotically optimal in the sense of complexity theory. \(\square \)
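Combining the scan with bisection gives a compact sketch of the overall projection (again assuming NumPy and the `auxiliary` routine from above; this is a simplified variant, not the paper's exact Algorithm 3, and it assumes a non-degenerate input with at least two distinct positive entries):

```python
import numpy as np

def project_sparseness(x, lam1, lam2, iters=60):
    """Project x >= 0 onto T = {s >= 0 : ||s||_1 = lam1, ||s||_2 = lam2}."""
    n = x.size
    psi0, finished, l1, l2sq, d = auxiliary(x, 0.0, lam1, lam2)
    if psi0 <= 0.0:
        # sparseness-decreasing case: every coordinate survives, I = {1, ..., n}
        l1, l2sq, d = x.sum(), float(np.dot(x, x)), n
    elif not finished:
        lo, hi = 0.0, np.partition(x, -2)[-2]        # alpha* lies left of the 2nd largest
        for _ in range(iters):                       # bisection on the decreasing psi
            mid = 0.5 * (lo + hi)
            psi_m, finished, l1, l2sq, d = auxiliary(x, mid, lam1, lam2)
            if finished:
                break
            if psi_m > 0.0:
                lo = mid
            else:
                hi = mid
    # exact offset and scale from Theorem 1 for the support that was identified
    a = d * l2sq - l1**2
    b = d * lam2**2 - lam1**2
    alpha = (l1 - lam1 * np.sqrt(a / b)) / d
    beta = np.sqrt(b / a)
    return beta * np.maximum(x - alpha, 0.0)
```

Since the number of bisection steps is fixed and each call to `auxiliary` is a single \(O(n)\) pass with constant extra memory, the sketch mirrors the linear-time, constant-space behavior stated in Theorem 5.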
1.4 Gradient of the Projection
Thom and Palm (2013) have shown that the projection onto \(T\) is single-valued almost everywhere, so it can be grasped as a function, and that this function is differentiable almost everywhere. An explicit expression for the projection’s gradient was derived, which depended on the number of alternating projections required for carrying out the projection. Based on the characterization we gained through Theorem 1, we can derive a much simpler expression for the gradient which is also more efficient to compute:
Theorem 6
Let \(x\in \mathbb {R}_{\ge 0}^n\backslash T\) such that \(p := {{\mathrm{proj}}}_T(x)\) is unique. Let \(\alpha ^*,\beta ^*\in \mathbb {R}\), \(I\subseteq \{1,\ldots ,n\}\) and \(d := \left| I \right| \) be given as in Theorem 1. When \(x_i\ne \alpha ^*\) for all \(i\in \{1,\ldots ,n\}\), then \({{\mathrm{proj}}}_T\) is differentiable in \(x\) with \(\frac{\partial {{\mathrm{proj}}}_T}{\partial x}(x) = V_I^TGV_I\), where the matrix \(G\in \mathbb {R}^{d\times d}\) is given by \(G = \sqrt{\frac{b}{a}}\left( E_d - \frac{1}{d}\tilde{e}\tilde{e}^T\right) - \frac{1}{d\sqrt{ab}}\left( d\tilde{p} - \lambda _1\tilde{e}\right) \left( d\tilde{p} - \lambda _1\tilde{e}\right) ^T\)
with \(a := d\left\| V_I x\right\| _2^2 - \left\| V_I x\right\| _1^2\in \mathbb {R}_{\ge 0}\) and \(b := d\lambda _2^2 - \lambda _1^2\in \mathbb {R}_{\ge 0}\). Here, \(E_d\in \mathbb {R}^{d\times d}\) is the identity matrix, \(\tilde{e} := V_I e\in \{1\}^d\) is the point where all coordinates are unity, and \(\tilde{p} := V_I p\in \mathbb {R}_{>0}^d\).
Proof
When \(x_i\ne \alpha ^*\) for all \(i\in \{1,\ldots ,n\}\), then \(I\) and \(d\) are invariant to local changes in \(x\). Therefore, \(\alpha ^*\), \(\beta ^*\) and \({{\mathrm{proj}}}_T\) are differentiable as composition of differentiable functions. In the following, we derive the claimed expression of the gradient of \({{\mathrm{proj}}}_T\) in \(x\).
Let \(\tilde{x} := V_I x\in \mathbb {R}^d\), then . Define \(\tilde{q} \!:=\! V_I\cdot \max \left( x \!-\! \alpha ^*\cdot e,\ 0\right) \), then \(\tilde{q} \!=\! \tilde{x} \!-\! \alpha ^*\cdot \tilde{e}\!\in \!\mathbb {R}_{> 0}^d\) because \(x_i > \alpha ^*\) for all \(i\in I\). Further , and we have \(p = V_I^TV_I p = V_I^T\tilde{p}\) since \(p_i = 0\) for all \(i\not \in I\). Application of the chain rule yields
and with follows , thus it only remains to show \(G = H\).
One obtains . Since all the entries of \(\tilde{q}\) and \(\tilde{x}\) are positive, their \(L_1\) norms equal the dot product with \(\tilde{e}\). We have , and we obtain that . Now we can compute the gradient of \(\alpha \). Clearly \(b\) is independent of \(\tilde{x}\). It is . Since \(\tilde{x} = \tilde{q} + \alpha ^*\cdot \tilde{e}\) follows , and hence . Therefore, we have .
By substitution into \(H\) and multiplying out we see that
where and have been used. The claim then follows with and . \(\square \)
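A quick way to validate the closed form of \(G\) against finite differences is sketched below (assuming NumPy and the `project_sparseness` routine sketched earlier; parameter values are illustrative, and the check presumes a generic input where the support does not change under the perturbation):

```python
import numpy as np

def projection_jacobian(x, p, lam1, lam2):
    """Jacobian of the projection at x, built from the closed form of Theorem 6."""
    support = p > 0.0
    d = int(support.sum())
    xt, pt = x[support], p[support]
    a = d * np.dot(xt, xt) - xt.sum()**2
    b = d * lam2**2 - lam1**2
    e = np.ones(d)
    v = d * pt - lam1 * e
    G = np.sqrt(b / a) * (np.eye(d) - np.outer(e, e) / d) - np.outer(v, v) / (d * np.sqrt(a * b))
    J = np.zeros((x.size, x.size))
    J[np.ix_(support, support)] = G                   # J = V_I^T G V_I
    return J

rng = np.random.default_rng(0)
n, lam2 = 8, 1.0
lam1 = lam2 * (np.sqrt(n) - 0.7 * (np.sqrt(n) - 1.0))  # target sparseness degree 0.7
x = rng.random(n)
p = project_sparseness(x, lam1, lam2)
J = projection_jacobian(x, p, lam1, lam2)
eps = 1e-6
J_fd = np.column_stack([(project_sparseness(x + eps * np.eye(n)[:, i], lam1, lam2) - p) / eps
                        for i in range(n)])
print(np.max(np.abs(J - J_fd)))                       # should be close to zero
```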
The gradient has a particularly simple form, as it is essentially a scaled identity matrix combined additively with scaled dyadic products of simple vectors. In the situation where not the entire gradient but merely its product with an arbitrary vector is required (as for example in conjunction with the backpropagation algorithm), simple vector operations are already enough to compute the product:
Corollary 7
Algorithm 4 computes the product of the gradient of the sparseness projection with an arbitrary vector in time and space linear in the problem dimensionality \(n\).
Proof
Let \(y\in \mathbb {R}^n\) be an arbitrary vector in the situation of Theorem 6, and define \(\tilde{y} := V_I y\in \mathbb {R}^d\). Then one obtains \(\frac{\partial {{\mathrm{proj}}}_T}{\partial x}(x)\cdot y = V_I^TG\tilde{y}\) with \(G\tilde{y} = \sqrt{\frac{b}{a}}\cdot \tilde{y} + \frac{\lambda _1\tilde{e}^T\tilde{y} - d\,\tilde{p}^T\tilde{y}}{\sqrt{ab}}\cdot \tilde{p} + \frac{\lambda _1\tilde{p}^T\tilde{y} - \lambda _2^2\,\tilde{e}^T\tilde{y}}{\sqrt{ab}}\cdot \tilde{e}\).
Algorithm 4 starts by computing the sliced vectors \(\tilde{p}\) and \(\tilde{y}\), and computes “\(\mathrm {sum}_{\tilde{y}}\)” which equals \(\tilde{e}^T\tilde{y}\) and “\(\mathrm {scp}_{\tilde{p},\,\tilde{y}}\)” which equals \(\tilde{p}^T\tilde{y}\) after Line 5. It then computes \(a\) and \(b\) using the numbers output by Algorithm 3. From Line 7 to Line 9, the product \(G\tilde{y}\) is computed in-place by scaling of \(\tilde{y}\), adding a scaled version of \(\tilde{p}\), and adding a scalar to each coordinate. Since \(\frac{\partial {{\mathrm{proj}}}_T}{\partial x}(x)\cdot y = V_I^TG\tilde{y}\), it just remains to invert the slicing. The complexity of the algorithm is clearly linear, both in time and space. \(\square \)
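A matrix-free sketch of this product (assuming NumPy; the support is recovered from the projection result, and the function name is illustrative) needs only two dot products besides elementary vector operations:

```python
import numpy as np

def gradient_times_vector(x, p, y, lam1, lam2):
    """Compute (d proj_T / d x)(x) . y without forming the d-by-d matrix G."""
    support = p > 0.0
    d = int(support.sum())
    xt, pt, yt = x[support], p[support], y[support]
    a = d * np.dot(xt, xt) - xt.sum()**2
    b = d * lam2**2 - lam1**2
    sum_y = yt.sum()                         # corresponds to e~^T y~
    scp_py = np.dot(pt, yt)                  # corresponds to p~^T y~
    g = (np.sqrt(b / a) * yt
         + ((lam1 * sum_y - d * scp_py) / np.sqrt(a * b)) * pt
         + ((lam1 * scp_py - lam2**2 * sum_y) / np.sqrt(a * b)) * np.ones(d))
    out = np.zeros_like(x)
    out[support] = g                         # invert the slicing: V_I^T (G y~)
    return out
```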
It has not escaped our notice that Corollary 7 can also be used to determine the eigensystem of the projection’s gradient, which may prove useful for further analysis of gradient-based learning methods involving the sparseness-enforcing projection operator.