1 Introduction

In Statistics, the field of nonparametric inference under shape constraints dates back at least to [33], who studied the nonparametric maximum likelihood estimator of a decreasing density on the non-negative half line. But it is really over the last decade or so that researchers have begun to realize its full potential for addressing key contemporary data challenges such as (multivariate) density estimation and regression. The initial allure is the flexibility of a nonparametric model, combined with estimation methods that often avoid the need for tuning parameter selection, which can be troublesome for other nonparametric techniques such as those based on smoothing. Intensive research efforts over recent years have revealed further great attractions: for instance, these procedures frequently attain optimal rates of convergence over relevant function classes. Moreover, it is now known that shape-constrained procedures can possess intriguing adaptation properties, in the sense that they can estimate particular subclasses of functions at faster rates, even (nearly) as well as the best one could do if one were told in advance that the function belonged to this subclass.

Typically, however, the implementation of shape-constrained estimation techniques requires the solution of an optimization problem, and, despite some progress, there are several cases where computation remains a bottleneck and hampers the adoption of these methods by practitioners. In this work, we focus on the problem of log-concave density estimation, which has become arguably the central challenge in the field because the class of log-concave densities enjoys stability properties under marginalization, conditioning, convolution and linear transformations that make it a very natural infinite-dimensional generalization of the class of Gaussian densities [60].

The univariate log-concave density estimation problem was first studied in [68], and fast algorithms for the computation of the log-concave maximum likelihood estimator (MLE) in one dimension are now available through the R packages logcondens [27] and cnmlcd [49]. [20] introduced and studied the multivariate log-concave maximum likelihood estimator, but their algorithm, which is described below and implemented in the R package LogConcDEAD [18], is slow; for instance, [20] report a running time of 50 s for computing the bivariate log-concave MLE with 500 observations, and 224 min for computing the log-concave MLE in four dimensions with 2,000 observations. An alternative interior point method for a suitable approximation was proposed by [46]. Recent progress on theoretical aspects of the computational problem in the computer science community includes [2], who proved that there exists a polynomial time algorithm for computing the log-concave maximum likelihood estimator. We are unaware of any attempt to implement this algorithm. [57] compute an approximation to the log-concave MLE by considering \(-\log p\) as a piecewise affine maximum function, using the log-sum-exp operator to approximate the non-smooth maximum, a Riemann sum to compute the integral and its gradient, and obtaining a solution via L-BFGS. This reformulation means that the problem is no longer convex.

To describe the problem more formally, let \({\mathcal {C}}_d\) denote the class of proper, convex lower-semicontinuous functions \(\varphi :\mathbb {R}^d \rightarrow (-\infty ,\infty ]\) that are coercive in the sense that \(\varphi ({\varvec{x}}) \rightarrow \infty \) as \(\Vert {\varvec{x}}\Vert \rightarrow \infty \). The class of upper semi-continuous log-concave densities on \(\mathbb {R}^d\) is denoted as

$$\begin{aligned} {\mathcal {P}}_d:= \biggl \{p:\mathbb {R}^d \rightarrow [0,\infty ): p = e^{-\varphi } \text { for some } \varphi \in {\mathcal {C}}_d, \int _{\mathbb {R}^d} p = 1\biggr \}. \end{aligned}$$

Given \({\varvec{x}}_1,\ldots ,{\varvec{x}}_n \in \mathbb {R}^d\), [20, Theorem 1] proved that whenever the convex hull \(C_n\) of \({\varvec{x}}_1,\ldots ,{\varvec{x}}_n\) is d-dimensional, there exists a unique

$$\begin{aligned} {\hat{p}}_n \in \mathop {\textrm{argmax}}\limits _{p \in {\mathcal {P}}_d} \frac{1}{n}\sum _{i=1}^n \log p({\varvec{x}}_i). \end{aligned}$$
(1)

If \({\varvec{x}}_1,\ldots ,{\varvec{x}}_n\) are regarded as realizations of independent and identically distributed random vectors on \(\mathbb {R}^d\), then the objective function in (1) is a scaled version of the log-likelihood function, so \({\hat{p}}_n\) is called the log-concave MLE. The existence and uniqueness of this estimator are not obvious, because the infinite-dimensional class \({\mathcal {P}}_d\) is non-convex, and even the class of negative log-densities \(\bigl \{\varphi \in {\mathcal {C}}_d:\int _{\mathbb {R}^d} e^{-\varphi } = 1\bigr \}\) is non-convex. In fact, the estimator belongs to a finite-dimensional subclass; more precisely, for a vector \({\varvec{\phi }}= (\phi _1,\ldots ,\phi _n) \in \mathbb {R}^n\), define \(\textrm{cef}[{\varvec{\phi }}] \in {\mathcal {C}}_d\) to be the (pointwise) largest function with

$$\begin{aligned} \textrm{cef}[{\varvec{\phi }}]({\varvec{x}}_i) \le \phi _i \end{aligned}$$

for \(i=1,\ldots ,n\). [20] proved that \({\hat{p}}_n = e^{-\textrm{cef}[{\varvec{\phi }}^*]}\) for some \({\varvec{\phi }}^* \in \mathbb {R}^n\), and refer to the function \(-\textrm{cef}[{\varvec{\phi }}^*]\) as a ‘tent function’; see the illustration in Fig. 1. [20] further defined the non-smooth, convex objective function \(f:\mathbb {R}^n \rightarrow \mathbb {R}\) by

$$\begin{aligned} f({\varvec{\phi }}) \equiv f(\phi _1,\ldots ,\phi _n):= \frac{1}{n}\sum _{i=1}^n \phi _i + \int _{C_n} \exp \bigl \{-\textrm{cef}[{\varvec{\phi }}](x)\bigr \} \, dx, \end{aligned}$$
(2)

and proved that \({\varvec{\phi }}^* = \mathop {\textrm{argmin}}\nolimits _{{\varvec{\phi }}\in \mathbb {R}^n} f({\varvec{\phi }})\).

Fig. 1 An illustration of a tent function, taken from [20]

The two main challenges in optimizing the objective function f in (2) are that the value and subgradient of the integral term are hard to evaluate, and that it is non-smooth, so vanilla subgradient methods lead to a slow rate of convergence. To address the first issue, [20] computed the exact integral and its subgradient using the qhull algorithm [4] to obtain a triangulation of the convex hull of the data, evaluating the function value and subgradient over each simplex in the triangulation. However, in the worst case, the triangulation can have \(O(n^{d/2})\) simplices [50]. The non-smoothness is handled via Shor’s r-algorithm [66, Chapter 3], as implemented by [42]. In Sect. 2, we characterize the subdifferential of the objective function in terms of the solution of a linear program (LP), and show that the solution lies in a known, compact subset of \(\mathbb {R}^n\). This understanding allows us to introduce our new computational framework for log-concave density estimation in Sect. 3, based on an accelerated version of a dual averaging approach [53]. This relies on smoothing the objective function, and encompasses two popular strategies, namely Nesterov smoothing [52] and randomized smoothing [25, 48, 73], as special cases. A further feature of our algorithm is the construction of approximations to gradients of our smoothed objective, and this in turn requires an approximation to the integral in (2). While a direct application of the theory of [25] would yield a rate of convergence for the objective function of order \(n^{1/4}/T + 1/\sqrt{T}\) after T iterations, we show in Sect. 4 that by introducing finer approximations of both the integral and its gradient as the iteration number increases, we can obtain an improved rate of order 1/T, up to logarithmic factors. Moreover, we translate the optimization error in the objective into a bound on the error in the log-density, which is uncommon in the literature in the absence of strong convexity. A further advantage of our approach is that we are able to extend it in Sect. 5 to the more general problem of quasi-concave density estimation [46, 65], thereby providing a computationally tractable alternative to the discrete Hessian approach of [46]. Section 6 illustrates the practical benefits of our methodology in terms of improved computational timings on simulated data. Additional experimental details and applications on real data sets are provided in Appendix A. Proofs of all main results can be found in Appendix B, and background on the field of nonparametric inference under shape constraints can be found in Appendix C.

Notation: We write \([n]:= \{1,2,\ldots , n\}\), let \({\varvec{1}} \in \mathbb {R}^n\) denote the all-ones vector, and denote the cardinality of a set S by |S|. For a Borel measurable set \(C\subseteq \mathbb {R}^d\), we use \(\textrm{vol}(C)\) to denote its volume (i.e. d-dimensional Lebesgue measure). We write \(\Vert \cdot \Vert \) for the Euclidean norm of a vector. For \(\mu > 0\), a convex function \(f:\mathbb {R}^n \rightarrow \mathbb {R}\) is said to be \(\mu \)-strongly convex if \({\varvec{\phi }}\mapsto f({\varvec{\phi }})-\frac{\mu }{2}\Vert {\varvec{\phi }}\Vert ^2\) is convex. The notation \(\partial f({\varvec{\phi }})\) denotes the subdifferential (set of subgradients) of f at \({\varvec{\phi }}\). Given a real-valued sequence \((a_n)\) and a positive sequence \((b_n)\), we write \(a_n = {\tilde{O}}(b_n)\) if there exist \(C,\gamma > 0\) such that \(a_n \le C b_n \log ^\gamma (1+n)\) for all \(n \in \mathbb {N}\).

2 Understanding the structure of the optimization problem

Throughout this paper, we assume that \({\varvec{x}}_1,\ldots ,{\varvec{x}}_n \in \mathbb {R}^d\) are distinct and that their convex hull \(C_n:=\textrm{conv}({\varvec{x}}_1,\ldots ,{\varvec{x}}_n)\) has nonempty interior, so that \(n \ge d+1\) and \(\varDelta := \textrm{vol}(C_n)>0\). This latter assumption ensures the existence and uniqueness of a minimizer of the objective function in (2) [28, Theorem 2.2]. Recall that we define the lower convex envelope function [59] \(\textrm{cef}: \mathbb {R}^n \rightarrow {\mathcal {C}}_d\) by

$$\begin{aligned} \textrm{cef}[{\varvec{\phi }}]({\varvec{x}}) \equiv \textrm{cef}[(\phi _1,\ldots ,\phi _n)]({\varvec{x}}):= \sup \bigl \{g({\varvec{x}}):g\in {\mathcal {C}}_d, g({\varvec{x}}_i)\le \phi _i~\forall i \in [n]\bigr \}. \end{aligned}$$
(3)

As mentioned in the introduction, in computing the MLE, we seek

$$\begin{aligned} {\varvec{\phi }}^*:= \mathop {\textrm{argmin}}\limits _{{\varvec{\phi }}\in \mathbb {R}^n} f({\varvec{\phi }}), \end{aligned}$$
(4)

where

$$\begin{aligned} f({\varvec{\phi }}):= \frac{1}{n}{\varvec{1}}^\top {\varvec{\phi }}+\int _{C_n}\exp \{-\textrm{cef}[{\varvec{\phi }}]({\varvec{x}})\}\, d{\varvec{x}} =: \frac{1}{n}{\varvec{1}}^\top {\varvec{\phi }}+I({\varvec{\phi }}). \end{aligned}$$
(5)

Note that (4) can be viewed as a stochastic optimization problem by writing

$$\begin{aligned} f({\varvec{\phi }})=\mathbb {E}F({\varvec{\phi }},{\varvec{\xi }}), \end{aligned}$$
(6)

where \({\varvec{\xi }}\) is uniformly distributed on \(C_n\) and where, for \({\varvec{x}}\in C_n\),

$$\begin{aligned} F({\varvec{\phi }},{\varvec{x}}):=\frac{1}{n}{\varvec{1}}^\top {\varvec{\phi }}+\varDelta e^{-\textrm{cef}[{\varvec{\phi }}]({\varvec{x}})}. \end{aligned}$$
(7)

Let \({\varvec{X}}:= [{\varvec{x}}_1 \, \cdots \, {\varvec{x}}_n]^\top \in \mathbb {R}^{n \times d}\), and for \({\varvec{x}}\in \mathbb {R}^d\), let \(E({\varvec{x}}):= \bigl \{{\varvec{\alpha }}\in \mathbb {R}^n:{\varvec{X}}^\top {\varvec{\alpha }}= {\varvec{x}}, {\varvec{1}}_n^\top {\varvec{\alpha }}=1, {\varvec{\alpha }}\ge 0\bigr \}\) denote the set of all weight vectors for which \({\varvec{x}}\) can be written as a weighted convex combination of \({\varvec{x}}_1,\ldots ,{\varvec{x}}_n\). Thus \(E({\varvec{x}})\) is a compact, convex subset of \(\mathbb {R}^n\). The \(\textrm{cef}\) function is given by a linear program (LP) [2, 46]:

$$\begin{aligned} \textrm{cef}[{\varvec{\phi }}]({\varvec{x}}) = \inf _{{\varvec{\alpha }}\in E({\varvec{x}})} {\varvec{\alpha }}^\top {\varvec{\phi }}. \end{aligned}$$
(\(Q_0\))

If \({\varvec{x}}\notin C_n\), then \(E({\varvec{x}}) =\emptyset \), and, with the standard convention that \(\inf \emptyset := \infty \), we see that (\(Q_0\)) agrees with (3). From the LP formulation, it follows that \({\varvec{\phi }}\mapsto \textrm{cef}[{\varvec{\phi }}]({\varvec{x}})\) is concave, for every \({\varvec{x}}\in \mathbb {R}^d\).

Given a pair \({\varvec{\phi }}\in \mathbb {R}^n\) and \({\varvec{x}}\in C_n\), an optimal solution to (\(Q_0\)) may not be unique, in which case the map \({\varvec{\phi }}\mapsto \textrm{cef}[{\varvec{\phi }}]({\varvec{x}})\) is not differentiable [7, Proposition B.25(b)]. Noting that the infimum in (\(Q_0\)) is attained whenever \({\varvec{x}}\in C_n\), let

$$\begin{aligned} A[{\varvec{\phi }}]({\varvec{x}})&:= \textrm{conv}\bigl (\bigl \{{\varvec{\alpha }}\in E({\varvec{x}}): {\varvec{\alpha }}^\top {\varvec{\phi }}= \textrm{cef}[{\varvec{\phi }}]({\varvec{x}})\bigr \}\bigr ) \\&= \bigl \{{\varvec{\alpha }}\in E({\varvec{x}}): {\varvec{\alpha }}^\top {\varvec{\phi }}= \textrm{cef}[{\varvec{\phi }}]({\varvec{x}})\bigr \}. \end{aligned}$$

Danskin’s theorem [7, Proposition B.25(b)] applied to \(-\textrm{cef}[{\varvec{\phi }}]({\varvec{x}})\) then yields that for each \({\varvec{x}}\in C_n\), the subdifferential of \(F({\varvec{\phi }},{\varvec{x}})\) with respect to \({\varvec{\phi }}\) is given by

$$\begin{aligned} \partial F({\varvec{\phi }},{\varvec{x}}):=\biggl \{\frac{1}{n}{\varvec{1}}-\varDelta e^{-\textrm{cef}[{\varvec{\phi }}]({\varvec{x}})}{\varvec{\alpha }}:{\varvec{\alpha }}\in A[{\varvec{\phi }}]({\varvec{x}})\biggr \}. \end{aligned}$$
(8)

Since both f and \(F(\cdot ,{\varvec{x}})\) are finite convex functions on \(\mathbb {R}^n\) (for each fixed \({\varvec{x}}\in C_n\) in the latter case), by [17, Proposition 2.3.6(b) and Theorem 2.7.2], the subdifferential of f at \({\varvec{\phi }}\in \mathbb {R}^n\) is given by

$$\begin{aligned} \partial f({\varvec{\phi }}):= \bigl \{\mathbb {E}{\varvec{G}}({\varvec{\phi }},{\varvec{\xi }}):{\varvec{G}}({\varvec{\phi }},{\varvec{x}}) \in \partial F({\varvec{\phi }},{\varvec{x}}) \text { for each } {\varvec{x}}\in C_n\bigr \}. \end{aligned}$$
(9)
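To make the LP characterization concrete, the following minimal sketch (in Python, with helper names of our own choosing; it is not the implementation accompanying this paper) evaluates \(\textrm{cef}[{\varvec{\phi }}]({\varvec{x}})\) by solving (\(Q_0\)) with scipy's linear programming routine, and then forms one element of the subdifferential (8); averaging such vectors over points \({\varvec{x}}\in C_n\) gives an estimate of a subgradient of f as in (9). The argument volume stands for \(\varDelta = \textrm{vol}(C_n)\) and is assumed to be supplied by the user.

# A minimal sketch (not the paper's implementation): evaluate cef[phi](x) by
# solving the LP (Q_0) with scipy, and form one element of the subdifferential
# (8) of F(phi, x) = (1/n) 1^T phi + Delta * exp(-cef[phi](x)).
import numpy as np
from scipy.optimize import linprog

def cef_lp(phi, X, x):
    """Solve (Q_0): minimise alpha^T phi over E(x).

    Returns (cef[phi](x), alpha), or (np.inf, None) if x lies outside C_n."""
    n, d = X.shape
    A_eq = np.vstack([X.T, np.ones((1, n))])      # X^T alpha = x, 1^T alpha = 1
    b_eq = np.append(x, 1.0)
    res = linprog(phi, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n,
                  method="highs")
    if not res.success:                           # E(x) empty: x outside the hull
        return np.inf, None
    return res.fun, res.x

def subgrad_F(phi, X, x, volume):
    """One element of the subdifferential (8) of F(phi, x)."""
    n = X.shape[0]
    val, alpha = cef_lp(phi, X, x)
    if alpha is None:
        raise ValueError("x lies outside the convex hull of the data")
    return np.ones(n) / n - volume * np.exp(-val) * alpha

# Toy usage with d = 2: evaluate at the centroid of the data.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 2))
phi = rng.uniform(size=20)
print(subgrad_F(phi, X, X.mean(axis=0), volume=1.0))  # volume is a placeholder for Delta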

Observe that given any \({\varvec{\phi }}\in \mathbb {R}^n\), the function \({\varvec{x}}\mapsto -\textrm{cef}\bigl [{\varvec{\phi }}+ \log I({\varvec{\phi }}){\varvec{1}}\bigr ]({\varvec{x}})\) (where \(I({\varvec{\phi }})\) is the integral defined in (5)) is a log-density. It is also convenient to let \({\bar{{\varvec{\phi }}}} \in \mathbb {R}^n\) be such that \(\exp \{-\textrm{cef}[{\bar{{\varvec{\phi }}}}]\}\) is the uniform density on \(C_n\), so that \(f({{\bar{{\varvec{\phi }}}}})=\log \varDelta +1\). Proposition 1 below (an extension of [2, Lemma 2]) provides uniform upper and lower bounds on this log-density, whenever the objective function f evaluated at \({\varvec{\phi }}\) is at least as good as that at \({\bar{{\varvec{\phi }}}}\). In more statistical language, these bounds hold whenever the log-likelihood of the density \(\exp \bigl \{-\textrm{cef}\bigl [{\varvec{\phi }}+ \log I({\varvec{\phi }}){\varvec{1}}\bigr ](\cdot )\bigr \}\) is at least as large as that of the uniform density on the convex hull of the data, so in particular, they must hold for the log-concave MLE (i.e. when \({\varvec{\phi }}= {\varvec{\phi }}^*\)). Let \(\phi ^0:= (n-1) +d(n-1)\log \bigl (2n + 2nd \log (2nd)\bigr ) +\log \varDelta \) and \(\phi _0:=-1 - d \log \bigl (2n + 2nd \log (2nd)\bigr ) +\log \varDelta \).

Proposition 1

For any \({\varvec{\phi }}\in \mathbb {R}^n\) such that \(f({\varvec{\phi }})\le \log \varDelta +1\), we have \(\phi _0\le \phi _i+\log I({\varvec{\phi }})\le \phi ^0\) for all \(i \in [n]\).

The following corollary is an immediate consequence of Proposition 1.

Corollary 1

Suppose that \({\varvec{\phi }}\in \mathbb {R}^n\) satisfies \(I({\varvec{\phi }})=1\) and \(f({\varvec{\phi }})\le f({{\bar{{\varvec{\phi }}}}}) = \log \varDelta +1\). Then \({\varvec{\phi }}^* \in \mathbb {R}^n\) defined in (4) satisfies

$$\begin{aligned} \Vert {\varvec{\phi }}-{\varvec{\phi }}^*\Vert \le \sqrt{n}(\phi ^0-\phi _0). \end{aligned}$$

Corollary 1 gives a sense in which any \({\varvec{\phi }}\in \mathbb {R}^n\) for which the objective function is ‘good’ cannot be too far from the optimizer \({\varvec{\phi }}^*\); here, ‘good’ means that the objective should be no larger than that of the uniform density on the convex hull of the data. Moreover, an upper bound on the integral \(I({\varvec{\phi }})\) provides an upper bound on the norm of any subgradient \({\varvec{g}}({\varvec{\phi }})\) of f at \({\varvec{\phi }}\).

Proposition 2

Any subgradient \({\varvec{g}}({\varvec{\phi }}) \in \mathbb {R}^n\) of f at \({\varvec{\phi }}\in \mathbb {R}^n\) satisfies \(\Vert {\varvec{g}}({\varvec{\phi }})\Vert ^2\le \max \bigl \{1/n + 1/4,I({\varvec{\phi }})^2\bigr \}.\)

3 Computing the log-concave MLE

As mentioned in the introduction, subgradient methods [56, 66] tend to be slow for minimizing the objective function f defined in (5) [20]. Our alternative approach involves minimizing the representation of f given in (6) via smoothing techniques, which offer superior computational guarantees and practical performance in our numerical experiments.

3.1 Smoothing techniques

We present two smoothing techniques to find the minimizer \({\varvec{\phi }}^* \in \mathbb {R}^n\) of the nonsmooth convex optimization problem (4). By Proposition 1, we have that \({\varvec{\phi }}^* \in {{\varvec{\varPhi }}}\), where

$$\begin{aligned} {\varvec{\varPhi }}:= \{{\varvec{\phi }}=(\phi _1,\ldots ,\phi _n)\in \mathbb {R}^n:\phi _0 \le \phi _i\le \phi ^0 \text { for } i \in [n]\}, \end{aligned}$$
(10)

where \(\phi _0,\phi ^0 \in \mathbb {R}\) are as defined immediately before Proposition 1. The first technique is based on Nesterov smoothing [52] and the second on randomized smoothing [25].

3.1.1 Nesterov smoothing

Recall that the non-differentiability of f in (5) is due to the LP (\(Q_0\)) potentially having multiple optimal solutions. Therefore, following [52], we consider replacing this LP with the following quadratic program (QP):

$$\begin{aligned} q_u[{\varvec{\phi }}]({\varvec{x}}):= \inf _{{\varvec{\alpha }}\in E({\varvec{x}})} \Bigl \{{\varvec{\alpha }}^\top {\varvec{\phi }}+\frac{u}{2}\Vert {\varvec{\alpha }}-{\varvec{\alpha }}_0\Vert ^2\Bigr \}, \end{aligned}$$
(\(Q_u\))

where \({\varvec{\alpha }}_0:=(1/n){\varvec{1}}\in \mathbb {R}^n\) is the center of the probability simplex containing \(E({\varvec{x}})\), and where \(u\ge 0\) is a parameter that controls the extent of the quadratic regularization of the objective. With this definition, we have \(q_0[{\varvec{\phi }}]({\varvec{x}})=\textrm{cef}[{\varvec{\phi }}]({\varvec{x}})\). For \(u>0\), due to the strong convexity of the function \({\varvec{\alpha }} \mapsto {\varvec{\alpha }}^\top {\varvec{\phi }}+ ({u}/2)\Vert {\varvec{\alpha }}-{\varvec{\alpha }}_0\Vert ^2\) on the convex polytope \(E({\varvec{x}})\), (\(Q_u\)) admits a unique solution that we denote by \({\varvec{\alpha }}^*_u[{\varvec{\phi }}]({\varvec{x}})\). It follows again from Danskin's theorem that \({\varvec{\phi }}\mapsto q_u[{\varvec{\phi }}]({\varvec{x}})\) is differentiable for such u, with gradient \(\nabla _{{\varvec{\phi }}} q_u[{\varvec{\phi }}]({\varvec{x}}) = {\varvec{\alpha }}^*_u[{\varvec{\phi }}]({\varvec{x}})\).
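As an illustration, (\(Q_u\)) can be solved with any convex QP solver; the sketch below uses cvxpy purely for readability (the experiments in Sect. 6 use Gurobi [36], and the function name is ours). It returns both \(q_u[{\varvec{\phi }}]({\varvec{x}})\) and the unique minimizer \({\varvec{\alpha }}^*_u[{\varvec{\phi }}]({\varvec{x}})\), which by the discussion above is also the gradient of \({\varvec{\phi }}\mapsto q_u[{\varvec{\phi }}]({\varvec{x}})\).

# A minimal sketch of solving the smoothed problem (Q_u), using cvxpy for
# illustration only (the paper's experiments use Gurobi). Returns q_u[phi](x)
# and the unique minimiser alpha*_u[phi](x), which equals grad_phi q_u[phi](x).
import numpy as np
import cvxpy as cp

def q_u(phi, X, x, u):
    n = X.shape[0]
    alpha0 = np.ones(n) / n                       # prox-centre alpha_0 = (1/n)1
    a = cp.Variable(n, nonneg=True)
    objective = cp.Minimize(phi @ a + 0.5 * u * cp.sum_squares(a - alpha0))
    constraints = [X.T @ a == x, cp.sum(a) == 1]  # together with a >= 0: a in E(x)
    prob = cp.Problem(objective, constraints)
    prob.solve()
    return prob.value, a.value                    # (q_u[phi](x), alpha*_u[phi](x))

Averaging \((1/n){\varvec{1}}-\varDelta e^{-q_u[{\varvec{\phi }}]({\varvec{\xi }}_j)}{\varvec{\alpha }}^*_u[{\varvec{\phi }}]({\varvec{\xi }}_j)\) over grid points, as in (15) below, then yields an approximation to the gradient (12).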

Using \(q_{u}[{\varvec{\phi }}]({\varvec{x}})\) instead of \(q_{0}[{\varvec{\phi }}]({\varvec{x}})\) in (5), we obtain a smooth objective \({\varvec{\phi }}\mapsto {\tilde{f}}_u({\varvec{\phi }})\), given by

$$\begin{aligned} {\tilde{f}}_u({\varvec{\phi }}) := \frac{1}{n}{\varvec{1}}^\top {\varvec{\phi }}+\int _{C_n}\exp \{-q_u[{\varvec{\phi }}]({\varvec{x}})\}~d{\varvec{x}}=\mathbb {E}{{\tilde{F}}}_u({\varvec{\phi }},{\varvec{\xi }}), \end{aligned}$$
(11)

where \({\tilde{F}}_u({\varvec{\phi }},{\varvec{x}}):=({1}/{n}){\varvec{1}}^\top {\varvec{\phi }}+\varDelta \exp \{-q_u[{\varvec{\phi }}]({\varvec{x}})\}\), and where \({\varvec{\xi }}\) is again uniformly distributed on \(C_n\). We may differentiate under the integral (e.g. [45, Theorem 6.28]) to see that the partial derivatives of \({\tilde{f}}_u\) with respect to each component of \({\varvec{\phi }}\) exist, and moreover they are continuous (because \({\varvec{\phi }}\mapsto {\varvec{\alpha }}^*_u[{\varvec{\phi }}]({\varvec{x}})\) is continuous by Proposition 5), so \(\nabla _{{\varvec{\phi }}} {\tilde{f}}_u({\varvec{\phi }})= \mathbb {E}[\tilde{{\varvec{G}}}_u({\varvec{\phi }},{\varvec{\xi }})]\), where

$$\begin{aligned} \tilde{{\varvec{G}}}_u({\varvec{\phi }},{\varvec{x}}):= \nabla _{{\varvec{\phi }}} {\tilde{F}}_u({\varvec{\phi }},{\varvec{x}})=\frac{1}{n}{\varvec{1}}-\varDelta e^{-q_u[{\varvec{\phi }}]({\varvec{x}})}\alpha ^*_u[{\varvec{\phi }}]({\varvec{x}}). \end{aligned}$$
(12)

Proposition 3 below presents some properties of the smooth objective \({\tilde{f}}_u\).

Proposition 3

For any \({\varvec{\phi }}\in {{\varvec{\varPhi }}}\), we have

(a) \(0 \le {\tilde{f}}_u({\varvec{\phi }})-{\tilde{f}}_{u'}({\varvec{\phi }})\le \frac{u-u'}{2}e^{u'/2}I({\varvec{\phi }})\) for \(u' \in [0,u]\);

(b) For every \(u \ge 0\), the function \({\varvec{\phi }}\mapsto {\tilde{f}}_u({\varvec{\phi }})\) is convex and \(\varDelta e^{-\phi _0+u/2}\)-Lipschitz;

(c) For every \(u \ge 0\), the function \({\varvec{\phi }}\mapsto {\tilde{f}}_u({\varvec{\phi }})\) has \(\varDelta e^{-\phi _0+u/2}(1+u^{-1})\)-Lipschitz gradient;

(d) \(\mathbb {E}\bigl (\Vert {\tilde{{\varvec{G}}}_u({\varvec{\phi }},{\varvec{\xi }})-\nabla _{{\varvec{\phi }}}{\tilde{f}}_u({\varvec{\phi }})}\Vert ^2\bigr ) \le (\varDelta e^{-\phi _0+u/2})^2\) for every \(u \ge 0\).

3.1.2 Randomized smoothing

Our second smoothing technique is randomized smoothing [25, 48, 73]: we perturb the argument of f randomly and take the expectation. Specifically, for \(u \ge 0\), let

$$\begin{aligned} {\bar{f}}_u({\varvec{\phi }}):=\mathbb {E}f({\varvec{\phi }}+u{\varvec{z}}), \end{aligned}$$
(13)

where \({\varvec{z}}\) is uniformly distributed on the unit \(\ell _{2}\)-ball in \(\mathbb {R}^n\). Thus, similar to Nesterov smoothing, \({\bar{f}}_0 = f\), and the amount of smoothing increases with u. From a stochastic optimization viewpoint, we can write

$$\begin{aligned} {\bar{f}}_u({\varvec{\phi }})=\mathbb {E}F({\varvec{\phi }}+u{\varvec{z}},{\varvec{\xi }})~~~~\text {and}~~~\nabla _{{\varvec{\phi }}} {\bar{f}}_u({\varvec{\phi }})=\mathbb {E}{\varvec{G}}({\varvec{\phi }}+u{\varvec{z}},{\varvec{\xi }}) \end{aligned}$$

where \({\varvec{G}}({\varvec{\phi }}+u{\varvec{v}},{\varvec{x}}) \in \partial F({\varvec{\phi }}+u{\varvec{v}},{\varvec{x}})\), and where the expectations are taken over independent random vectors \({\varvec{z}}\), distributed uniformly on the unit Euclidean ball in \(\mathbb {R}^n\), and \({\varvec{\xi }}\), distributed uniformly on \(C_n\). Here the gradient expression follows from, e.g., [48, Lemma 3.3(a)], [73, Lemma 7]; since \(F({\varvec{\phi }}+u{\varvec{v}},{\varvec{x}})\) is differentiable almost everywhere with respect to \({\varvec{\phi }}\), the expression for \(\nabla _{{\varvec{\phi }}} {\bar{f}}_u({\varvec{\phi }})\) does not depend on the choice of subgradient.
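The sketch below (helper names ours) shows how this gradient expression can be estimated by Monte Carlo: draw \({\varvec{z}}\) uniformly from the unit Euclidean ball, draw \({\varvec{\xi }}\) uniformly from \(C_n\), and average subgradients of F at the perturbed points. The arguments subgrad (a map returning an element of \(\partial F({\varvec{\phi }},{\varvec{x}})\), e.g. the LP-based helper sketched in Sect. 2) and sample_xi (a routine producing uniform draws from \(C_n\)) are assumed to be supplied by the user.

# A minimal sketch (helper names ours): Monte Carlo estimate of the randomized
# smoothing gradient  grad f_bar_u(phi) = E G(phi + u z, xi).
import numpy as np

def uniform_ball(n, rng):
    # Uniform draw from the unit Euclidean ball in R^n: scale a uniform
    # direction by U^{1/n} with U ~ Uniform(0, 1).
    g = rng.standard_normal(n)
    return (g / np.linalg.norm(g)) * rng.uniform() ** (1.0 / n)

def randomized_smoothing_grad(phi, u, m, subgrad, sample_xi, rng):
    n = phi.shape[0]
    est = np.zeros(n)
    for _ in range(m):
        z = uniform_ball(n, rng)          # perturbation for the smoothing
        xi = sample_xi(rng)               # uniform point of C_n for the integral
        est += subgrad(phi + u * z, xi)   # element of the subdifferential of F
    return est / m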

Proposition 4 below lists some properties of \({\bar{f}}_{u}\) and its gradient. It extends [73, Lemmas 7 and 8] by exploiting special properties of the objective function to sharpen the dependence of the bounds on n.

Proposition 4

For any \(u \ge 0\) and \({\varvec{\phi }}\in {{\varvec{\varPhi }}}\), we have

(a) \(0\le {\bar{f}}_u({\varvec{\phi }})-f({\varvec{\phi }})\le I({\varvec{\phi }})ue^u\sqrt{\frac{2\log n}{n+1}}\);

(b) \({\bar{f}}_{u'}({\varvec{\phi }})\le {\bar{f}}_u({\varvec{\phi }})\) for any \(u' \in [0,u]\);

(c) \({\varvec{\phi }}\mapsto {\bar{f}}_u({\varvec{\phi }})\) is convex and \(\varDelta e^{-\phi _0+u}\)-Lipschitz;

(d) \({\varvec{\phi }}\mapsto {\bar{f}}_u({\varvec{\phi }})\) has \(\varDelta e^{-\phi _0+u}n^{1/2}/u\)-Lipschitz gradient;

(e) \(\mathbb {E}\bigl (\bigl \Vert {\varvec{G}}({\varvec{\phi }}+u{\varvec{z}},{\varvec{\xi }})-\nabla {\bar{f}}_u({\varvec{\phi }})\bigr \Vert ^2\bigr ) \le (\varDelta e^{-\phi _0+u})^2\) whenever \({\varvec{G}}({\varvec{\phi }}+u{\varvec{v}},{\varvec{x}}) \in \partial F({\varvec{\phi }}+u{\varvec{v}},{\varvec{x}})\) for every \({\varvec{v}} \in \mathbb {R}^n\) with \(\Vert {\varvec{v}}\Vert \le 1\) and \({\varvec{x}}\in C_n\).

3.2 Stochastic first-order methods for smoothing sequences

Our proposed algorithm for computing the log-concave MLE is given in Algorithm 1. It relies on the choice of a smoothing sequence for f, which may be constructed using Nesterov or randomized smoothing, for instance. For a non-negative sequence \((u_t)_{t \in \mathbb {N}_0}\), this smoothing sequence is denoted by \((\ell _{u_t})_{t \in \mathbb {N}_0}\), where \(\ell _{u_t}:={{\tilde{f}}}_{u_t}\) is given by (11) or \(\ell _{u_t}:={{\bar{f}}}_{u_t}\) is given by (13). In Algorithm 1, \(P_{\varvec{\varPhi }}:\mathbb {R}^n \rightarrow \varvec{\varPhi }\) denotes the projection operator onto the closed convex set \(\varvec{\varPhi }\); since \(\varvec{\varPhi }\) is a coordinate-wise box, this projection simply clips each entry to \([\phi _0,\phi ^0]\). In fact, Algorithm 1 is a modification of an algorithm due to [25], and can be regarded as an accelerated version of the dual averaging scheme [53] applied to \((\ell _{u_t})\).

Algorithm 1 Accelerated stochastic dual averaging on a smoothing sequence with increasing grids
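For orientation, the following is a sketch of a generic accelerated stochastic dual averaging loop over the box \(\varvec{\varPhi }\) in the spirit of [25], with the \(\theta _t\) recursion of Theorem 1 and \(P_{\varvec{\varPhi }}\) implemented as clipping; the precise placement of the parameters \(L_t\) and \(\eta _t\) in the prox step may differ from Algorithm 1, so it should be read as an indicative skeleton rather than a faithful transcription. The argument grad_fn plays the role of Line 3, i.e. it returns the approximate gradient \({\varvec{g}}_t\) of \(\ell _{u_t}\) at the query point.

# A generic accelerated stochastic dual averaging loop over the box Phi, in the
# spirit of [25]; a sketch only, whose exact update and parameter placement may
# differ from Algorithm 1. grad_fn(t, phi) returns the approximate gradient g_t
# of the smoothed objective ell_{u_t} at phi; P_Phi is coordinate-wise clipping.
import numpy as np

def accelerated_dual_averaging(grad_fn, phi_init, phi_lo, phi_hi, L, eta, T):
    theta = 1.0
    x = z = phi_init.copy()
    dual_sum = np.zeros_like(phi_init)            # running sum of g_t / theta_t
    for t in range(T):
        y = (1.0 - theta) * x + theta * z         # query point phi_t^{(y)}
        g = grad_fn(t, y)
        dual_sum += g / theta
        # Prox step with psi(.) = 0.5 * ||. - phi_init||^2; over a box the
        # minimiser is a clipped (projected) point.
        c = L[t] + eta[t] / theta
        z = np.clip(phi_init - dual_sum / c, phi_lo, phi_hi)
        x = (1.0 - theta) * x + theta * z         # iterate phi_{t+1}^{(x)}
        theta = 2.0 / (1.0 + np.sqrt(1.0 + 4.0 / theta ** 2))
    return x

With the choices of Theorem 1, one would take L[t] \(= B_1/u_t\) with \(u_t = \theta _t u\), and a constant eta[t] \(= \eta \).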

3.2.1 Approximating the gradient of the smoothing sequence

In Line 3 of Algorithm 1, we need to compute an approximation of the gradient \(\nabla _{{\varvec{\phi }}} \ell _{u}\), for a general \(u \ge 0\). A key step in this process is to approximate the integral \(I(\cdot )\), as well as a subgradient of I, at an arbitrary \({\varvec{\phi }}\in \mathbb {R}^n\). [20] provide explicit formulae for these quantities, based on a triangulation of \(C_n\), using tools from computational geometry. For practical purposes, [21] apply a Taylor expansion to approximate the analytic expression. The R package LogConcDEAD [18] uses this method to evaluate the exact integral at each iteration, but since this is time-consuming, we will only use this method at the final stage of our proposed algorithm as a polishing step.

An alternative approach is to use numerical integration. Among deterministic schemes, [57] observed empirically that the simple Riemann sum with uniform weights appears to perform the best among several multi-dimensional integration techniques. Random (Monte Carlo) approaches to approximate the integral are also possible: given a collection of grid points \({\mathcal {S}}=\{{\varvec{\xi }}_1,\ldots ,{\varvec{\xi }}_m\}\), we approximate the integral as \(I_{{\mathcal {S}}}({\varvec{\phi }}):= ({\varDelta }/{m})\sum _{\ell =1}^m \exp \{-\textrm{cef}[{\varvec{\phi }}]({\varvec{\xi }}_\ell )\}.\) This leads to an approximation of the objective f given by

$$\begin{aligned} f({\varvec{\phi }}) \approx \frac{1}{n}{\varvec{1}}^\top {\varvec{\phi }}+ I_{{\mathcal {S}}}({\varvec{\phi }}) =: f_{{\mathcal {S}}}({\varvec{\phi }}). \end{aligned}$$
(14)

Since \(f_{{\mathcal {S}}}\) is a finite, convex function on \(\mathbb {R}^n\), it has a subgradient at each \({\varvec{\phi }}\in \mathbb {R}^n\), given by

$$\begin{aligned} {\varvec{g}}_{{\mathcal {S}}}({\varvec{\phi }}):= \frac{1}{m}\sum _{\ell =1}^m {\varvec{G}}({\varvec{\phi }},{\varvec{\xi }}_\ell ). \end{aligned}$$

As the effective domain of \(\textrm{cef}[{\varvec{\phi }}](\cdot )\) is \(C_{n}\), we consider grid points \({\mathcal {S}} \subseteq C_{n}\).
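The sketch below (helper names ours) combines these pieces: given grid points in \(C_n\) (deterministic or Monte Carlo), it returns the approximation \(f_{{\mathcal {S}}}({\varvec{\phi }})\) in (14) and the averaged subgradient \({\varvec{g}}_{{\mathcal {S}}}({\varvec{\phi }})\), with each \(\textrm{cef}[{\varvec{\phi }}]({\varvec{\xi }}_\ell )\) evaluated by solving (\(Q_0\)) via scipy; the argument volume again stands for \(\varDelta \).

# A sketch of the grid approximations f_S(phi) in (14) and g_S(phi), assuming
# the grid points (rows of `grid`) already lie in C_n; each cef value comes
# from one LP solve of (Q_0). Helper names are ours, not the package's.
import numpy as np
from scipy.optimize import linprog

def approx_objective_and_subgrad(phi, X, grid, volume):
    n = X.shape[0]
    I_S, g_S = 0.0, np.zeros(n)
    for xi in grid:
        A_eq = np.vstack([X.T, np.ones((1, n))])
        b_eq = np.append(xi, 1.0)
        res = linprog(phi, A_eq=A_eq, b_eq=b_eq,
                      bounds=[(0, None)] * n, method="highs")
        w = volume * np.exp(-res.fun)             # Delta * exp(-cef[phi](xi))
        I_S += w
        g_S += np.ones(n) / n - w * res.x         # G(phi, xi) from (8)
    m = len(grid)
    f_S = phi.mean() + I_S / m                    # (1/n) 1^T phi + I_S(phi)
    return f_S, g_S / m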

We now illustrate how these ideas allow us to approximate the gradient of the smoothing sequence, and initially consider Nesterov smoothing, with \(\ell _u={\tilde{f}}_u\). If \({\mathcal {S}}=\{{\varvec{\xi }}_1,\ldots ,{\varvec{\xi }}_m\} \subseteq C_n\) denotes a collection of grid points (either deterministic or Monte Carlo based), then \(\nabla _{{\varvec{\phi }}} \ell _{u}\) can be approximated by \(\tilde{{\varvec{g}}}_{u,{\mathcal {S}}}\), where

$$\begin{aligned} \tilde{{\varvec{g}}}_{u,{\mathcal {S}}}({\varvec{\phi }}):= \frac{1}{n}{\varvec{1}}-\frac{\varDelta }{m}\sum _{j=1}^me^{-q_u[{\varvec{\phi }}]({\varvec{\xi }}_j)}\alpha ^*_u[{\varvec{\phi }}]({\varvec{\xi }}_j). \end{aligned}$$
(15)

In fact, we distinguish the cases of deterministic and random \({\mathcal {S}}\) by writing this approximation as \(\tilde{{\varvec{g}}}_{u,{\mathcal {S}}}^{\textrm{D}}\) and \(\tilde{{\varvec{g}}}_{u,{\mathcal {S}}}^{\textrm{R}}\) respectively.

For the randomized smoothing method with \(\ell _u={\bar{f}}_u\), the approximation is slightly more involved. Given m grid points \({\mathcal {S}}=\{{\varvec{\xi }}_1,\ldots ,{\varvec{\xi }}_m\} \subseteq C_n\) (again either deterministic or random), and independent random vectors \({\varvec{z}}_1,\ldots ,{\varvec{z}}_m\), each uniformly distributed on the unit Euclidean ball in \(\mathbb {R}^n\), we can approximate \(\nabla _{{\varvec{\phi }}} \ell _{u}({\varvec{\phi }})\) by

$$\begin{aligned} \bar{{\varvec{g}}}_{u,{\mathcal {S}}}^{\circ }({\varvec{\phi }}) = \frac{1}{n}{\varvec{1}}-\frac{\varDelta }{m}\sum _{j=1}^me^{-\textrm{cef}[{\varvec{\phi }}+u{\varvec{z}}_j]({\varvec{\xi }}_j)}\alpha ^*[{\varvec{\phi }}+u{\varvec{z}}_j]({\varvec{\xi }}_j),\end{aligned}$$
(16)

with \(\circ \in \{\textrm{D},\textrm{R}\}\) again distinguishing the cases of deterministic and random \({\mathcal {S}}\).

3.2.2 Related literature

As mentioned above, Algorithm 1 is an accelerated version of the dual averaging method of [53], which to the best of our knowledge has not been studied in the context of log-concave density estimation previously. Nevertheless, related ideas have been considered for other optimization problems (e.g. [25, 70]). Relative to previous work, our approach is quite general, in that it applies to both of the smoothing techniques discussed in Sect. 3.1, and allows the use of both deterministic and random grids to approximate the gradients of the smoothing sequence. Another key difference with earlier work is that we allow the grid \({\mathcal {S}}\) to depend on t, so we write it as \({\mathcal {S}}_t\), with \(m_t:= |{\mathcal {S}}_t|\); in particular, inspired by both our theoretical results and numerical experience, we take \((m_{t})\) to be a suitable increasing sequence.

4 Theoretical analysis of optimization error of Algorithm 1

We have seen in Propositions 3 and 4 that the two smooth functions \({\tilde{f}}_{u}\) and \({\bar{f}}_{u}\) enjoy similar properties — according to Proposition 3(a) to (c) and Proposition 4(a) to (d), both \({\tilde{f}}_{u}\) and \({\bar{f}}_{u}\) satisfy the following assumption:

Assumption 1

[Assumptions on smoothing sequence] There exists \(r \ge 0\) such that for any \({\varvec{\phi }}\in {{\varvec{\varPhi }}}\),

(a) we can find \(B_0 > 0\) with \(f({\varvec{\phi }})\le \ell _{u}({\varvec{\phi }})\le f({\varvec{\phi }})+B_0I({\varvec{\phi }})u\) for all \(u \in [0,r]\);

(b) \(\ell _{u'}({\varvec{\phi }})\le \ell _u({\varvec{\phi }})\) for all \(u' \in [0,u]\);

(c) for each \(u \in [0,r]\), the function \({\varvec{\phi }}\mapsto \ell _u({\varvec{\phi }})\) is convex and has \({B_1}/{u}\)-Lipschitz gradient, for some \(B_1>0\).

Recall from Sect. 3 that we have four possible choices corresponding to a combination of the smoothing and integral approximation methods, as summarized in Table 1.

Table 1 Summary of options for smoothing and gradient approximation methods
Option 1: Nesterov smoothing, deterministic grid (gradient approximation \(\tilde{{\varvec{g}}}_{u,{\mathcal {S}}}^{\textrm{D}}\))
Option 2: Nesterov smoothing, random (Monte Carlo) grid (\(\tilde{{\varvec{g}}}_{u,{\mathcal {S}}}^{\textrm{R}}\))
Option 3: Randomized smoothing, deterministic grid (\(\bar{{\varvec{g}}}_{u,{\mathcal {S}}}^{\textrm{D}}\))
Option 4: Randomized smoothing, random (Monte Carlo) grid (\(\bar{{\varvec{g}}}_{u,{\mathcal {S}}}^{\textrm{R}}\))

Once we select an option, in line 3 of Algorithm 1, we can take

$$\begin{aligned} {\varvec{g}}_t=\check{{\varvec{g}}}_{u_t,{\mathcal {S}}_t}^{\circ }({\varvec{\phi }}_t^{(y)}), \end{aligned}$$

where \(\check{~} \in \{\tilde{~},~\bar{~}\}\) and \(\circ \in \{\textrm{D},\textrm{R}\}\). To encompass all four approximation choices in Line 3 of Algorithm 1, we make the following assumption on the gradient approximation error \({\varvec{e}}_t:= {\varvec{g}}_t - \nabla _{{\varvec{\phi }}} \ell _{u_t}({\varvec{\phi }}_t^{(y)})\):

Assumption 2

[Gradient approximation error] There exists \(\sigma > 0\) such that

$$\begin{aligned} \mathbb {E}\bigl (\Vert {\varvec{e}}_t\Vert ^2|{\mathcal {F}}_{t-1}\bigr ) \le {\sigma ^2}/{m_t}~~~~\text {for all } t \in \mathbb {N}_0, \end{aligned}$$
(17)

where \({\mathcal {F}}_{t-1}\) denotes the \(\sigma \)-algebra generated by all random sources up to iteration \(t-1\) (with \({\mathcal {F}}_{-1}\) denoting the trivial \(\sigma \)-algebra).

When \({\mathcal {S}}\) is a Monte Carlo random grid (options 2 and 4), the approximate gradient \({\varvec{g}}_t\) is an average of \(m_t\) independent and identically distributed random vectors, each being an unbiased estimator of \(\nabla \ell _{u_t}({\varvec{\phi }}_t^{(y)})\). Hence, (17) holds true with \(\sigma ^2\) determined by the bounds in Proposition 3(d) (option 2) and Proposition 4(e) (option 4). For a deterministic Riemann grid \({\mathcal {S}}\) and Nesterov’s smoothing technique (option 1), \({\varvec{e}}_t\) is deterministic, and arises from using \(\tilde{{\varvec{g}}}_{u,{\mathcal {S}}}({\varvec{\phi }})\) in (15) to approximate \(\nabla _{{\varvec{\phi }}} {\tilde{f}}_u({\varvec{\phi }}) =\mathbb {E}[\tilde{{\varvec{G}}}_u({\varvec{\phi }},{\varvec{\xi }})]\). For the deterministic Riemann grid and randomized smoothing (option 3), the error \({\varvec{e}}_{t}\) can be decomposed into a random estimation error term (induced by \({\varvec{z}}_1,\ldots ,{\varvec{z}}_{m_t}\)) and a deterministic approximation error term (induced by \({\varvec{\xi }}_1,\ldots ,{\varvec{\xi }}_{m_t}\)) as follows:

$$\begin{aligned} {\varvec{e}}_t&=\frac{1}{m_t}\sum _{j=1}^{m_t}\bigl ({\varvec{G}}({\varvec{\phi }}_t^{(y)}+u_t{\varvec{z}}_j,{\varvec{\xi }}_j)-\mathbb {E}[{\varvec{G}}({\varvec{\phi }}_t^{(y)}+u_t{\varvec{z}},{\varvec{\xi }}_j)|{\mathcal {F}}_{t-1}]\bigr ) \\&\hspace{1cm}+\biggl (\frac{1}{m_t}\sum _{j=1}^{m_t}\mathbb {E}[{\varvec{G}}({\varvec{\phi }}_t^{(y)}+u_t{\varvec{z}},{\varvec{\xi }}_j)|{\mathcal {F}}_{t-1}]-\mathbb {E}[{\varvec{G}}({\varvec{\phi }}_t^{(y)}+u_t{\varvec{z}},{\varvec{\xi }})|{\mathcal {F}}_{t-1}]\biggr ). \end{aligned}$$

It can be shown using this decomposition that \(\mathbb {E}(\Vert {\varvec{e}}_t\Vert ^2|{\mathcal {F}}_{t-1}) = O(1/m_t)\) under regularity conditions.

Theorem 1 below establishes our desired computational guarantees for Algorithm 1. We write \(D:=\sup _{{\varvec{\phi }},{\tilde{{\varvec{\phi }}}}\in {{\varvec{\varPhi }}}} \Vert {\varvec{\phi }}-{\tilde{{\varvec{\phi }}}} \Vert \) for the diameter of \({\varvec{\varPhi }}\).

Theorem 1

Suppose that Assumptions 1 and 2 hold, and define the sequence \((\theta _t)_{t \in \mathbb {N}_0}\) by \(\theta _0:= 1\) and \(\theta _{t+1}:= 2\bigl (1+\sqrt{1+4/\theta _t^2}\bigr )^{-1}\) for \(t \in \mathbb {N}_0\). Let \(u > 0\), let \(u_t:= \theta _t u\) and take \(L_t=B_1/u_t\) and \(\eta _t=\eta \) for all \(t \in \mathbb {N}_0\) as input parameters to Algorithm 1. Writing \(M_T^{(1)}:=\sqrt{\sum _{t=0}^{T-1}m_t^{-1}}\) and \(M_T^{(1/2)}:=\sum _{t=0}^{T-1}m_t^{-1/2}\), we have for any \({\varvec{\phi }}\in {{\varvec{\varPhi }}}\) that

$$\begin{aligned} \mathbb {E}[f({\varvec{\phi }}_T^{(x)})]-f({\varvec{\phi }})\le \frac{B_1D^2}{Tu}+\frac{4B_0I({\varvec{\phi }})u}{T}+\frac{\eta D^2}{T}+\frac{\sigma ^2(M_T^{(1)})^2}{T\eta }+\frac{2D\sigma M_T^{(1/2)}}{T}. \end{aligned}$$
(18)

In particular, taking \({\varvec{\phi }}={\varvec{\phi }}^*\), and choosing \(u=({D}/{2})\sqrt{B_1/B_0}\) and \(\eta =({\sigma M_T^{(1)}})/{D}\), we obtain

$$\begin{aligned} \varepsilon _{T}:= \mathbb {E}[f({\varvec{\phi }}_T^{(x)})]-f({\varvec{\phi }}^*)\le \frac{4\sqrt{B_0B_1}D}{T}+\frac{2\sigma DM_T^{(1)}}{T}+\frac{2D\sigma M_T^{(1/2)}}{T}. \end{aligned}$$
(19)

Moreover, if we further assume that \(\mathbb {E}({\varvec{e}}_t|{\mathcal {F}}_{t-1})={\varvec{0}}\) (e.g. by using options 2 and 4), then we can remove the last term of both inequalities above.

For related general results that control the expected optimization error for smoothed objective functions, see, e.g., [52, 67, 25, 70]. With deterministic grids (corresponding to options 1 and 3), if we take \(|{\mathcal {S}}_{t}| = m\) for all t, then \(M_T^{(1/2)}=T/\sqrt{m}\), and the upper bound in (19) does not converge to zero as \(T \rightarrow \infty \). On the other hand, if we take \(|{\mathcal {S}}_t|=t^2\), for example, then \(\sup _{T \in \mathbb {N}} M_T^{(1)} < \infty \) and \(M_T^{(1/2)}= {\tilde{O}}(1)\), and we find that \(\varepsilon _{T} = {\tilde{O}}(1/T)\). For random grids (options 2 and 4), if we take \(|{\mathcal {S}}_{t}| = m\) for all t, then \(M_T^{(1)}=\sqrt{T/m}\) and we recover the \(\varepsilon _{T} = O(1/\sqrt{T})\) rate for stochastic subgradient methods [56]. This can be improved to \(\varepsilon _{T} ={\tilde{O}}(1/T)\) with \(m_t = t\), or even \(\varepsilon _{T} = O(1/T)\) if we choose \((m_t)_t\) such that \(\sum _{t=0}^\infty m_t^{-1} < \infty \).
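As a purely illustrative numerical check (not part of the paper's experiments), the snippet below computes the \(\theta _t\) recursion from Theorem 1 and the quantities \(M_T^{(1)}\) and \(M_T^{(1/2)}\) for a fixed grid size and for the increasing schedule \(m_t=(t+1)^2\) (shifted by one so that \(m_0 \ge 1\)), illustrating how the increasing schedule keeps both quantities bounded up to logarithmic factors.

# Illustrative check (not from the paper) of the sequences appearing in
# Theorem 1: the theta_t recursion, and M_T^{(1)}, M_T^{(1/2)} for a fixed
# grid size versus the increasing schedule m_t = (t + 1)^2.
import numpy as np

T = 1000
theta = np.empty(T)
theta[0] = 1.0
for t in range(T - 1):
    theta[t + 1] = 2.0 / (1.0 + np.sqrt(1.0 + 4.0 / theta[t] ** 2))
print("theta_T is roughly 2/T:", theta[-1], 2.0 / T)

for name, m in [("fixed m_t = 100", np.full(T, 100.0)),
                ("increasing m_t = (t+1)^2", (np.arange(T) + 1.0) ** 2)]:
    M1 = np.sqrt(np.sum(1.0 / m))        # M_T^{(1)}
    M_half = np.sum(1.0 / np.sqrt(m))    # M_T^{(1/2)}
    print(name, "M_T^(1) =", round(M1, 3), "M_T^(1/2) =", round(M_half, 3))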

A direct application of the theory of [25] would yield an error rate of \(\varepsilon _{T} = O(n^{1/4}/T + 1/\sqrt{T})\). On the other hand, Theorem 1 shows that, owing to the increasing sequence of grid sizes used to approximate the gradients in Step 3 of Algorithm 1, we can improve this rate to \({\tilde{O}}(1/T)\). Note, however, that this improvement is in terms of the number of iterations T, and not the total number of stochastic oracle queries (equivalently, the total number of LPs (\(Q_0\))), which is given by \(T_{\textrm{query}}:=\sum _{t=0}^{T-1}m_t\). [1] and [51] have shown that the optimal expected optimization error after \(T_{\textrm{query}}\) stochastic oracle queries is of order \(1/\sqrt{T_{\textrm{query}}}\), and that this is attained by the algorithm of [25]. For our framework, by taking \(m_t=t\), we have \(T_{\textrm{query}}=\sum _{t=0}^{T-1} m_t={\tilde{O}}(T^2)\), so after \(T_{\textrm{query}}\) stochastic oracle queries, our algorithm also attains the optimal error on the objective function scale, up to a logarithmic factor. Other advantages of our algorithm and the theoretical guarantees provided by Theorem 1 relative to the earlier contributions of [25] are that we do not require an upper bound on \(I({\varvec{\phi }})\) and are able to provide a unified framework that includes Nesterov smoothing and an alternative gradient approximation approach by numerical integration, in addition to the randomized smoothing scheme with stochastic gradients. Moreover, we can exploit the specific structure of the log-concave density estimation problem to provide much better Lipschitz constants for the randomized smoothing sequence than would be obtained using the generic constants of [25]. For example, our upper bound in Proposition 4(a) is of order \(O(n^{-1/2}\log ^{1/2}n)\), whereas a naive application of the general theory of [25] would only yield a bound of O(1). A further improvement in our bound comes from the fact that it now involves \(I({\varvec{\phi }})\) directly, as opposed to an upper bound on this quantity.

In Theorem 1, the computational guarantee depends upon \(B_0,B_1,\sigma \) in Assumptions 1 and 2. In light of Propositions 3 and 4, Table 2 illustrates how these quantities, and hence the corresponding guarantees, differ according to whether we use Nesterov smoothing or randomized smoothing.

The randomized smoothing procedure requires solving LPs, whereas Nesterov’s smoothing technique requires solving QPs. While both of these problems are highly structured and can be solved efficiently by off-the-shelf solvers (e.g., [36]), we found the LP solution times to be faster than those for the QP. Additional computational details are discussed in Sect. 6.

Table 2 Comparison of constants in Assumption 1 for different smoothing schemes with \(u \in [0,r]\)

Note that Theorem 1 presents error bounds in expectation, though for option 1, since we use Nesterov’s smoothing technique and the Riemann sum approximation of the integral, the guarantee in Theorem 1 holds without the need to take an expectation. Theorem 2 below presents corresponding high-probability guarantees. For simplicity, we present results for options 2 and 4, which rely on the following assumption:

Assumption 3

Assume that \(\mathbb {E}({\varvec{e}}_t|{\mathcal {F}}_{t-1})={\varvec{0}}\) and that \(\mathbb {E}\bigl (e^{\Vert {\varvec{e}}_t\Vert ^2/\sigma _t^2} \mid {\mathcal {F}}_{t-1}\bigr ) \le e\), where \(\sigma _t=\sigma /\sqrt{m_t}\).

Theorem 2

Suppose that Assumptions 1 and 3 hold, and define the sequence \((\theta _t)_{t \in \mathbb {N}_0}\) by \(\theta _0:= 1\) and \(\theta _{t+1}:= 2\bigl (1+\sqrt{1+4/\theta _t^2}\bigr )^{-1}\) for \(t \in \mathbb {N}_0\). Let \(u > 0\), let \(u_t:= \theta _t u\) and take \(L_t=B_1/u_t\) and \(\eta _t=\eta \) for all \(t \in \mathbb {N}_0\) as input parameters to Algorithm 1. Writing \(M_T^{(2)}:=\sqrt{\sum _{t=0}^{T-1}m_t^{-2}}\) and \(M_T^{(1)}:=\sqrt{\sum _{t=0}^{T-1}m_t^{-1}}\), and choosing \(u=({D}/{2})\sqrt{B_1/B_0}\) and \(\eta =({\sigma M_T^{(1)}})/{D}\) as in Theorem 1, for any \(\delta \in (0,1)\), we have with probability at least \(1-\delta \) that

$$\begin{aligned} f({\varvec{\phi }}_T^{(x)})-f({\varvec{\phi }}^*)&\le \frac{2\sqrt{B_0B_1}D}{T}+\frac{\sigma DM_T^{(1)}}{T}+\frac{4\sigma DM_T^{(1)}\sqrt{\log \frac{2}{\delta }}}{T}\\&\hspace{2.5cm}+\frac{4\sigma D\max \bigl \{M_T^{(2)}\sqrt{2e\log \frac{2}{\delta }},m_0^{-1}\log \frac{2}{\delta }\bigr \}}{M_T^{(1)} T}. \end{aligned}$$

For option 3, we would need to consider the approximation error from the Riemann sum, and the final error rate would include additional O(1/T) terms. We omit the details for brevity.

Finally in this section, we relate the error of the objective to the error in terms of \({\varvec{\phi }}\), as measured through the squared \(L_2\) distance between the corresponding lower convex envelope functions.

Theorem 3

For any \({\varvec{\phi }}\in {{\varvec{\varPhi }}}\), we have

$$\begin{aligned} \int _{C_n}\bigl \{\textrm{cef}[{\varvec{\phi }}]({\varvec{x}})-\textrm{cef}[{\varvec{\phi }}^*]({\varvec{x}})\bigr \}^2\,d{\varvec{x}}\le 2e^{\phi ^0}\bigl \{f({\varvec{\phi }})-f({\varvec{\phi }}^*)\bigr \}. \end{aligned}$$
(20)

5 Beyond log-concave density estimation

In this section, we extend our computational framework beyond the log-concave density family, through the notion of s-concave densities. For \(s\in \mathbb {R}\), define the domain \({\mathcal {D}}_s\) and the function \(\psi _s:{\mathcal {D}}_s \rightarrow \mathbb {R}\) by

$$\begin{aligned} {\mathcal {D}}_s:=\left\{ \begin{array}{ll}[0,\infty )&{}\text { if } s<0\\ (-\infty ,\infty )&{}\text { if } s= 0\\ (-\infty ,0]&{}\text { if } s>0.\end{array}\right. ~~~~\text {and}~~~ \psi _s(y):= \left\{ \begin{array}{ll}y^{1/s}&{}\text { if } s<0\\ e^{-y}&{}\text { if } s= 0\\ (-y)^{1/s}&{}\text { if } s>0.\end{array}\right. \end{aligned}$$

Definition 1

[s-concave density, [65]] For \(s\in \mathbb {R}\), the class \({\mathcal {P}}_s(\mathbb {R}^d)\) of s-concave density functions on \(\mathbb {R}^d\) is given by

$$\begin{aligned}&{\mathcal {P}}_s(\mathbb {R}^d) \\&\hspace{0.2cm}:= \biggl \{p(\cdot ):p =\psi _s\circ \varphi \ \text{ for } \text{ some } \ \varphi \in {\mathcal {C}}_d \ \text{ with } \ \textrm{Im}(\varphi ) \subseteq {\mathcal {D}}_s \cup \{\infty \}, \int _{\mathbb {R}^d}p =1 \biggr \}. \end{aligned}$$

For \(s=0\), the family of s-concave densities reduces to the family of log-concave densities. Moreover, for \(s_1 < s_2\), we have \({\mathcal {P}}_{s_2}(\mathbb {R}^d)\subseteq {\mathcal {P}}_{s_1}(\mathbb {R}^d)\) [23, p. 86]. The s-concave density family introduces additional modelling flexibility, in particular allowing much heavier tails when \(s < 0\) than the log-concave family, but we note that there is no guidance available in the literature on how to choose s.

For the problem of s-concave density estimation, we discuss two estimation methods, both of which have been previously considered in the literature, but for which there has been limited algorithmic development. The first is based on the maximum likelihood principle (Sect. 5.1), while the other is based on minimizing a Rényi divergence (Sect. 5.2).

5.1 Computation of the s-concave maximum likelihood estimator

[65] proved that a maximum likelihood estimator over \({\mathcal {P}}_s(\mathbb {R}^d)\) exists with probability one for \(s\in (-1/d,\infty )\) and \(n > \max \bigl (\frac{dr}{r-d},d\bigr )\), where \(r:= -1/s\), and does not exist if \(s<-1/d\). [24] provide some statistical properties of this estimator when \(d=1\). The maximum likelihood estimation problem is to compute

$$\begin{aligned} {\hat{p}}_n:=\mathop {\textrm{argmax}}\limits _{p\in {\mathcal {P}}_s(\mathbb {R}^d)} \sum _{i=1}^n\log p({\varvec{x}}_i), \end{aligned}$$
(21)

or equivalently,

$$\begin{aligned} \mathop {\textrm{argmax}}\limits _{\varphi \in {\mathcal {C}}_d: \textrm{Im}(\varphi ) \subseteq {\mathcal {D}}_s\cup \{\infty \}}~\frac{1}{n}\sum _{i=1}^n\log \psi _s\circ \varphi ({\varvec{x}}_i) \quad \text {subject to}\quad \int _{\mathbb {R}^d}\psi _s\circ \varphi ({\varvec{x}})~d{\varvec{x}}=1. \qquad \end{aligned}$$
(22)

We establish the following theorem:

Theorem 4

Let \(s \in [0,1]\) and suppose that the convex hull \(C_n\) of the data is d-dimensional (so that the s-concave MLE \({\hat{p}}_n\) exists and is unique). Then computing \({\hat{p}}_n\) in (21) is equivalent to the convex minimization problem of computing

$$\begin{aligned} {\varvec{\phi }}^*:= \mathop {\textrm{argmin}}\limits _{{\varvec{\phi }}= (\phi _1,\ldots ,\phi _n) \in {\mathcal {D}}_s^n}\biggl \{-\frac{1}{n}\sum _{i=1}^n\log \psi _s(\phi _i)+ \int _{C_n}\psi _s\bigl (\textrm{cef}[{\varvec{\phi }}]({\varvec{x}})\bigr )\, d{\varvec{x}}\biggr \},\end{aligned}$$
(23)

in the sense that \({\hat{p}}_n = \psi _s \circ \textrm{cef}[{\varvec{\phi }}^*]\).

Remark 1

The equivalence result in Theorem 4 holds for any s, including s outside [0, 1], as long as the s-concave MLE exists. However, when \(s\in [0,1]\), (23) is a convex optimization problem. The family of s-concave densities with \(s<0\) appears to be more useful from a statistical viewpoint as it allows for heavier tails than log-concave densities, but the MLE cannot then be computed via convex optimization. Nevertheless, the entropy minimization methods discussed in Sect. 5.2 can be used to obtain s-concave density estimates for \(s > -1\).
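For concreteness, the sketch below (helper names ours) evaluates the objective in (23) for \(s \in [0,1]\), transcribing \(\psi _s\) from its definition and approximating the integral by an average over user-supplied grid points in \(C_n\), with \(\textrm{cef}[{\varvec{\phi }}]\) evaluated via the LP (\(Q_0\)) as in Sect. 2; the argument volume again stands for \(\varDelta \).

# A sketch (helper names ours) of evaluating the s-concave MLE objective in
# (23) for s in [0, 1]: -(1/n) sum_i log psi_s(phi_i) plus a grid approximation
# of the integral of psi_s(cef[phi](x)) over C_n, with cef from the LP (Q_0).
import numpy as np
from scipy.optimize import linprog

def psi_s(y, s):
    # psi_s from Sect. 5: exp(-y) for s = 0, and (-y)^{1/s} for s in (0, 1]
    # (where y <= 0); s < 0 is omitted here since (23) is then non-convex.
    return np.exp(-y) if s == 0 else (-y) ** (1.0 / s)

def s_concave_objective(phi, X, grid, volume, s):
    n = X.shape[0]
    loglik_term = -np.mean(np.log(psi_s(phi, s)))   # requires phi in D_s^n
    integral = 0.0
    for xi in grid:
        A_eq = np.vstack([X.T, np.ones((1, n))])
        b_eq = np.append(xi, 1.0)
        res = linprog(phi, A_eq=A_eq, b_eq=b_eq,
                      bounds=[(0, None)] * n, method="highs")
        integral += psi_s(res.fun, s)               # psi_s(cef[phi](xi))
    return loglik_term + volume * integral / len(grid)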

5.2 Quasi-concave density estimation

Another route to estimate an s-concave density (or even a more general class) is via the following problem:

$$\begin{aligned} {\check{\varphi }}:= \mathop {\textrm{argmin}}\limits _{\varphi \in {\mathcal {C}}_d:\textrm{dom}(\varphi ) = C_n} \biggl \{\frac{1}{n}\sum _{i=1}^n\varphi ({\varvec{x}}_i)+\int _{C_n}\varPsi \bigl (\varphi ({\varvec{x}})\bigr )~d{\varvec{x}}\biggr \},\end{aligned}$$
(24)

where \(\varPsi :\mathbb {R}\rightarrow (-\infty ,\infty ]\) is a decreasing, proper convex function. When \(\varPsi (y) = e^{-y}\), (24) is equivalent to the MLE for log-concave density estimation (1), by [20, Theorem 1]. This problem, proposed by [46], is called quasi-concave density estimation. [46, Theorem 4.1] show that under some assumptions on \(\varPsi \), there exists a solution to (24), and if \(\varPsi \) is strictly convex, then the solution is unique. Furthermore, if \(\varPsi \) is differentiable on the interior of its domain, then the optimal solution to the dual of (24) is a probability density p such that \(p=-\varPsi '(\varphi )\), and the dual problem can be regarded as minimizing different distances or entropies (depending on \(\varPsi \)) between the empirical distribution of the data and p. In particular, when \(\beta \ge 1\) and \(\varPsi (y) = \mathbb {1}_{\{y \le 0\}}(-y)^\beta /\beta \), and when \(\beta < 0\) and \(\varPsi (y) = -y^\beta /\beta \) for \(y \ge 0\) (with \(\varPsi (y) = \infty \) otherwise), the dual problem of (24) is essentially minimizing the Rényi divergence and we have the primal-dual relationship \(p=|\varphi |^{\beta -1}\). In fact, this amounts to estimating an s-concave density via Rényi divergence minimization with \(\beta =1+1/s\) and \(s \in (-1,\infty ) {\setminus } \{0\}\). We therefore consider the problem

$$\begin{aligned} \min _{\begin{array}{c} \varphi \in {\mathcal {C}}_d:\textrm{dom}(\varphi ) = C_n\\ \textrm{Im}(\varphi ) \subseteq {\mathcal {D}}_s \end{array}} \biggl \{\frac{1}{n}\sum _{i=1}^n\varphi ({\varvec{x}}_i)+\frac{1}{|1+1/s|}\int _{C_n}|\varphi ({\varvec{x}})|^{1+1/s}~d{\varvec{x}}\biggr \}. \end{aligned}$$
(25)

The proof of Theorem 5 is similar to that of Theorem 4, and is omitted for brevity.

Theorem 5

Given a decreasing proper convex function \(\varPsi \), the quasi-concave density estimation problem (24) is equivalent to the following convex problem:

$$\begin{aligned} {\varvec{\phi }}^*:= \mathop {\textrm{argmin}}\limits _{{\varvec{\phi }}\in {\mathcal {D}}_s^n}\biggl \{\frac{1}{n}{\varvec{1}}^\top {\varvec{\phi }}+ \int _{C_n}\varPsi \bigl (\textrm{cef}[{\varvec{\phi }}]({\varvec{x}})\bigr )~d{\varvec{x}}\biggr \},\end{aligned}$$
(26)

in the sense that \(\check{\varphi } = \textrm{cef}[{\varvec{\phi }}^*]\), with corresponding density estimator \({\tilde{p}}_n = -\varPsi ' \circ \textrm{cef}[{\varvec{\phi }}^*]\).

The objective in (26) is convex, so our computational framework can be applied to solve this problem.

6 Computational experiments on simulated data

In this section, we present numerical experiments to study the different variants of our algorithm and compare them with existing methods based on convex optimization for the log-concave MLE. Our results are based on large-scale synthetic datasets with \(n \in \{5{,}000,10{,}000\}\) observations generated from standard d-dimensional normal and Laplace distributions with \(d=4\). Code for our experiments is available in the GitHub repository LogConcComp at:

https://github.com/wenyuC94/LogConcComp.

All computations were carried out on the MIT Supercloud Cluster [58] on an Intel Xeon Platinum 8260 machine, with 24 CPUs and 24GB of RAM. Our algorithms were written in Python; we used Gurobi [36] to solve the LPs and QPs.

Our first comparison method is that of [20], implemented in the R package LogConcDEAD [18], and denoted by CSS. The CSS algorithm terminates when \(\Vert {\varvec{\phi }}^{(t)} - {\varvec{\phi }}^{(t-1)}\Vert _\infty \le \tau \), and we consider \(\tau \in \{10^{-2},10^{-3},10^{-4}\}\). Our other competing approach is the randomized smoothing method of [25], with random grids of a fixed grid size, which we denote here by RS-RF-m, with m being the grid size. To the best of our knowledge, this method has not been used to compute the log-concave MLE previously.

Fig. 2 Plots on a log-scale of relative objective versus time (mins) [left panel] and number of iterations [right panel]. For each of our four synthetic data sets, we ran five repetitions of each algorithm, so each bold line corresponds to the median of the profiles of the corresponding algorithm, and each thin line corresponds to the profile of one repetition. For the right panel, we show the profiles up to 128 iterations

We denote the different variants of our algorithm by Alg-V, where \(\text {Alg}\in \{\text {RS},\text {NS}\}\) indicates whether Algorithm 1 is run with randomized smoothing or Nesterov smoothing, and \(V\in \{\text {DI},\text {RI}\}\) indicates whether we use deterministic or random grids of increasing sizes to approximate the gradient. Further details of our input parameters are given in Appendix A.3.

Figure 2 presents the relative objective error, defined for an algorithm with iterates \({\varvec{\phi }}_1,\ldots ,{\varvec{\phi }}_t\) as

$$\begin{aligned} \texttt {relobj}(t):= \biggl |\frac{\min _{s\in [t]}f({\varvec{\phi }}_s)-f({\varvec{\phi }}^*)}{f({\varvec{\phi }}^*)}\biggr |, \end{aligned}$$
(27)

against time (in minutes) and number of iterations. In the definition of the relative objective error in (27) above, \({\varvec{\phi }}^*\) is taken as the CSS solution with tolerance \(\tau =10^{-4}\). The figure shows that randomized smoothing appears to outperform Nesterov smoothing in terms of the time taken to reach a desired relative objective error, since the former solves an LP (\(Q_0\)), whereas the latter has to solve a QP (\(Q_u\)); the number of iterations taken by the different methods is, however, similar. There is no clear winner between randomized and deterministic grids, and both appear to perform well.

Table 3 Comparison of our proposed methods with the CSS solution [20] and RS-RF [25]
Table 4 Statistics of the distance between the optimal solution and truth

Table 3 compares our proposed methods against the CSS solutions with different tolerances, in terms of running time, final objective function, and distances of the algorithm outputs to the optimal solution \({\varvec{\phi }}^*\) and the truth \({\varvec{\phi }}^{\text {truth}}\). We find that all of our proposals yield marked improvements in running time compared with the CSS solution: with \(n=10{,}000\), \(d=4\) and \(\tau = 10^{-4}\), CSS takes more than 20 h for all of the data sets we considered, whereas the RS-DI variant is around 50 times faster. The CSS solution may have a slightly improved objective function value on termination, but as shown in Table 3, all of our algorithms achieve an optimization error that is small by comparison with the statistical error, and from a statistical perspective, there is no point in seeking to reduce the optimization error further than this. Table 4 shows that the distances \(\Vert {\varvec{\phi }}^* - {\varvec{\phi }}^{\textrm{truth}}\Vert /n^{1/2}\) are well concentrated around their means (i.e. do not vary greatly over different random samples drawn from the underlying distribution), which provides further reassurance that our solutions are sufficiently accurate for practical purposes. On the other hand, the CSS solution with tolerance \(10^{-3}\) is not always sufficiently reliable in terms of its statistical accuracy, e.g. for a Laplace distribution with \(n=5{,}000\). Our further experiments on real data sets reported in Appendix A.4 provide qualitatively similar conclusions.

Finally, Fig. 3 compares our proposed multistage increasing grid sizes (RS-DI/RS-RI) (see Tables 5 and 6) with the fixed grid size (RS-RF) proposed by [25], under the randomized smoothing setting. We see that the benefits of using the increasing grid sizes as described by our theory carry over to improved practical performance, both in terms of running time and number of iterations.

Fig. 3 Plots on a log-scale of relative objective versus time (mins) [left panel] and number of iterations [right panel]