Topics: Stochastic Gradient, Matching Pursuit, Compressed Sensing, Recommendation Systems

12.1 Online Linear Regression

This section explains the stochastic gradient descent algorithm, which is a technique used in many learning schemes.

Recall that a linear regression finds the parameters a and b that minimize the error

$$\displaystyle \begin{aligned} \sum_{k=1}^K (X_k - a - bY_k)^2, \end{aligned}$$

where the $(X_k, Y_k)$ are observed samples that are i.i.d. with some unknown distribution $f_{X,Y}(x, y)$.

Assume that, instead of calculating the linear regression based on K samples, we keep updating the parameters (a, b) every time we observe a new sample.

Our goal is to find a and b that minimize

$$\displaystyle \begin{aligned} & E\left((X - a - bY)^2\right) \\ &= E\left(X^2\right) + a^2 + b^2 E\left(Y^2\right) - 2a E(X) - 2b E(XY) + 2 abE(Y) \\ & =: h(a, b). \end{aligned} $$

One idea is to use a gradient descent algorithm to minimize h(a, b). Say that at step k of the algorithm, one has calculated (a(k), b(k)). The gradient algorithm would update (a(k), b(k)) in the direction opposite of the gradient, to make h(a(k), b(k)) decrease. That is, the algorithm would compute

$$\displaystyle \begin{aligned} & a(k+1) = a(k) - \alpha \frac{\partial}{\partial a} h(a(k), b(k)) \\ & b(k+1) = b(k) - \alpha \frac{\partial}{\partial b} h(a(k), b(k)), \end{aligned} $$

where α is a small positive number that controls the step size. Thus,

$$\displaystyle \begin{aligned} & a(k+1) = a(k) - \alpha [2a(k) - 2E(X) + 2b(k)E(Y) ] \\ & b(k+1) = b(k) - \alpha [2b(k) E(Y^2) - 2E(XY) + 2a(k)E(Y)]. \end{aligned} $$

However, we do not know the distributions and cannot compute the expected values. Instead, we replace the mean values by the values of the new samples. That is, we compute

$$\displaystyle \begin{aligned} a(k+1) & = a(k) - \alpha [2a(k) - 2X(k+1) + 2b(k)Y(k+1)] \\ b(k+1) & = b(k)- \alpha [2b(k) Y^2(k+1) \\ & \quad - 2X(k+1)Y(k+1) + 2a(k)Y(k+1)]. \end{aligned} $$

That is, instead of using the gradient algorithm we use a stochastic gradient algorithm where the gradient is replaced by a noisy version. The intuition is that, if the step size is small, the errors between the true gradient and its noisy version average out.
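To make this concrete, here is a minimal sketch of the update in Python. It assumes, as a stand-in for example (9.4) discussed next, that Y = X + Z with X and Z independent zero-mean Gaussians, E(X²) = 1 and E(Z²) = 0.3; the sample model and the random seed are illustrative assumptions, not part of the text.

```python
import numpy as np

# Online linear regression by stochastic gradient (a sketch).
# Assumed model (stand-in for example (9.4)): Y = X + Z,
# with E(X^2) = 1 and E(Z^2) = 0.3, so L[X|Y] = Y/1.3 ~ 0.77 Y.
rng = np.random.default_rng(0)
alpha = 0.002                       # step size
a, b = 0.0, 0.0                     # initial parameters

for k in range(20000):
    X = rng.normal(0, 1)
    Z = rng.normal(0, np.sqrt(0.3))
    Y = X + Z
    # Noisy gradient step, matching the displayed update:
    a_new = a - alpha * (2 * a - 2 * X + 2 * b * Y)
    b_new = b - alpha * (2 * b * Y**2 - 2 * X * Y + 2 * a * Y)
    a, b = a_new, b_new

print(a, b)                         # should approach (0, 0.77)
```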

The top part of Fig. 12.1 shows the updates of this algorithm for the example (9.4) with $\alpha = 0.002$, $E(X^2) = 1$, and $E(Z^2) = 0.3$. In this example, we know that the LLSE is

$$\displaystyle \begin{aligned} L[X|Y] = a + bY = \frac{1}{1.3} Y = 0.77Y. \end{aligned}$$

The figure shows that $(a(k), b(k))$ approaches (0, 0.77).

Fig. 12.1 The coefficients “learned” with a stochastic gradient algorithm for (9.4) (top) and (9.5) (bottom)

The bottom part of Fig. 12.1 shows the coefficients for (9.5) with $\gamma = 0.05$, $\alpha = 1$, and $\beta = 6$. We see that $(a(k), b(k))$ approaches (−1, 7), which are the values for the LLSE.

12.2 Theory of Stochastic Gradient Projection

This algorithm is also called ‘stochastic gradient descent’.

In this section, we explain the theory of the stochastic gradient algorithm that we illustrated in the case of online regression. We start with a discussion of the deterministic gradient projection algorithm.

Consider a smooth convex function on a convex set, such as a soup bowl. A standard algorithm to minimize that function, i.e., to find the bottom of the bowl, is the gradient projection algorithm. This algorithm is similar to going downhill by making smaller and smaller jumps along the steepest slope. The projection makes sure that one remains in the acceptable set. The step size of the algorithm decreases over time so that one does not keep on overshooting the minimum.

The stochastic gradient projection algorithm is similar except that one has access only to a noisy version of the gradient. As the step size gets small, the errors in the gradient tend to average out and the algorithm converges to the minimum of the function.

We first review the gradient projection algorithm and then discuss the stochastic gradient projection algorithm.

12.2.1 Gradient Projection

Consider the problem of minimizing a convex differentiable function f(x) on a closed convex subset \(\mathcal {C}\) of \(\Re ^d\). By definition, \(\mathcal {C}\) is a convex set if

$$\displaystyle \begin{aligned} \theta \mathbf{x} + (1 - \theta) \mathbf{y} \in \mathcal{C}, \forall \mathbf{x}, \mathbf{y} \in \mathcal{C} \mbox{ and } \theta \in (0, 1). \end{aligned} $$
(12.1)

That is, \(\mathcal {C}\) contains the line segment between any two of its points. Intuitively, the set has no holes or indentations (Fig. 12.2).

Fig. 12.2 A non-convex set (left) and a convex set (right)

Also, recall that a function \(f : \mathcal {C} \to \Re \) is a convex function if (Fig. 12.3)

$$\displaystyle \begin{aligned} f(\theta \mathbf{x} + (1 - \theta) \mathbf{y}) \leq \theta f(\mathbf{x}) + (1 - \theta) f(\mathbf{y}), \forall \mathbf{x}, \mathbf{y} \in \mathcal{C} \mbox{ and } \theta \in (0, 1).\end{aligned} $$
(12.2)

Fig. 12.3 A non-convex function (top) and a convex function (bottom)

A standard algorithm is gradient projection (GP):

$$\displaystyle \begin{aligned} {\mathbf{x}}_{n+1} = [{\mathbf{x}}_n - \alpha_n \nabla f({\mathbf{x}}_n)]_{\mathcal{C}}, \mbox{ for } n \geq 0. \end{aligned}$$

Here,

$$\displaystyle \begin{aligned} \nabla f(\mathbf{x}) := \left[ \frac{\partial}{\partial x_1} f(\mathbf{x}), \ldots, \frac{\partial}{\partial x_d} f(\mathbf{x})\right]' \end{aligned}$$

is the gradient of f(⋅) at x and \([\mathbf {y}]_{\mathcal {C}}\) indicates the closest point to y in \(\mathcal {C}\), also called the projection of y onto \(\mathcal {C}\). The constants $\alpha_n > 0$ are called the step sizes of the algorithm.

As a simple example, let $f(x) = 6(x - 0.2)^2$ for \(x \in \mathcal {C} := [0, 1]\). The factor 6 is there only to produce big steps initially and show the necessity of projecting back into the convex set. With $\alpha_n = 1/n$ and $x_0 = 0$, the algorithm is

$$\displaystyle \begin{aligned} x_{n+1} = \left[x_n - \frac{12}{n}(x_n - 0.2)\right]_{\mathcal{C}}.\end{aligned} $$
(12.3)

Equivalently,

$$\displaystyle \begin{aligned} & y_{n+1} = x_n - \frac{12}{n}(x_n - 0.2) {} \end{aligned} $$
(12.4)
$$\displaystyle \begin{aligned} & x_{n+1} = \max\{0, \min \{1, y_{n+1}\}\} {}\end{aligned} $$
(12.5)

with $y_0 = x_0$.

As Fig. 12.4 shows, when the step size is large, the update $y_{n+1}$ falls outside the set \(\mathcal {C}\) and is projected back into that set. Eventually, the updates fall into the set \(\mathcal {C}\).
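Iterations (12.4) and (12.5) take only a few lines of Python; this sketch uses an arbitrary iteration count:

```python
# Gradient projection (12.4)-(12.5) for f(x) = 6(x - 0.2)^2 on C = [0, 1].
x = 0.0
for n in range(1, 51):
    y = x - (12 / n) * (x - 0.2)     # gradient step with alpha_n = 1/n
    x = max(0.0, min(1.0, y))        # projection onto C
print(x)                             # converges to the minimizer x* = 0.2
```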

Fig. 12.4 The gradient projection algorithm (12.4) and (12.5)

There are many known sufficient conditions that guarantee that the algorithm converges to the unique minimizer of f(⋅) on \(\mathcal {C}\). Here is an example.

Theorem 12.1

Assume that f(x) is convex and differentiable on the convex set \(\mathcal {C}\) and such that

$$\displaystyle \begin{aligned} & f(x) \mbox{ has a unique minimizer {$x^*$} in } \mathcal{C} {} \end{aligned} $$
(12.6)
$$\displaystyle \begin{aligned} & || \nabla f(x) ||{}^2 \leq K, \forall x \in \mathcal{C} {} \end{aligned} $$
(12.7)
$$\displaystyle \begin{aligned} & \sum_n \alpha_n = \infty \mathit{\mbox{ and }} \sum_n \alpha_n^2 < \infty. {} \end{aligned} $$
(12.8)

Then

$$\displaystyle \begin{aligned} x_n \rightarrow x^* \mathit{\mbox{ as }} n \rightarrow \infty. \end{aligned}$$

\({\blacksquare }\)

Proof

The idea of the proof is as follows. Let \(d_n = \frac {1}{2}||x_n - x^*||{ }^2\). Fix $\epsilon > 0$. One shows that there is some $n_0(\epsilon)$ so that, when $n \geq n_0(\epsilon)$,

$$\displaystyle \begin{aligned} & d_{n+1} \leq d_n - \gamma_n, \mbox{ if } d_n \geq \epsilon {} \end{aligned} $$
(12.9)
$$\displaystyle \begin{aligned} & d_{n+1} \leq 2 \epsilon, \mbox{ if } d_n < \epsilon. {} \end{aligned} $$
(12.10)

Moreover, in (12.9), $\gamma_n > 0$ and $\sum_n \gamma_n = \infty$.

It follows from (12.9) that, eventually, for some $n = n_1(\epsilon) \geq n_0(\epsilon)$, one has $d_n < \epsilon$. But then, because of (12.9) and (12.10), $d_n < 2\epsilon$ for all $n \geq n_1(\epsilon)$. Since $\epsilon > 0$ is arbitrary, this proves that $x_n \rightarrow x^*$.

To show (12.9) and (12.10), we first claim that

$$\displaystyle \begin{aligned} d_{n+1} \leq d_n + \alpha_n (x^* - x_n)^T \nabla f(x_n) + \frac{1}{2} \alpha_n^2 K. \end{aligned} $$
(12.11)

To see this, note that

$$\displaystyle \begin{aligned} & d_{n+1} = \frac{1}{2} ||[x_n - \alpha_n \nabla f(x_n)]_{\mathcal{C}} - x^*||{}^2 \\ &~~~~~~~ \leq \frac{1}{2} ||x_n - \alpha_n \nabla f(x_n) - x^*||{}^2 {} \end{aligned} $$
(12.12)
$$\displaystyle \begin{aligned} &~~~~~~~ \leq d_n + \alpha_n (x^* - x_n)^T \nabla f(x_n) + \frac{1}{2} \alpha_n^2 K. {} \end{aligned} $$
(12.13)

The inequality in (12.12) comes from the fact that projection on a convex set is non-expansive. That is,

$$\displaystyle \begin{aligned} \|[x]_{\mathcal{C}} - [y]_{\mathcal{C}}\| \leq \|x - y\|. \end{aligned}$$

This property is clear from a picture (see Fig. 12.5) and is not difficult to prove.

Fig. 12.5 Projection on a convex set is non-expansive

Observe that $\alpha_n \rightarrow 0$, because \(\sum _n \alpha _n^2 < \infty \). Hence, (12.13) and (12.7) imply (12.10).

It remains to show (12.9). As Fig. 12.6 shows, the convexity of f(⋅) implies that

$$\displaystyle \begin{aligned} (x^* - x )^T \nabla f(x) \leq f(x^*) - f(x). \end{aligned} $$
(12.14)

Also, if $d_n \geq \epsilon$, one has $f(x^*) - f(x_n) \leq - \delta(\epsilon)$, for some $\delta(\epsilon) > 0$. Thus, whenever $d_n \geq \epsilon$, one has

$$\displaystyle \begin{aligned} (x^* - x_n )^T \nabla f(x_n) \leq - \delta (\epsilon). \end{aligned}$$

Together with (12.11), this implies

$$\displaystyle \begin{aligned} d_{n+1} \leq d_n - \alpha_n \delta(\epsilon) + \frac{1}{2} \alpha_n^2 K. \end{aligned}$$

Fig. 12.6 The inequality (12.14)

Now, let

$$\displaystyle \begin{aligned} \gamma_n = \alpha_n \delta(\epsilon) - \frac{1}{2} \alpha_n^2 K. \end{aligned} $$
(12.15)

Since $\alpha_n \rightarrow 0$, there is some $n_2(\epsilon)$ such that $\gamma_n > 0$ for $n \geq n_2(\epsilon)$. Moreover, (12.8) is seen to imply that $\sum_n \gamma_n = \infty$. This proves (12.9) after replacing $n_0(\epsilon)$ by \(\max \{n_0(\epsilon ), n_2(\epsilon )\}\). □

12.2.2 Stochastic Gradient Projection

There are many situations where one cannot directly measure the gradient $\nabla f({\mathbf{x}}_n)$ of the function. Instead, one has access to a random estimate $\nabla f({\mathbf{x}}_n) + \eta_n$ of that gradient, where $\eta_n$ is a random variable. One hopes that, if the error $\eta_n$ is small enough, GP still converges to ${\mathbf{x}}^*$ when one uses $\nabla f({\mathbf{x}}_n) + \eta_n$ instead of $\nabla f({\mathbf{x}}_n)$. The point of this section is to justify this hope.

The algorithm is as follows (see Fig. 12.7):

$$\displaystyle \begin{aligned} {\mathbf{x}}_{n+1} = \left[{\mathbf{x}}_n - \alpha_n {\mathbf{g}}_n\right]_{\mathcal{C}}, \end{aligned} $$
(12.16)

where

$$\displaystyle \begin{aligned} {\mathbf{g}}_n = \nabla f({\mathbf{x}}_n) + {\mathbf{z}}_n + {\mathbf{b}}_n \end{aligned} $$
(12.17)

is a noisy estimate of the gradient. In (12.17), ${\mathbf{z}}_n$ is a zero-mean random variable that models the estimation noise and ${\mathbf{b}}_n$ is a constant that models the estimation bias.

Fig. 12.7 Level curves of f(⋅) and the convex set \(\mathcal {C}\), with the first few iterations of GP in red and of stochastic GP in blue

As a simple example, let $f(x) = 6(x - 0.2)^2$ for \(x \in \mathcal {C} := [0, 1]\). With $\alpha_n = 1/n$, $b_n = 0$, and $x_0 = 0$, the algorithm is

$$\displaystyle \begin{aligned} x_{n+1} = \left[x_n - \frac{12}{n}(x_n - 0.2 + z_n)\right]_{\mathcal{C}}. \end{aligned} $$
(12.18)

In this expression, the $z_n$ are i.i.d. U[−0.5, 0.5]. Figure 12.8 shows the values that the algorithm produces.
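A sketch of iteration (12.18) in Python (the iteration count and the seed are arbitrary choices):

```python
import numpy as np

# Stochastic gradient projection (12.18) with i.i.d. U[-0.5, 0.5] noise z_n.
rng = np.random.default_rng(0)
x = 0.0
for n in range(1, 5001):
    z = rng.uniform(-0.5, 0.5)
    y = x - (12 / n) * (x - 0.2 + z)   # noisy gradient step
    x = max(0.0, min(1.0, y))          # projection onto C = [0, 1]
print(x)                               # approaches x* = 0.2, albeit slowly
```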

Fig. 12.8 The stochastic gradient projection algorithm (12.18)

This algorithm converges to the minimum $x^* = 0.2$ of the function, albeit slowly.

For the algorithm (12.16) and (12.17) to converge, one needs the estimation noise ${\mathbf{z}}_n$ and bias ${\mathbf{b}}_n$ to be small. Specifically, one has the following result.

Theorem 12.2

Assume that \(\mathcal {C}\) is bounded and

$$\displaystyle \begin{aligned} & f(.) \mathit{\mbox{ has a unique minimizer }} {\mathbf{x}}^* \mathit{\mbox{ in }} \mathcal{C}; {} \end{aligned} $$
(12.19)
$$\displaystyle \begin{aligned} & || \nabla f(\mathbf{x}) ||{}^2 \leq K, \forall x \in \mathcal{C}; {} \end{aligned} $$
(12.20)
$$\displaystyle \begin{aligned} & \alpha_n > 0, \sum_n \alpha_n = \infty, \sum_n \alpha_n^2 < \infty. {} \end{aligned} $$
(12.21)

In addition, assume that

$$\displaystyle \begin{aligned} & \sum_{n=0}^\infty \alpha_n ||{\mathbf{b}}_n|| < \infty; {} \end{aligned} $$
(12.22)
$$\displaystyle \begin{aligned} & E[ z_{n+1} \mid z_0, z_1 , \ldots , z_n ] = 0; {} \end{aligned} $$
(12.23)
$$\displaystyle \begin{aligned} & E(|| z_n ||{}^2 ) \leq A, n \geq 0. {} \end{aligned} $$
(12.24)

Then ${\mathbf{x}}_n \rightarrow {\mathbf{x}}^*$ with probability one. \({\blacksquare }\)

Proof

The proof is essentially the same as for the deterministic case.

The inequality (12.11) becomes

$$\displaystyle \begin{aligned} d_{n+1} \leq d_n + \alpha_n ({\mathbf{x}}^* - {\mathbf{x}}_n)^T [\nabla f({\mathbf{x}}_n) + {\mathbf{z}}_n + {\mathbf{b}}_n] + \frac{1}{2} \alpha_n^2 K. \end{aligned} $$
(12.25)

Accordingly, $\gamma_n$ in (12.15) is replaced by

$$\displaystyle \begin{aligned} \gamma_n = \alpha_n \left[\delta(\epsilon) + ({\mathbf{x}}^* - {\mathbf{x}}_n)^T ({\mathbf{z}}_n + {\mathbf{b}}_n)\right] - \frac{1}{2} \alpha_n^2 K. \end{aligned} $$
(12.26)

Now, (12.23) implies that \({\mathbf {v}}_n := \sum _{m=0}^n \alpha _m {\mathbf {z}}_m\) is a martingale. Because of (12.24) and (12.21), one has \(E( ||{\mathbf {v}}_n||{ }^2) \leq A \sum _{m=0}^\infty \alpha _m^2 < \infty \) for all n. This implies, by the Martingale Convergence Theorem 12.3, that ${\mathbf{v}}_n$ converges to a finite random variable. Combining this fact with (12.22) shows that \(\sum _{m = n}^\infty \alpha _m [ {\mathbf {z}}_m + {\mathbf {b}}_m] \rightarrow 0\) as $n \rightarrow \infty$. Since $||{\mathbf{x}}_n - {\mathbf{x}}^*||$ is bounded, this implies that the effect of the estimation error is asymptotically negligible and the argument used in the proof of GP applies here. □

12.2.3 Martingale Convergence

We discuss the theory of martingales in Sect. 15.9. Here are the ideas we needed in the proof of Theorem 12.2.

Let $\{x_n, y_n, n \geq 0\}$ be random variables such that $E(x_n)$ is well-defined for all n. The sequence $x_n$ is said to be a martingale with respect to $\{(x_m, y_m), m \geq 0\}$ if

$$\displaystyle \begin{aligned} E[x_{n+1} | x_m, y_m, m \leq n] = x_n, \forall n. \end{aligned}$$

Theorem 12.3 (Martingale Convergence Theorem)

If a martingale x n is such that \(E(x_n^2) \leq B < \infty \) for all n, then it converges with probability one to a finite random variable. \({\blacksquare }\)

For a proof, see Theorem 15.13.

12.3 Big Data

The web makes it easy to collect a vast amount of data from many sources. Examples include the books, movies, and restaurants that people like, the websites that they visit, their mobility patterns, their medical history, and measurements from sensors. Such data can be used to recommend items that people will probably like, treatments that are likely to be effective, or people you might want to commute with, and to discover who talks to whom, which management techniques are efficient, and so on. Moreover, new technologies for storage, databases, and cloud computing make it possible to process huge amounts of data. This section explains a few formulations of such problems and algorithms to solve them (Fig. 12.9).

Fig. 12.9 The web provides access to vast amounts of data. How does one extract useful knowledge from that data?

12.3.1 Relevant Data

Many factors potentially affect an outcome, but what are the most relevant ones? For instance, the success in college of a student is correlated with her high-school GPA, her scores in advanced placement courses and standardized tests. How does one discover the factors that best predict her success? A similar situation occurs for predicting the odds of getting a particular disease, the likelihood of success of a medical treatment, and many other applications.

Identifying these important factors can be most useful to improve outcomes. For instance, if one discovers that the odds of success in college are most affected by the number of books that a student has to read in high-school and by the number of hours she spends playing computer games, then one may be able to suggest strategies for improving the odds of success.

One formulation of the problem is that the outcome Y is correlated with a collection of factors that we represent by a vector X with N ≫ 1 components. For instance, if Y is the GPA after 4 years in college, the first component $X_1$ of X might indicate the high-school GPA, the second component $X_2$ the score on a specific standardized test, $X_3$ the number of books the student had to write reports on, and so on. Intuition suggests that, although N ≫ 1, only relatively few of the components of X really affect the outcome Y in a significant way. However, we do not want to presume that we know what these components are.

Say that you want to predict Y on the basis of six components of X. Which ones should you consider? This problem turns out to be hard because there are many (about $N^6/6!$) subsets with 6 elements in \(\mathcal {N} = \{1, 2, \ldots , N\}\), and this combinatorial aspect of the problem makes it intractable when N is large. To make progress, we change the formulation slightly and resort to some heuristic (Fig. 12.10).

Fig. 12.10

The change in formulation is to consider the problem of minimizing

$$\displaystyle \begin{aligned} J(\mathbf{b}) = E\left((Y - \sum_n b_n X_n)^2\right) \end{aligned}$$

over ${\mathbf{b}} = (b_1, \ldots, b_N)$, subject to a bound on

$$\displaystyle \begin{aligned} C(\mathbf{b}) = \sum_n |b_n|. \end{aligned}$$

This is called the LASSO problem, for “least absolute shrinkage and selection operator.” Thus, the hard constraint on the number of components is replaced by a cost for using large coefficients. Intuitively, the problem is still qualitatively similar. Also, the constraint is such that the solution of the problem has many b n equal to zero. Intuitively, if a component is less useful than others, its coefficient is probably equal to zero in the solution.

One interpretation of this problem is as follows. To simplify the algebra, we assume that Y and X are zero-mean. Assume that

$$\displaystyle \begin{aligned} Y = \sum_n B_n X_n + Z, \end{aligned}$$

where Z is \(\mathcal {N}(0, \sigma ^2)\) and the coefficients $B_n$ are random and independent with a prior distribution of $B_n$ given by

$$\displaystyle \begin{aligned} f_n(b) = \frac{\lambda}{2} \exp\{ - \lambda |b|\}. \end{aligned}$$

Then

$$\displaystyle \begin{aligned} MAP[\mathbf{B} | \mathbf{X} = \mathbf{x}, Y = y] &= \arg \max_{\mathbf{b}} f_{\mathbf{B} | \mathbf{X}, Y}[\mathbf{b} | \mathbf{x}, y ] \\ & = \arg \max_{\mathbf{b}} f_{\mathbf{B}} (\mathbf{b}) f_{Y| \mathbf{X}, \mathbf{B}}[y | \mathbf{x}, \mathbf{b}] \\ & = \arg \max_{\mathbf{b}} \exp\left\{ - \frac{1}{2 \sigma^2} \left(y - \sum_n b_n x_n\right)^2 \right\}\\ & \quad \times \exp\left\{- \lambda \sum_n |b_n| \right\} \\ & = \arg \min_{\mathbf{b}} \left\{ \left(y - \sum_n b_n x_n\right)^2 + \mu \sum_n |b_n| \right\} \end{aligned} $$

with $\mu = 2 \lambda \sigma^2$. This formulation is the Lagrange multiplier formulation of the LASSO problem, where the constraint on the cost C(b) is replaced by a penalty μC(b). Thus, the LASSO problem is equivalent to finding MAP[B|X, Y ] under the assumptions stated above.
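For readers who want to experiment, here is a small sketch of the penalized form using scikit-learn's Lasso. The data-generating model, the parameter values, and the seed are illustrative assumptions; note that sklearn minimizes $(1/(2M))\|y - Xb\|^2 + \alpha \sum_n |b_n|$, so its α plays the role of μ up to scaling.

```python
import numpy as np
from sklearn.linear_model import Lasso

# LASSO sketch: most coefficients of the solution come out exactly zero.
rng = np.random.default_rng(0)
M, N = 200, 50
X = rng.normal(size=(M, N))                    # M samples of N factors
b_true = np.zeros(N)
b_true[[3, 17, 42]] = [2.0, -1.5, 1.0]         # only a few relevant factors
y = X @ b_true + 0.1 * rng.normal(size=M)

model = Lasso(alpha=0.05).fit(X, y)
print(np.nonzero(np.abs(model.coef_) > 1e-3)[0])   # expect [3, 17, 42]
```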

We explain a greedy algorithm that selects the components one by one, trying to maximize the progress that it makes with each selection. First assume that we can choose only one component $X_n$ among the N elements in X. We know that

$$\displaystyle \begin{aligned} L[Y|X_n] = \frac{\mbox{cov}(Y, X_n)}{\mbox{var}(X_n)} X_n =: b_n X_n \end{aligned}$$

and

$$\displaystyle \begin{aligned} E((Y - L[Y|X_n])^2) &= \mbox{var}(Y) - \frac{\mbox{cov}(Y, X_n)^2}{\mbox{var}(X_n)} \\ &= \mbox{var}(Y) - |\mbox{cov}(Y, X_n)| \times | b_n|. \end{aligned} $$

Thus, one unit of “cost” $C(b_n) = |b_n|$ invested in $b_n$ brings a reduction $|\mathrm{cov}(Y, X_n)|$ in the objective $J(b_n)$. It then makes sense to choose the first component with the largest value of “reward per unit cost” $|\mathrm{cov}(Y, X_n)|$. Say that this component is $X_1$ and let \(\hat Y_1 = L[Y|X_1]\).

Second, assume that we stick to our choice of $X_1$ with coefficient $b_1$ and that we look for a second component $X_n$ with $n \neq 1$ to add to our estimate. Note that

$$\displaystyle \begin{aligned} & E((Y - b_1X_1 - b_nX_n)^2) \\ &~~~= E((Y - b_1X_1)^2) - 2b_n \mbox{cov}(Y - b_1 X_1,X_n) + b_n^2 \mbox{var}(X_n) . \end{aligned} $$

This expression is minimized over $b_n$ by choosing

$$\displaystyle \begin{aligned} b_n = \frac{\mbox{cov}(Y - b_1 X_1,X_n)}{\mbox{var}(X_n)} \end{aligned}$$

and it is then equal to

$$\displaystyle \begin{aligned} E((Y - b_1X_1)^2) - \frac{\mbox{cov}(Y - b_1X_1, X_n)^2}{\mbox{var}(X_n)}. \end{aligned}$$

Thus, as before, one unit of additional cost in $C(b_1, b_n)$ invested in $b_n$ brings a reduction

$$\displaystyle \begin{aligned} |\mbox{cov}(Y - b_1X_1, X_n)| \end{aligned}$$

in the cost $J(b_1, b_n)$. This suggests that the second component $X_n$ to pick should be the one with the largest covariance with $Y - b_1 X_1$.

These observations suggest the following algorithm, called the stepwise regression algorithm. At each step k, the algorithm finds the component $X_n$ that is most correlated with the residual error \(Y - \hat Y_k\), where \(\hat Y_k\) is the current estimate. Specifically, the algorithm is as follows:

$$\displaystyle \begin{aligned} & \mbox{Step } 0: \hat Y_0 = E(Y) \mbox{ and } S_0 = \emptyset; \\ & \mbox{Step } k+1: \mbox{ Find } n \notin S_k \mbox{ that maximizes } E((Y - \hat Y_k)X_n); \\ & ~~~~~~~~~~~~~~~~~~~ \mbox{Let } S_{k+1} = S_k \cup \{n\}, \hat Y_{k+1} = L[Y | X_n, n \in S_{k+1}], k = k+1; \\ & \mbox{Repeat until } E((Y - \hat Y_k)^2) \leq \epsilon. \end{aligned} $$

In practice, one is given a collection of outcomes $\{Y^m, m = 1, \ldots, M\}$ with factors \({\mathbf {X}}^m = (X_1^m, X_2^m, \ldots , X^m_N)\). Here, each m corresponds to one sample, say one student in the college success example. From those samples, one can estimate the mean values by the sample means. Thus, at step k, one has calculated coefficients $(b_1, \ldots, b_k)$ and computes

$$\displaystyle \begin{aligned} \hat Y^m_k = b_1 X^m_1 + \cdots + b_k X^m_k . \end{aligned}$$

One then estimates \(E((Y - \hat Y_k)X_n)\) by

$$\displaystyle \begin{aligned} \frac{1}{M} \sum_{m=1}^M (Y^m - \hat Y^m_k)X^m_n. \end{aligned}$$

Also, one approximates $L[Y | X_n, n \in S_{k+1}]$ by the linear regression.

It is useful to note that, by the Law of Large Numbers, the number M of samples needed to estimate the means and covariances is not necessarily very large. Thus, although one may have data about millions of students, a reasonable estimate may be obtained from a few thousand. Recall that one can use the sample moments to compute confidence intervals for these estimates.
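Here is a sketch of this sample-based stepwise regression; the function name and the data layout (rows are samples) are our own conventions:

```python
import numpy as np

def stepwise_regression(X, Y, num_steps):
    """Greedy selection sketch: X is M x N (one row per sample), Y has length M."""
    M, N = X.shape
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean()   # center, so Step 0 handles E(Y)
    S, Yhat = [], np.zeros(M)                   # selected set S_k, estimate of Yc
    for _ in range(num_steps):
        resid = Yc - Yhat
        scores = np.abs(resid @ Xc) / M         # sample estimates of E((Y - Yhat) X_n)
        scores[S] = -np.inf                     # do not reselect a component
        S.append(int(np.argmax(scores)))
        b, *_ = np.linalg.lstsq(Xc[:, S], Yc, rcond=None)   # regression on S_{k+1}
        Yhat = Xc[:, S] @ b
    return S, b
```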

Signal processing uses a similar algorithm called matching pursuit introduced in Mallat and Zhang (1993). In that context, the problem is to find a compact representation of a signal, such as a picture or a sound. One considers a representation of the signal as a linear combination of basis functions. The matching pursuit algorithm finds the most important basis functions to use in the representation.

An Example

Our example is very small, so that we can understand the steps. We assume that all the random variables are zero-mean and that N = 3 with

$$\displaystyle \begin{aligned} \varSigma_{\mathbf{Z}} = \left[ \begin{array}{c c c c} 4 & 3 & 2 & 2 \\ 3 & 4 & 2 & 2 \\ 2 & 2 & 4 & 1 \\ 2 & 2 & 1 & 4 \end{array} \right], \end{aligned}$$

where ${\mathbf{Z}} = (Y, X_1, X_2, X_3) = (Y, {\mathbf{X}})$.

We first try the stepwise regression. The component $X_n$ most correlated with Y is $X_1$. Thus,

$$\displaystyle \begin{aligned} \hat Y_1 = L[Y|X_1] = \frac{\mbox{cov}(Y, X_1)}{\mbox{var}(X_1)} X_1 = \frac{3}{4} X_1 =: b_1 X_1. \end{aligned}$$

The next step is to compute the correlations \(E(X_n(Y - \hat Y_1))\) for n = 2, 3. We find

$$\displaystyle \begin{aligned} & E(X_2(Y - \hat Y_1)) = E(X_2(Y - b_1X_1)) = 2 - 2b_1 = 0.5 \\ & E(X_3(Y - \hat Y_1)) = E(X_3(Y - b_1 X_1)) = 2 - 2b_1 = 0.5. \end{aligned} $$

Hence, the algorithm selects $X_2$ as the next component (breaking the tie with $X_3$ arbitrarily), and one finds

$$\displaystyle \begin{aligned} \hat Y_2 = L[Y|X_1, X_2] = \left[ \begin{array}{c c } 3 & 2 \end{array} \right] \left[ \begin{array}{c c } 4 & 2 \\ 2 & 4 \end{array} \right]^{-1} \left[ \begin{array}{c } X_1 \\ X_2 \end{array} \right] = \frac{2}{3} X_1 + \frac{1}{6} X_2. \end{aligned}$$

The resulting error variance is

$$\displaystyle \begin{aligned} E\left((Y - \hat Y_2)^2\right) = \frac{5}{3}. \end{aligned}$$
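The numbers in this example are easy to check with numpy:

```python
import numpy as np

# Verify the worked example.
Sigma = np.array([[4., 3., 2., 2.],
                  [3., 4., 2., 2.],
                  [2., 2., 4., 1.],
                  [2., 2., 1., 4.]])          # covariance of (Y, X_1, X_2, X_3)
c = Sigma[0, 1:3]                             # cov(Y, (X_1, X_2)) = [3, 2]
S12 = Sigma[1:3, 1:3]                         # covariance of (X_1, X_2)
print(c @ np.linalg.inv(S12))                 # coefficients [2/3, 1/6]
print(Sigma[0, 0] - c @ np.linalg.inv(S12) @ c)   # error variance 5/3
```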

12.3.2 Compressed Sensing

Complex looking objects may have a simple hidden structure. For example, the signal s(t) shown in Fig. 12.11 is the sum of three sine waves. That is,

$$\displaystyle \begin{aligned} s(t) = \sum_{i=1}^3 b_i\sin{}(2 \pi \phi_i t), t \geq 0. \end{aligned} $$
(12.27)

Fig. 12.11 A complex looking signal that is the sum of three sine waves

A classical result, called the Nyquist sampling theorem, states that one can reconstruct a signal exactly from its values measured every T seconds, provided that 1/T is at least twice the largest frequency in the signal. According to that result, we could reconstruct s(t) by specifying its value every T seconds if $T < 1/(2\phi_i)$ for i = 1, 2, 3. However, in the case of (12.27), one can describe s(t) completely by specifying the values of the six parameters $\{b_i, \phi_i, i = 1, 2, 3\}$. Also, it seems clear in this particular case that one does not need to know many sample values $s(t_k)$ at different times $t_k$ to be able to reconstruct the six parameters and therefore the signal s(t) for all t ≥ 0. Moreover, one expects the reconstruction to be unique if we choose a few sampling times $t_k$ randomly. The same is true if the representation is in terms of different functions, such as polynomials or wavelets.

This example suggests that if a signal has a simple representation in terms of some basis functions (e.g., sine waves), then it is possible to reconstruct it exactly from a small number of samples.

Computing the parameters of (12.27) from a number of samples s(t k) is highly nontrivial, so that the fact that it is possible does not seem very useful. However, a slightly different perspective shows that the problem can be solved. Assume that we have a collection of functions (Fig. 12.12)

$$\displaystyle \begin{aligned} g_n(t) = \sin{}(2 \pi f_n t), t \geq 0, n = 1, \ldots, N. \end{aligned}$$

Assume also that the frequencies $\{\phi_1, \phi_2, \phi_3\}$ in s(t) are in the collection $\{f_n, n = 1, \ldots, N\}$. We can then try to find the vector ${\mathbf{a}} = \{a_n, n = 1, \ldots, N\}$ such that

$$\displaystyle \begin{aligned} s(t_k) = \sum_{n=1}^N a_n g_n(t_k), \mbox{ for } k = 1, \ldots, K. \end{aligned}$$

We should be able to do this with three functions, by choosing the appropriate coefficients. How do we do this systematically? A first idea is to formulate the following problem:

$$\displaystyle \begin{aligned} & \mbox{Minimize } \sum_n 1\{a_n \neq 0\} \\ & \mbox{ such that } s(t_k) = \sum_n a_n g_n(t_k), \mbox{ for } k = 1, \ldots, K. \end{aligned} $$

That is, one tries to find the most economical representation of s(t) as a linear combination of functions in the collection.

Fig. 12.12 A tough nut to crack!

Unfortunately, this problem is intractable because of the number of choices of sets of nonzero coefficients a n, a difficulty we already faced in the previous section. The key trick is, as before, to convert the problem into a much easier one that retains the main goal.

The new problem is as follows:

$$\displaystyle \begin{aligned} & \mbox{Minimize } \sum_n |a_n| \\ & \mbox{ such that } s(t_k) = \sum_n a_n g_n(t_k), \mbox{ for } k = 1, \ldots, K. \\ \end{aligned} $$
(12.28)

Trying to minimize the sum of the absolute values of the coefficients $a_n$ is a relaxation of limiting the number of nonzero coefficients. (Simple examples show that choosing $\sum_n |a_n|^2$ instead of $\sum_n |a_n|$ often leads to bad reconstructions.) The result is that if K is large enough, then the solution is exact with a high probability.

Theorem 12.4 (Exact Recovery from Random Samples)

The signal s(t) can be recovered exactly with a very high probability from K samples by solving (12.28) if

$$\displaystyle \begin{aligned} K \geq C \times B \times \log(N). \end{aligned}$$

In this expression, C is a small constant, B is the number of sine waves that make up s(t), and N is the number of sine waves in the collection. \({\blacksquare }\)

Note that this is a probabilistic statement. Indeed, one could be unlucky and choose sampling times $t_k$ where $s(t_k) = 0$ (see Fig. 12.11), and these samples would not enable the reconstruction of s(t). More generally, the samples could be chosen so that they do not enable an exact reconstruction. The theorem says that the probability of poor samples is very small.

Thus, in our example, where B = 3, one can expect to recover the signal s(t) exactly from about \(3 \log (100) \approx 14\) samples if N ≤ 100.

Problem (12.28) is equivalent to the following linear programming problem, which implies that it is easy to solve:

$$\displaystyle \begin{aligned} & \mbox{Minimize } \sum_n b_n \\ & \mbox{ such that } s(t_k) = \sum_n a_n g_n(t_k), \mbox{ for } k = 1, \ldots, K \\ & \mbox{ and } - b_n \leq a_n \leq b_n, \mbox{ for } n = 1, \ldots, N. {} \end{aligned} $$
(12.29)

Assume that

$$\displaystyle \begin{aligned} s(t) = \sin{}(2 \pi t) + 2 \sin{}(2.4 \pi t) + 3 \sin{}(3.2 \pi t), t \in [0, 1]. \end{aligned} $$
(12.30)

The frequencies in s(t) are $\phi_1 = 1$, $\phi_2 = 1.2$, and $\phi_3 = 1.6$. The collection of functions is

$$\displaystyle \begin{aligned} \{g_n(t) = \sin{}(2 \pi f_n t), n = 1, \ldots, 100\}, \end{aligned}$$

where $f_n = n/10$.

The frequencies of the sine waves in the collection are 0.1, 0.2, …, 10. Thus, the frequencies in s(t) are contained in the collection, so that perfect reconstruction is possible as

$$\displaystyle \begin{aligned} s(t) = \sum_n a_n g_n(t) \end{aligned}$$

with $a_{10} = 1$, $a_{12} = 2$, and $a_{16} = 3$, and all the other coefficients $a_n$ equal to zero. The theory tells us that reconstruction should be possible with about 14 samples. We choose 15 sampling times $t_k$ randomly and uniformly in [0, 1]. We then ask Python to solve (12.29). The solution is shown in Fig. 12.13.
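One way to set up (12.29) in Python is with scipy's linprog; the variable ordering (the a's first, then the b's), the seed, and the tolerance for calling a coefficient nonzero are our choices:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
N, K = 100, 15
f = np.arange(1, N + 1) / 10                 # dictionary frequencies f_n = n/10
t = rng.uniform(0, 1, K)                     # random sampling times t_k

def s(t):                                    # the signal (12.30)
    return np.sin(2*np.pi*t) + 2*np.sin(2.4*np.pi*t) + 3*np.sin(3.2*np.pi*t)

G = np.sin(2 * np.pi * np.outer(t, f))       # G[k, n] = g_n(t_k)

# Variables (a_1..a_N, b_1..b_N): minimize sum_n b_n
# subject to G a = s(t_k) and -b_n <= a_n <= b_n.
c = np.concatenate([np.zeros(N), np.ones(N)])
A_eq = np.hstack([G, np.zeros((K, N))])
I = np.eye(N)
A_ub = np.vstack([np.hstack([I, -I]),        #  a_n - b_n <= 0
                  np.hstack([-I, -I])])      # -a_n - b_n <= 0
res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * N), A_eq=A_eq, b_eq=s(t),
              bounds=[(None, None)] * N + [(0, None)] * N)
a = res.x[:N]
idx = np.nonzero(np.abs(a) > 1e-6)[0] + 1
print(idx, a[idx - 1])                       # expect n = 10, 12, 16 with 1, 2, 3
```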

Fig. 12.13 Exact reconstruction of the signal (12.30) with 15 samples chosen uniformly in [0, 1]. The signal is in green and the reconstruction in blue

Another Example

Figure 12.14, from Candes and Romberg (2007), shows another example. The image on top has about one million pixels. However, it can be represented as a linear combination of 25,000 functions called wavelets. Thus, the compressed sensing results tell us that one should be able to reconstruct the picture exactly from a small multiple of 25,000 randomly chosen pixels. It turns out that this is indeed the case with about 96,000 pixels.

Fig. 12.14 Original image with $10^6$ pixels (top) and reconstruction from 96,000 randomly chosen pixels (bottom)

12.3.3 Recommendation Systems

Which movie would you like to watch? One formulation of the problem is as follows. There is a K × N matrix Y. The entry Y(k, n) of the matrix indicates how much user k likes movie n. However, one does not get to observe the complete matrix. Instead, one observes a number of entries: when users actually watch movies, one gets to record their rankings. The problem is to complete the matrix in order to recommend movies to users.

This matrix completion is based on the idea that the entries of the matrix are not independent. For instance, assume that Bob and Alice have seen the same five movies and gave them the same ranking. Assume that Bob has seen another movie he loved. Chances are that Alice would also like it.

To formulate this dependency of the entries of the matrix Y, one observes that even though there are thousands of movies, a few factors govern how much users like them. Thus, it is reasonable to expect that many columns of the matrix are combinations of a few common vectors that correspond to the hidden factors that influence the rankings by users. That is, a few independent vectors get combined into linear combinations that form the columns. Consequently, the matrix Y has a small number of linearly independent columns, i.e., it is a low-rank matrix. This observation leads to the question of whether one can recover a low-rank matrix Y from a subset of its entries.

One possible formulation is

$$\displaystyle \begin{aligned} \mbox{ Minimize rank}(X) \mbox{ s.t. } X(k, n) = M(k, n), \forall (k, n) \in \varOmega. \end{aligned}$$

Here, {M(k, n), (k, n) ∈ Ω} is the set of observed entries of the matrix. Thus, one wishes to find the lowest-rank matrix X that is consistent with the observed entries.

As before, such a problem is hard. To simplify the problem, one replaces the rank by the nuclear norm $||X||_*$, where

$$\displaystyle \begin{aligned} ||X||{}_* = \sum_i \sigma_i, \end{aligned}$$

where the $\sigma_i$ are the singular values of the matrix X. The rank of the matrix counts the number of nonzero singular values. The nuclear norm is a convex function of the entries of the matrix, which makes the problem a convex programming problem that is easy to solve. Remarkably, as in the case of compressed sensing, the solution of the modified problem is very good.

Theorem 12.5 (Exact Matrix Completion from Random Entries)

The solution of the problem

$$\displaystyle \begin{aligned} \mathit{\mbox{Minimize }} ||X||{}_* \mathit{\mbox{ s.t. }} X(k, n) = M(k, n), \forall (k, n) \in \varOmega \end{aligned}$$

is the matrix Y  with a very high probability if the observed entries are chosen uniformly at random and if there are at least

$$\displaystyle \begin{aligned} C n^{1.25} r \log(n) \end{aligned}$$

observations. In this expression, C is a small constant, \(n = \max \{K, N\}\) , and r is the rank of Y . \({\blacksquare }\)

This result is useful in many situations where the required number of observations is much smaller than K × N, the number of entries of Y. The literature contains many extensions of these results and details on numerical solutions.
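A small sketch of the nuclear norm formulation with cvxpy (an assumption: the text does not prescribe a solver; the sizes and seed are arbitrary). With a 20 × 20 rank-2 matrix and about half the entries observed, recovery is typically exact up to numerical precision:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
K, N, r = 20, 20, 2
Y = rng.normal(size=(K, r)) @ rng.normal(size=(r, N))   # a rank-2 "ratings" matrix
mask = (rng.uniform(size=(K, N)) < 0.5).astype(float)   # observed entries Omega

X = cp.Variable((K, N))
constraints = [cp.multiply(mask, X) == mask * Y]        # X agrees with Y on Omega
cp.Problem(cp.Minimize(cp.normNuc(X)), constraints).solve()
print(np.abs(X.value - Y).max())                        # near zero on success
```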

12.4 Deep Neural Networks

Deep neural networks (DNNs) are electronic processing circuits inspired by the structure of the brain. For instance, our vision system consists of layers. The first layer is in the retina, which captures the intensity and color of zones in our field of vision. The next layer extracts edges and motion. The brain receives these signals and extracts higher-level features. A simplistic model of this processing is that the neurons are arranged in successive layers, where each neuron in one layer gets inputs from neurons in the previous layer through connections called synapses. Presumably, the weights of these connections get tuned as we grow up and learn to perform tasks, possibly by trial and error. Figure 12.15 sketches a DNN. The inputs at the left of the DNN are the features X, from which the system produces the probability that X corresponds to a dog, or the estimate of some quantity.

Fig. 12.15 A neural network

Each circle is a circuit that we call a neuron. In the figure, $z_k$ is the output of neuron k. It is multiplied by $\theta_k$ to contribute the quantity $\theta_k z_k$ to the total input $V_l$ of neuron l. The parameter $\theta_k$ represents the strength of the connection between neuron k and neuron l. Thus, $V_l = \sum_n \theta_n z_n$, where the sum is over all the neurons n of the layer to the immediate left of neuron l, including neuron k. The output $z_l$ of neuron l is equal to $f(a_l, V_l)$, where $a_l$ is a parameter specific to that neuron and f is some function that we discuss later.

With this structure, it is easy to compute the derivative of some output Z with respect to some weight, say $\theta_k$. We do it in the last section of this chapter.

What should the functions f(a, V) be? Inspired by the idea that a neuron fires if it is excited enough, one may use a function f(a, V) that is close to 1 if V > a and close to −1 if V < a. To make the function differentiable, one may use f(a, V) = g(V − a) with

$$\displaystyle \begin{aligned} g(v) = \frac{2}{1 + e^{- \beta v}} - 1, \end{aligned}$$

where β is a positive constant. If β is large, then $e^{-\beta v}$ goes from a very large to a very small value as v goes from negative to positive. Consequently, g(v) goes from −1 to +1 (Fig. 12.16).
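As an illustration of this neuron model, here is a tiny forward pass through two layers; the layer sizes and the random weights are arbitrary assumptions:

```python
import numpy as np

beta = 5.0
def g(v):                                   # the function g above
    return 2 / (1 + np.exp(-beta * v)) - 1

def layer(z, Theta, a):
    """One layer: V_l = sum_n Theta[l, n] z_n, output z_l = g(V_l - a_l)."""
    return g(Theta @ z - a)

rng = np.random.default_rng(0)
z0 = rng.normal(size=4)                              # input features X
z1 = layer(z0, rng.normal(size=(3, 4)), rng.normal(size=3))
z2 = layer(z1, rng.normal(size=(2, 3)), rng.normal(size=2))
print(z2)
```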

Fig. 12.16 The logistic function

The DNN is able to model many functions by adjusting its parameters. To see why, consider neuron l. The output of this neuron indicates whether the linear combination $V_l = \sum_n \theta_n z_n$ is larger or smaller than the threshold $a_l$ of the neuron. Consequently, the first layer divides the set of inputs into regions separated by hyperplanes. The next layer then further divides these regions. The number of regions that can be obtained by this process is exponential in the number of layers. The final layer then assigns values to the regions, thus approximating a complex function of the input vector by an almost piecewise constant function.

The missing piece of the puzzle is that, unfortunately, the cost function is not a nice convex function of the parameters of the DNN. Instead, it typically has many local minima. Consequently, by using the SGD algorithm, the tuning of the DNN may get stuck in a local minimum. Also, to reduce the number of parameters to tune, one usually selects a few layers with fixed parameters, such as edge detectors in vision systems. Thus, the selection of the DNN becomes somewhat of an art, like cooking.

Thus, it remains impossible to predict whether the DNN will be a good technique for machine learning in a specific application. The answer of the practitioners is to try and see. If it works, they publish a paper. We are far from the proven convergence results of adaptive systems. Ah, nostalgia….

There is a worrisome aspect to these black-box approaches. When the DNN has been tuned and seems to perform well on many trials, not only does one not understand what it really does, but one has no guarantee that it will not seriously misbehave for some inputs. Imagine then a killer drone with a DNN target recognition system…. It is not surprising that a number of serious scientists have raised concerns about “artificial stupidity” and the need to build safeguards into such systems. “Open the pod bay doors, Hal.”

12.4.1 Calculating Derivatives

Let’s compute the derivative of Z with respect to $\theta_k$.

Say you increase $\theta_k$ by $\epsilon$. This increases $V_l$ by $\epsilon z_k$. In turn, this increases $z_l$ by $\delta z_l := \epsilon z_k f'(a_l, V_l)$, where $f'(a_l, V_l)$ is the derivative of $f(a_l, V_l)$ with respect to $V_l$. Consequently, this increases $V_m$ by $\theta_l \delta z_l$. The result is an increase of $z_m$ by $\delta z_m = \theta_l \delta z_l f'(a_m, V_m)$. Finally, this increases $V_r$ by $\theta_m \delta z_m$ and Z by $\theta_m \delta z_m f'(a_r, V_r)$. We conclude that

$$\displaystyle \begin{aligned} \frac{dZ}{d \theta_k} = z_k f'(a_l, V_l) \theta_l f'(a_m,V_m) \theta_m f'(a_r,V_r). \end{aligned}$$

The details do not matter too much. The point is that the structure of the network makes the calculation of the derivatives straightforward.
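One can check the chain-rule computation numerically. The sketch below hard-codes a single path k → l → m → r (the network and the parameter values are made up) and compares the formula with a finite difference:

```python
import numpy as np

beta = 1.0
g  = lambda v: 2 / (1 + np.exp(-beta * v)) - 1
dg = lambda v: 2 * beta * np.exp(-beta * v) / (1 + np.exp(-beta * v))**2

def forward(theta_k, z_k, theta_l, theta_m, a_l, a_m, a_r):
    V_l = theta_k * z_k;  z_l = g(V_l - a_l)      # neuron l
    V_m = theta_l * z_l;  z_m = g(V_m - a_m)      # neuron m
    V_r = theta_m * z_m;  Z   = g(V_r - a_r)      # output neuron r
    return Z, (V_l, V_m, V_r)

p = dict(theta_k=0.7, z_k=0.9, theta_l=1.1, theta_m=-0.8,
         a_l=0.1, a_m=-0.2, a_r=0.3)
Z, (V_l, V_m, V_r) = forward(**p)

# dZ/d theta_k from the displayed formula, with f'(a, V) = g'(V - a):
grad = (p['z_k'] * dg(V_l - p['a_l']) * p['theta_l']
        * dg(V_m - p['a_m']) * p['theta_m'] * dg(V_r - p['a_r']))

eps = 1e-6
p2 = dict(p); p2['theta_k'] += eps
print(grad, (forward(**p2)[0] - Z) / eps)     # the two values agree closely
```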

12.5 Summary

  • Online Linear Regression;

  • Convex Sets and Functions;

  • Gradient Projection Algorithm;

  • Stochastic Gradient Projection Algorithm;

  • Deep Neural Networks;

  • Martingale Convergence Theorem;

  • Big Data: Relevant Data, Compressed Sensing, Recommendation Systems.

12.5.1 Key Equations and Formulas


12.6 References

Online linear regression algorithms are discussed in Strehl and Littman (2007). The book Bertsekas and Tsitsiklis (1989) is an excellent presentation of distributed optimization algorithms. It explains the gradient projection algorithm and distributed implementations. The LASSO algorithm and many other methods are clearly explained in Hastie et al. (2009), together with applications. The theory of martingales is nicely presented by its father in Doob (1953). Theorem 12.4 is from Candes and Romberg (2007).

12.7 Problems

Problem 12.1

Let $\{Y_n, n \geq 1\}$ be i.i.d. U[0, 1] random variables and $\{Z_n, n \geq 1\}$ be i.i.d. \(\mathcal {N}(0, 1)\) random variables. Define $X_n = 1\{Y_n \geq a\} + Z_n$ for some constant a. The goal of the problem is to design an algorithm that “learns” the value of a from the observation of pairs $(X_n, Y_n)$. We construct a model

$$\displaystyle \begin{aligned} X_n = g(Y_n - \theta), \end{aligned}$$

where

$$\displaystyle \begin{aligned} g(u) = \frac{1}{1 + \exp\{ - \lambda u\}} \end{aligned} $$
(12.31)

with λ = 10. Note that when u > 0, the denominator of g(u) is close to 1, so that g(u) ≈ 1. Also, when u < 0, the denominator is large and g(u) ≈ 0. Thus, g(u) ≈ 1{u ≥ 0}. The function g(⋅) is called the logistic function. Use SGD in Python to estimate θ (Fig. 12.16).

Fig. 12.16 The logistic function (12.31) with λ = 10

Problem 12.2

Implement the stepwise regression algorithm with

$$\displaystyle \begin{aligned} \varSigma_{\mathbf{Z}} = \left[ \begin{array}{c c c c} 10 & 5 & 6 & 7 \\ 5 & 6 & 5 & 2 \\ 6 & 5 & 11 & 5 \\ 7 & 2 & 5 & 6 \end{array} \right], \end{aligned}$$

where ${\mathbf{Z}} = (Y, X_1, X_2, X_3) = (Y, {\mathbf{X}})$.

Problem 12.3

Implement the compressed sensing algorithm with

$$\displaystyle \begin{aligned} s(t) = 3 \sin{}( 2\pi t) + 2 \sin{}( 3 \pi t) + 4 \sin (4 \pi t), t \in [0, 1], \end{aligned}$$

where you choose sampling times t k independently and uniformly in [0, 1]. Assume that the collection of sine waves has the frequencies {0.1, 0.2, …, 3}.

What is the minimum number of samples that you need for exact reconstruction?