An Invitation to Compressive Sensing

  • Simon Foucart
  • Holger Rauhut
Part of the Applied and Numerical Harmonic Analysis book series (ANHA)

Abstract

This first chapter formulates the objectives of compressive sensing. It introduces the standard compressive sensing problem studied throughout the book and reveals its ubiquity in many concrete situations by providing a selection of motivations, applications, and extensions of the theory. It concludes with an overview of the book that summarizes the content of each of the following chapters.

Keywords

sparsity · compressibility · algorithms · random matrices · stability · single-pixel camera · magnetic resonance imaging · radar · sampling theory · sparse approximation · error correction · statistics and machine learning · low-rank matrix recovery and matrix completion

This first chapter introduces the standard compressive sensing problem and gives an overview of the content of this book. Since the mathematical theory is highly motivated by real-life problems, we also briefly describe some of the potential applications.

1.1 What is Compressive Sensing?

In many practical problems of science and technology, one encounters the task of inferring quantities of interest from measured information. For instance, in signal and image processing, one would like to reconstruct a signal from measured data. When the information acquisition process is linear, the problem reduces to solving a linear system of equations. In mathematical terms, the observed data \(\mathbf{y} \in {\mathbb{C}}^{m}\) is connected to the signal \(\mathbf{x} \in {\mathbb{C}}^{N}\) of interest via
$$\displaystyle{ \mathbf{A}\mathbf{x} = \mathbf{y}. }$$
(1.1)

The matrix \(\mathbf{A}\,\in \,{\mathbb{C}}^{m\times N}\) models the linear measurement (information) process. Then one tries to recover the vector \(\mathbf{x}\,\in \,{\mathbb{C}}^{N}\) by solving the above linear system. Traditional wisdom suggests that the number m of measurements, i.e., the amount of measured data, must be at least as large as the signal length N (the number of components of \(\mathbf{x}\)). This principle is the basis for most devices used in current technology, such as analog-to-digital conversion, medical imaging, radar, and mobile communication. Indeed, if m < N, then classical linear algebra indicates that the linear system (1.1) is underdetermined and that there are infinitely many solutions (provided, of course, that there exists at least one). In other words, without additional information, it is impossible to recover \(\mathbf{x}\) from \(\mathbf{y}\) in the case m < N. This fact also relates to the Shannon sampling theorem, which states that the sampling rate of a continuous-time signal must be twice its highest frequency in order to ensure reconstruction.

Thus, it came as a surprise that under certain assumptions it is actually possible to reconstruct signals when the number m of available measurements is smaller than the signal length N. Even more surprisingly, efficient algorithms do exist for the reconstruction. The underlying assumption which makes all this possible is sparsity. The research area associated to this phenomenon has become known as compressive sensing, compressed sensing, compressive sampling, or sparse recovery. This whole book is devoted to the mathematics underlying this field.

Sparsity. A signal is called sparse if most of its components are zero. As empirically observed, many real-world signals are compressible in the sense that they are well approximated by sparse signals—often after an appropriate change of basis. This explains why compression techniques such as JPEG, MPEG, or MP3 work so well in practice. For instance, JPEG relies on the sparsity of images in the discrete cosine basis or wavelet basis and achieves compression by only storing the largest discrete cosine or wavelet coefficients. The other coefficients are simply set to zero. We refer to Fig. 1.1 for an illustration of the fact that natural images are sparse in the wavelet domain.
Fig. 1.1

Antonella, Niels, and Paulina. Top: Original Image. Bottom: Reconstruction using 1% of the largest absolute wavelet coefficients, i.e., 99% of the coefficients are set to zero

Let us consider again the acquisition of a signal and the resulting measured data. With the additional knowledge that the signal is sparse or compressible, the traditional approach of taking at least as many measurements as the signal length seems to waste resources: At first, substantial efforts are devoted to measuring all entries of the signal and then most coefficients are discarded in the compressed version. Instead, one would want to acquire the compressed version of a signal “directly” via significantly fewer measured data than the signal length—exploiting the sparsity or compressibility of the signal. In other words, we would like to compressively sense a compressible signal! This constitutes the basic goal of compressive sensing.

We emphasize that the main difficulty here lies in the locations of the nonzero entries of the vector \(\mathbf{x}\) not being known beforehand. If they were, one would simply reduce the matrix \(\mathbf{A}\) to the columns indexed by this location set. The resulting system of linear equations then becomes overdetermined and one can solve for the nonzero entries of the signal. Not knowing the nonzero locations of the vector to be reconstructed introduces some nonlinearity since s-sparse vectors (those having at most s nonzero coefficients) form a nonlinear set. Indeed, adding two s-sparse vectors gives a 2s-sparse vector in general. Thus, any successful reconstruction method will necessarily be nonlinear.

Intuitively, the complexity or “intrinsic” information content of a compressible signal is much smaller than its signal length (otherwise compression would not be possible). So one may argue that the required amount of data (number of measurements) should be proportional to this intrinsic information content rather than the signal length. Nevertheless, it is not immediately clear how to achieve the reconstruction in this scenario.

Looking closer at the standard compressive sensing problem consisting in the reconstruction of a sparse vector \(\mathbf{x} \in {\mathbb{C}}^{N}\) from underdetermined measurements \(\mathbf{y} = \mathbf{A}\mathbf{x} \in {\mathbb{C}}^{m}\), m < N, one essentially identifies two questions:
  • How should one design the linear measurement process? In other words, what matrices \(\mathbf{A} \in {\mathbb{C}}^{m\times N}\) are suitable?

  • How can one reconstruct \(\mathbf{x}\) from \(\mathbf{y} = \mathbf{A}\mathbf{x}\)? In other words, what are efficient reconstruction algorithms?

These two questions are not entirely independent, as the reconstruction algorithm needs to take \(\mathbf{A}\) into account, but we will see that one can often separate the analysis of the matrix \(\mathbf{A}\) from the analysis of the algorithm.

Let us note that the first question is far from trivial. In fact, compressive sensing is not suited to arbitrary matrices \(\mathbf{A} \in {\mathbb{C}}^{m\times N}\). For instance, if \(\mathbf{A}\) is made of rows of the identity matrix, then \(\mathbf{y} = \mathbf{A}\mathbf{x}\) simply picks some entries of \(\mathbf{x}\), and hence, it contains mostly zero entries. In particular, no information is obtained about the nonzero entries of \(\mathbf{x}\) not caught in \(\mathbf{y}\), and the reconstruction appears impossible for such a matrix \(\mathbf{A}\). Therefore, compressive sensing is not only concerned with the recovery algorithm—the first question on the design of the measurement matrix is equally important and delicate. We also emphasize that the matrix \(\mathbf{A}\) should ideally be designed for all signals \(\mathbf{x}\) simultaneously, with a measurement process which is nonadaptive in the sense that the type of measurement for the datum \(y_{j}\) (i.e., the jth row of \(\mathbf{A}\)) does not depend on the previously observed data \(y_{1}, \ldots,y_{j-1}\). As it turns out, adaptive measurements do not provide better theoretical performance in general (at least in a sense to be made precise in  Chap. 10).

Algorithms. For practical purposes, the availability of reasonably fast reconstruction algorithms is essential. This feature is arguably the one which brought so much attention to compressive sensing. The first algorithmic approach coming to mind is probably ℓ0-minimization. Introducing the notation \(\|\mathbf{x}\|_{0}\) for the number of nonzero entries of a vector \(\mathbf{x}\), it is natural to try to reconstruct \(\mathbf{x}\) as a solution of the combinatorial optimization problem
$$\displaystyle{\mathrm{minimize}\,\|\mathbf{z}\|_{0}\quad \mbox{ subject to }\mathbf{A}\mathbf{z} = \mathbf{y}.}$$
In words, we search for the sparsest vector consistent with the measured data \(\mathbf{y} = \mathbf{A}\mathbf{x}\). Unfortunately, ℓ0-minimization is NP-hard in general. Thus, it may seem quite surprising that fast and provably effective reconstruction algorithms do exist. A very popular and by now well-understood method is basis pursuit or ℓ1-minimization, which consists in finding the minimizer of the problem
$$\displaystyle{ \text{minimize}\,\|\mathbf{z}\|_{1}\quad \mbox{ subject to }\mathbf{A}\mathbf{z} = \mathbf{y}. }$$
(1.2)
Since the ℓ1-norm \(\|\cdot \|_{1}\) is a convex function, this optimization problem can be solved with efficient methods from convex optimization. Basis pursuit can be interpreted as the convex relaxation of ℓ0-minimization. Alternative reconstruction methods include greedy-type methods such as orthogonal matching pursuit, as well as thresholding-based methods including iterative hard thresholding. We will see that under suitable assumptions all these methods indeed do recover sparse vectors.
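For concreteness, basis pursuit for real-valued data can be recast as a linear program and handed to an off-the-shelf solver. The following is a minimal sketch (not code from the book), assuming NumPy and SciPy are available; the function name basis_pursuit is ours.

```python
# Basis pursuit (l1-minimization) for real data, recast as a linear program:
#   minimize sum(t)  subject to  -t <= z <= t  and  A z = y.
# Minimal sketch, not the book's code.
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, y):
    m, N = A.shape
    c = np.concatenate([np.zeros(N), np.ones(N)])      # variables [z; t], minimize sum(t)
    I = np.eye(N)
    A_ub = np.block([[I, -I], [-I, -I]])                # z - t <= 0 and -z - t <= 0
    b_ub = np.zeros(2 * N)
    A_eq = np.hstack([A, np.zeros((m, N))])             # A z = y (t does not enter)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
                  bounds=[(None, None)] * N + [(0, None)] * N)
    return res.x[:N]
```

On a Gaussian m × N matrix with an s-sparse vector \(\mathbf{x}\) and m of the order of s ln(N/s), this typically returns \(\mathbf{x}\) up to numerical precision.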
Before continuing, we invite the reader to look at Figs. 1.2 and 1.3, which illustrate the power of compressive sensing. They show an example of a signal of length 64, which is 5-sparse in the Fourier domain. It is recovered exactly by the method of basis pursuit (ℓ1-minimization) from only 16 samples in the time domain. For reference, a traditional linear method based on ℓ2-minimization is also displayed. It clearly fails in reconstructing the original sparse spectrum.
Fig. 1.2

Top: 5-sparse vector of Fourier coefficients of length 64. Bottom: real part of time-domain signal with 16 samples

Fig. 1.3

Top: poor reconstruction via ℓ2-minimization. Bottom: exact reconstruction via ℓ1-minimization
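The experiment of Figs. 1.2 and 1.3 can be imitated along the following lines. This sketch replaces the complex Fourier basis of the figures by a real-valued cosine basis so that the basis_pursuit routine sketched above applies directly; the minimum ℓ2-norm solution is computed for comparison.

```python
# A variant of the experiment of Figs. 1.2 and 1.3: a length-64 signal that is
# 5-sparse in a cosine basis is recovered from 16 time samples by
# l1-minimization, while the minimum l2-norm solution is not sparse.
# Sketch only; assumes the basis_pursuit function above.
import numpy as np
from scipy.fft import idct

rng = np.random.default_rng(7)
N, s, m = 64, 5, 16
W = idct(np.eye(N), axis=0, norm="ortho")        # columns are cosine basis vectors

x = np.zeros(N)
x[rng.choice(N, size=s, replace=False)] = rng.standard_normal(s)
z = W @ x                                        # time-domain signal

samples = rng.choice(N, size=m, replace=False)   # 16 random time samples
A, y = W[samples, :], z[samples]

x_l1 = basis_pursuit(A, y)                       # typically reproduces x exactly
x_l2 = np.linalg.pinv(A) @ y                     # minimum l2-norm solution: not sparse
```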

Random Matrices. Producing adequate measurement matrices \(\mathbf{A}\) is a remarkably intriguing endeavor. To date, it is an open problem to construct explicit matrices which are provably optimal in a compressive sensing setting. Certain constructions from sparse approximation and coding theory (e.g., equiangular tight frames) yield fair reconstruction guarantees, but these fall considerably short of the optimal achievable bounds. A breakthrough is achieved by resorting to random matrices—this discovery can be viewed as the birth of compressive sensing. Simple examples are Gaussian matrices, whose entries consist of independent random variables following a standard normal distribution, and Bernoulli matrices, whose entries are independent random variables taking the values +1 and −1 with equal probability. A key result in compressive sensing states that, with high probability on the random draw of an m × N Gaussian or Bernoulli matrix \(\mathbf{A}\), all s-sparse vectors \(\mathbf{x}\) can be reconstructed from \(\mathbf{y} = \mathbf{A}\mathbf{x}\) using a variety of algorithms provided
$$\displaystyle{ m \geq Cs\ln (N/s), }$$
(1.3)
where C > 0 is a universal constant (independent of s, m, and N). This bound is in fact optimal.

According to (1.3), the amount m of data needed to recover s-sparse vectors scales linearly in s, while the signal length N only has a mild logarithmic influence. In particular, if the sparsity s is small compared to N, then the number m of measurements can also be chosen small in comparison to N, so that exact solutions of an underdetermined system of linear equations become plausible! This fascinating discovery impacts many potential applications.

We now invite the reader to examine Fig. 1.4. It compares the performance of two algorithms, namely, basis pursuit and hard thresholding pursuit, for the recovery of sparse vectors \(\mathbf{x} \in {\mathbb{C}}^{N}\) from the measurement vectors \(\mathbf{y} = \mathbf{A}\mathbf{x} \in {\mathbb{C}}^{m}\) based on simulations involving Gaussian random matrices \(\mathbf{A}\) and randomly chosen s-sparse vectors \(\mathbf{x}\). With a fixed sparsity s, the top plot shows the percentage of vectors \(\mathbf{x}\) that were successfully recovered as a function of the number m of measurements. In particular, it indicates how large m has to be in comparison with s for the recovery to be guaranteed. With a fixed number m of measurements, the bottom plot shows the percentage of vectors \(\mathbf{x}\) that were successfully recovered as a function of their sparsity s. In particular, it indicates how small s has to be in comparison with m for the recovery to be guaranteed. We note that the algorithm performing best is different for these two plots. This is due to the probability distribution chosen for the nonzero entries of the sparse vectors: The top plot used a Rademacher distribution while the bottom plot used a Gaussian distribution.
Fig. 1.4

Top: percentage of successful recoveries for Rademacher sparse vectors. Bottom: percentage of successful recoveries for Gaussian sparse vectors
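A stripped-down version of the experiment behind Fig. 1.4 might look as follows; this sketch assumes the basis_pursuit routine from above and uses much smaller problem sizes and fewer trials than the actual figure.

```python
# Empirical recovery rates of basis pursuit for Gaussian matrices, in the
# spirit of Fig. 1.4 (much smaller sizes and fewer trials). Sketch assuming
# the basis_pursuit function sketched above.
import numpy as np

def success_rate(N=100, s=8, m_values=range(15, 65, 5), trials=20, tol=1e-4):
    rng = np.random.default_rng(0)
    rates = []
    for m in m_values:
        hits = 0
        for _ in range(trials):
            A = rng.standard_normal((m, N))
            x = np.zeros(N)
            support = rng.choice(N, size=s, replace=False)
            x[support] = rng.choice([-1.0, 1.0], size=s)      # Rademacher entries
            x_hat = basis_pursuit(A, A @ x)
            hits += np.linalg.norm(x_hat - x) < tol * np.linalg.norm(x)
        rates.append(hits / trials)
    return list(m_values), rates
```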

The outlined recovery result extends from Gaussian random matrices to the more practical situation encountered in sampling theory. Here, assuming that a function of interest has a sparse expansion in a suitable orthogonal system (in trigonometric monomials, say), it can be recovered from a small number of randomly chosen samples (point evaluations) via ℓ1-minimization or several other methods. This connection to sampling theory explains the alternative name compressive sampling.

Stability. Compressive sensing features another crucial aspect, namely, its reconstruction algorithms are stable. This means that the reconstruction error stays under control when the vectors are not exactly sparse and when the measurements \(\mathbf{y}\) are slightly inaccurate. In this situation, one may, for instance, solve the quadratically constrained ℓ1-minimization problem
$$\displaystyle{ \text{minimize}\,\|\mathbf{z}\|_{1}\quad \mbox{ subject to }\|\mathbf{A}\mathbf{z} -\mathbf{y}\|_{2} \leq \eta. }$$
(1.4)
Without the stability requirement, the compressive sensing problem would be swiftly resolved and would not present much interest since most practical applications involve noise and compressibility rather than sparsity.
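Problem (1.4) is a second-order cone program. Assuming a convex modeling package such as CVXPY is available (an assumption about software, not something the book prescribes), it can be stated almost verbatim:

```python
# Quadratically constrained l1-minimization (1.4), stated with CVXPY.
# Sketch assuming CVXPY is installed; any second-order cone solver applies.
import numpy as np
import cvxpy as cp

def bp_denoise(A, y, eta):
    z = cp.Variable(A.shape[1])
    problem = cp.Problem(cp.Minimize(cp.norm(z, 1)),
                         [cp.norm(A @ z - y, 2) <= eta])
    problem.solve()
    return z.value
```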

1.2 Applications, Motivations, and Extensions

In this section, we highlight a selection of problems that reduce to or can be modeled as the standard compressive sensing problem. We hope to thereby convince the reader of its ubiquity. The variations presented here take different flavors: technological applications (single-pixel camera, magnetic resonance imaging, radar), scientific motivations (sampling theory, sparse approximation, error correction, statistics and machine learning), and theoretical extensions (low-rank recovery, matrix completion). We do not delve into the technical details that would be necessary for a total comprehension. Instead, we adopt an informal style and we focus on the description of an idealized mathematical model. Pointers to references treating the details in much more depth are given in the Notes section concluding the chapter.

1.2.1 Single-Pixel Camera

Compressive sensing techniques are implemented in a device called the single-pixel camera. The idea is to correlate in hardware a real-world image with independent realizations of Bernoulli random vectors and to measure these correlations (inner products) on a single pixel. It suffices to measure only a small number of such random inner products in order to reconstruct images via sparse recovery methods.

For the purpose of this exposition, images are represented via gray values of pixels collected in the vector \(\mathbf{z} \in {\mathbb{R}}^{N}\), where \(N = N_{1}N_{2}\) and \(N_{1}, N_{2}\) denote the width and height of the image in pixels. Images are not usually sparse in the canonical (pixel) basis, but they are often sparse after a suitable transformation, for instance, a wavelet transform or discrete cosine transform. This means that one can write \(\mathbf{z} = \mathbf{W}\mathbf{x}\), where \(\mathbf{x} \in {\mathbb{R}}^{N}\) is a sparse or compressible vector and \(\mathbf{W} \in {\mathbb{R}}^{N\times N}\) is a unitary matrix representing the transform.

The crucial ingredient of the single-pixel camera is a microarray consisting of a large number of small mirrors that can be turned on or off individually. The light from the image is reflected on this microarray and a lens combines all the reflected beams in one sensor, the single pixel of the camera; see Fig. 1.5. Depending on a small mirror being switched on or off, it contributes or not to the light intensity measured at the sensor. In this way, one realizes in hardware the inner product \(\langle \mathbf{z},\mathbf{b}\rangle\) of the image \(\mathbf{z}\) with a vector \(\mathbf{b}\) containing ones at the locations corresponding to switched-on mirrors and zeros elsewhere. In turn, one can also realize inner products with vectors \(\mathbf{a}\) containing only the entries +1 and −1 by defining two auxiliary vectors \({\mathbf{b}}^{1},{\mathbf{b}}^{2} \in \{0,1\}^{N}\) via
$$\displaystyle{b_{j}^{1} = \left \{\begin{array}{ll} 1&\mbox{ if }a_{j} = 1, \\ 0&\mbox{ if }a_{j} = -1, \end{array} \right.\quad b_{j}^{2} = \left \{\begin{array}{ll} 1&\mbox{ if }a_{j} = -1, \\ 0&\mbox{ if }a_{j} = 1, \end{array} \right.}$$
so that \(\langle \mathbf{z},\mathbf{a}\rangle =\langle \mathbf{z},{\mathbf{b}}^{1}\rangle -\langle \mathbf{z},{\mathbf{b}}^{2}\rangle\). Choosing vectors \(\mathbf{a}_{1},\ldots,\mathbf{a}_{m}\) independently at random with entries taking the values ±1 with equal probability, the measured intensities \(y_{\ell} =\langle \mathbf{z},\mathbf{a}_{\ell}\rangle\) are inner products with independent Bernoulli vectors. Therefore, we have \(\mathbf{y} = \mathbf{A}\mathbf{z}\) for a (random) Bernoulli matrix \(\mathbf{A} \in {\mathbb{R}}^{m\times N}\) whose action on the image \(\mathbf{z}\) has been realized in hardware. Recalling that \(\mathbf{z} = \mathbf{W}\mathbf{x}\) and writing \(\mathbf{A}^{\prime} = \mathbf{A}\mathbf{W}\) yield the system
$$\displaystyle{\mathbf{y} = \mathbf{A}^{\prime}\mathbf{x},}$$
where the vector \(\mathbf{x}\) is sparse or compressible. In this situation, the measurements are taken sequentially, and since this process may be time-consuming, it is desirable to use only few measurements. Thus, we have arrived at the standard compressive sensing problem. The latter allows for the reconstruction of \(\mathbf{x}\) from \(\mathbf{y}\), and finally the image is deduced as \(\mathbf{z} = \mathbf{W}\mathbf{x}\). We will justify in  Chap. 9 the validity of accurate reconstruction from \(m \geq Cs\ln (N/s)\) measurements for images that are (approximately) s-sparse in some transform domain.
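A toy simulation of this measurement scheme, sketched below, illustrates how each ±1 inner product is realized by two 0/1 mirror patterns; it is of course not the actual camera pipeline.

```python
# Toy simulation of single-pixel measurements: each +/-1 test vector a is
# realized by two 0/1 mirror patterns b1, b2 with <z,a> = <z,b1> - <z,b2>.
import numpy as np

rng = np.random.default_rng(1)
N = 64 * 64                       # number of pixels of the vectorized image z
m = 800                           # number of single-pixel measurements
z = rng.random(N)                 # placeholder image; in practice z = W x is compressible

y = np.empty(m)
A = np.empty((m, N))
for ell in range(m):
    a = rng.choice([-1.0, 1.0], size=N)   # Bernoulli test vector
    b1 = (a == 1).astype(float)           # mirrors switched on where a_j = +1
    b2 = (a == -1).astype(float)          # mirrors switched on where a_j = -1
    y[ell] = z @ b1 - z @ b2              # two intensity readings give <z, a>
    A[ell] = a
assert np.allclose(y, A @ z)              # hence y = A z with a Bernoulli matrix A
```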
Fig. 1.5

Schematic representation of a single-pixel camera (Image courtesy of Rice University)

Although the single-pixel camera is more a proof of concept than a new trend in camera design, it is quite conceivable that similar devices will be used for different imaging tasks. In particular, for certain wavelengths outside the visible spectrum, it is impossible or at least very expensive to build chips with millions of sensor pixels on an area of only several square millimeters. In such a context, the potential of a technology based on compressive sensing is expected to really pay off.

1.2.2 Magnetic Resonance Imaging

Magnetic resonance imaging (MRI) is a common technology in medical imaging used for various tasks such as brain imaging, angiography (examination of blood vessels), and dynamic heart imaging. In traditional approaches (essentially based on the Shannon sampling theorem), the measurement time to produce high-resolution images can be excessive (several minutes or hours depending on the task) in clinical situations. For instance, heart patients cannot be expected to hold their breath for too long a time, and children are too impatient to sit still for more than about two minutes. In such situations, the use of compressive sensing to achieve high-resolution images based on few samples appears promising.

MRI relies on the interaction of a strong magnetic field with the hydrogen nuclei (protons) contained in the body’s water molecules. A static magnetic field polarizes the spin of the protons resulting in a magnetic moment. Applying an additional radio frequency excitation field produces a precessing magnetization transverse to the static field. The precession frequency depends linearly on the strength of the magnetic field. The generated electromagnetic field can be detected by sensors. Imposing further magnetic fields with a spatially dependent strength, the precession frequency depends on the spatial position as well. Exploiting the fact that the transverse magnetization depends on the physical properties of the tissue (for instance, proton density) allows one to reconstruct an image of the body from the measured signal.

In mathematical terms, we denote the transverse magnetization at position \(\mathbf{z}\,\in \,{\mathbb{R}}^{3}\) by \(X(\mathbf{z}) = \vert X(\mathbf{z})\vert {e}^{-i\phi (\mathbf{z})}\) where \(\vert X(\mathbf{z})\vert \) is the magnitude and \(\phi (\mathbf{z})\) is the phase. The additional possibly time-dependent magnetic field is designed to depend linearly on the position and is therefore called gradient field. Denoting by \(\mathbf{G} \in {\mathbb{R}}^{3}\) the gradient of this magnetic field, the precession frequency (being a function of the position in \({\mathbb{R}}^{3}\)) can be written as
$$\displaystyle{\omega (\mathbf{z}) =\kappa (B +\langle \mathbf{G},\mathbf{z}\rangle ),\quad \mathbf{z} \in {\mathbb{R}}^{3},}$$
where B is the strength of the static field and κ is a physical constant. With a time-dependent gradient \(\mathbf{G}: [0,T] \rightarrow {\mathbb{R}}^{3}\), the magnetization phase \(\phi (\mathbf{z}) =\phi (\mathbf{z},t)\) is the integral
$$\displaystyle{\phi (\mathbf{z},t) = 2\pi \kappa \int _{0}^{t}\langle \mathbf{G}(\tau ),\mathbf{z}\rangle d\tau,}$$
where t = 0 corresponds to the time of the radio frequency excitation. We introduce the function \(\mathbf{k}: [0,T] \rightarrow {\mathbb{R}}^{3}\) defined by
$$\displaystyle{\mathbf{k}(t) =\kappa \int _{ 0}^{t}\mathbf{G}(\tau )d\tau.}$$
The receiver coil integrates over the whole spatial volume and measures the signal
$$\displaystyle{f(t) =\int _{{\mathbb{R}}^{3}}\vert X(\mathbf{z})\vert {e}^{-2\pi i\langle \mathbf{k}(t),\mathbf{z}\rangle }d\mathbf{z} = \mathcal{F}(\vert X\vert )(\mathbf{k}(t)),}$$
where \(\mathcal{F}(\vert X\vert )(\boldsymbol{\xi })\) denotes the three-dimensional Fourier transform of the magnitude | X | of the magnetization. It is also possible to measure slices of a body, in which case the three-dimensional Fourier transform is replaced by a two-dimensional Fourier transform.

In conclusion, the signal measured by the MRI system is the Fourier transform of the spatially dependent magnitude of the magnetization | X | (the image), subsampled on the curve \(\{\mathbf{k}(t): t \in [0,T]\} \subset {\mathbb{R}}^{3}\). By repeating several radio frequency excitations with modified parameters, one obtains samples of the Fourier transform of | X | along several curves \(\mathbf{k}_{1}, \ldots,\mathbf{k}_{L}\) in \({\mathbb{R}}^{3}\). The required measurement time is proportional to the number L of such curves, and we would like to minimize this number L.

A natural discretization represents each volume element (or area element in case of two-dimensional imaging of slices) by a single voxel (or pixel), so that the magnitude of the magnetization | X | becomes a finite-dimensional vector \(\mathbf{x} \in {\mathbb{R}}^{N}\) indexed by \(Q:= [N_{1}] \times [N_{2}] \times [N_{3}]\) with \(N = \mathrm{card}(Q) = N_{1}N_{2}N_{3}\) and \([N_{i}]:=\{ 1,\ldots,N_{i}\}\). After discretizing the curves \(\mathbf{k}_{1}, \ldots,\mathbf{k}_{L}\), too, the measured data become samples of the three-dimensional discrete Fourier transform of \(\mathbf{x}\), i.e.,
$$\displaystyle{(\mathcal{F}\mathbf{x})_{\mathbf{k}} =\sum _{\boldsymbol{\ell}\in Q}x_{\boldsymbol{\ell}}{e}^{-2\pi i\sum _{j=1}^{3}k_{ j}\ell_{j}/N_{j}},\quad \mathbf{k} \in Q.}$$
Let \(K \subset Q\) with card(K) = m denote a subset of the discretized frequency space Q, which is covered by the trajectories \(\mathbf{k}_{1}, \ldots,\mathbf{k}_{L}\). Then the measured data vector \(\mathbf{y}\) corresponds to
$$\displaystyle{\mathbf{y} = \mathbf{R}_{K}\mathcal{F}\mathbf{x} = \mathbf{A}\mathbf{x},}$$
where \(\mathbf{R}_{K}\) is the linear map that restricts a vector indexed by Q to its indices in K. The measurement matrix \(\mathbf{A} = \mathbf{R}_{K}\mathcal{F}\in {\mathbb{C}}^{m\times N}\) is a partial Fourier matrix. In words, the vector \(\mathbf{y}\) collects the samples of the three-dimensional Fourier transform of the discretized image \(\mathbf{x}\) on the set K. Since we would like to use a small number m of samples, we end up with an underdetermined system of equations.
In certain medical imaging applications such as angiography, it is realistic to assume that the image \(\mathbf{x}\) is sparse with respect to the canonical basis, so that we immediately arrive at the standard compressive sensing problem. In the general scenario, the discretized image \(\mathbf{x}\) will be sparse or compressible only after transforming into a suitable domain, using wavelets, for instance—in mathematical terms, we have \(\mathbf{x} = \mathbf{W}\mathbf{x}^{\prime}\) for some unitary matrix \(\mathbf{W} \in {\mathbb{C}}^{N\times N}\) and some sparse vector \(\mathbf{x}^{\prime} \in {\mathbb{C}}^{N}\). This leads to the model
$$\displaystyle{\mathbf{y} = \mathbf{A}^{\prime}\mathbf{x}^{\prime},}$$
with the transformed measurement matrix \(\mathbf{A}^{\prime}\,=\,\mathbf{A}\mathbf{W}\,=\,\mathbf{R}_{K}\mathcal{F}\mathbf{W}\,\in \,{\mathbb{C}}^{m\times N}\) and a sparse or compressible vector \(\mathbf{x}^{\prime}\,\in \,{\mathbb{C}}^{N}\). Again, we arrived at the standard compressive sensing problem.
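The measurement model can be simulated directly with the fast Fourier transform. The following sketch is two-dimensional for simplicity and uses a uniformly random sampling set rather than realizable trajectories; it builds the data \(\mathbf{y} = \mathbf{R}_{K}\mathcal{F}\mathbf{x}\) together with the zero-filled adjoint that most recovery algorithms rely on.

```python
# Partial Fourier measurements y = R_K F x for a (2-D, for simplicity) image
# with a uniformly random sampling set K. Sketch only; real k-space
# trajectories are constrained to smooth curves.
import numpy as np

rng = np.random.default_rng(2)
N1, N2 = 64, 64
x = np.zeros((N1, N2))
x[rng.integers(0, N1, 30), rng.integers(0, N2, 30)] = 1.0   # sparse toy image

m = 1200
flat = rng.choice(N1 * N2, size=m, replace=False)
K = np.unravel_index(flat, (N1, N2))      # sampling set K in the discrete frequency space

y = np.fft.fft2(x)[K]                     # measured data: restriction R_K of F x

# The adjoint A^* y, used by most recovery algorithms, is a zero-filled inverse FFT.
back = np.zeros((N1, N2), dtype=complex)
back[K] = y
x_adjoint = np.fft.ifft2(back) * (N1 * N2)   # adjoint of the unnormalized DFT
```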

The challenge is to determine good sampling sets K with small size that still ensure recovery of sparse images. The theory currently available predicts that sampling sets K chosen uniformly at random among all possible sets of cardinality m work well (at least when \(\mathbf{W}\) is the identity matrix). Indeed, the results of  Chap. 12 guarantee that an s-sparse \(\mathbf{x}^{\prime} \in {\mathbb{C}}^{N}\) can be reconstructed by ℓ1-minimization if \(m \geq Cs\ln N\).

Unfortunately, such random sets K are difficult to realize in practice due to the continuity constraints on the trajectory curves \(\mathbf{k}_{1}, \ldots,\mathbf{k}_{L}\). Therefore, good realizable sets K are investigated empirically. One option that seems to work well takes the trajectories as parallel lines in \({\mathbb{R}}^{3}\) whose intersections with a coordinate plane are chosen uniformly at random. This gives some sort of approximation to the case where K is “completely” random. Other choices such as perturbed spirals are also possible.

Figure 1.6 shows a comparison of a traditional MRI reconstruction technique with reconstruction via compressive sensing. The compressive sensing reconstruction has much better visual quality and resolves some clinically important details, which are not visible in the traditional reconstruction at all.
Fig. 1.6

Comparison of a traditional MRI reconstruction (left) and a compressive sensing reconstruction (right). The pictures show a coronal slice through an abdomen of a 3-year-old pediatric patient following an injection of a contrast agent. The image size was set to 320 ×256 ×160 voxels. The data were acquired using a 32-channel pediatric coil. The acquisition was accelerated by a factor of 7.2 by random subsampling of the frequency domain. The left image is a traditional linear reconstruction showing severe artifacts. The right image, a (wavelet-based) compressive sensing reconstruction, exhibits diagnostic quality and significantly reduced artifacts. The subtle features indicated with arrows show well on the compressive sensing reconstruction, while almost disappearing in the traditional one (Image courtesy of Michael Lustig, Stanford University, and Shreyas Vasanawala, Lucile Packard Children’s Hospital, Stanford University)

1.2.3 Radar

Compressive sensing can be applied to several radar frameworks. In the one presented here, an antenna sends out a properly designed electromagnetic wave—the radar pulse—which is scattered at objects in the surrounding environment, for instance, airplanes in the sky. A receive antenna then measures an electromagnetic signal resulting from the scattered waves. Based on the delay of the received signal, one can determine the distance of an object, and the Doppler effect allows one to deduce its speed with respect to the direction of view; see Fig. 1.7 for an illustration.
Fig. 1.7

Schematic illustration of a radar device measuring distances and velocities of objects

Let us describe a simple finite-dimensional model for this scenario. We denote by \((\mathbf{T}_{k}\mathbf{z})_{j} = z_{j-k\ \text{mod}\ m}\) the cyclic translation operator on \({\mathbb{C}}^{m}\) and by \((\mathbf{M}_{\ell}\mathbf{z})_{j} = {e}^{2\pi i\ell j/m}z_{j}\) the modulation operator on \({\mathbb{C}}^{m}\). The map transforming the sent signal to the received signal—also called channel—can be expressed as
$$\displaystyle{\mathbf{B} =\sum _{(k,\ell)\in {[m]}^{2}}x_{k,\ell}\mathbf{T}_{k}\mathbf{M}_{\ell},}$$
where the translations correspond to delay and the modulations to Doppler effect. The vector \(\mathbf{x} = (x_{k,\ell})\) characterizes the channel. A nonzero entry \(x_{k,\ell}\) occurs if there is a scattering object present in the surroundings with distance and speed corresponding to the shift \(\mathbf{T}_{k}\) and modulation \(\mathbf{M}_{\ell}\). Only a limited number of scattering objects are usually present, which translates into the sparsity of the coefficient vector \(\mathbf{x}\). The task is now to determine \(\mathbf{x}\) and thereby to obtain information about scatterers in the surroundings by probing the channel with a suitable known radio pulse, modeled in this finite-dimensional setup by a vector \(\mathbf{g} \in {\mathbb{C}}^{m}\). The received signal \(\mathbf{y}\) is given by
$$\displaystyle{\mathbf{y} = \mathbf{B}\mathbf{g} =\sum _{(k,\ell)\in {[m]}^{2}}x_{k,\ell}\mathbf{T}_{k}\mathbf{M}_{\ell}\mathbf{g} = \mathbf{A}_{\mathbf{g}}\mathbf{x},}$$
where the \({m}^{2}\) columns of the measurement matrix \(\mathbf{A}_{\mathbf{g}} \in {\mathbb{C}}^{m\times {m}^{2} }\) are equal to \(\mathbf{T}_{k}\mathbf{M}_{\ell}\mathbf{g}\), \((k,\ell) \in {[m]}^{2}\). Recovering \(\mathbf{x} \in {\mathbb{C}}^{{m}^{2} }\) from the measured signal \(\mathbf{y}\) amounts to solving an underdetermined linear system. Taking the sparsity of \(\mathbf{x}\) into consideration, we arrive at the standard compressive sensing problem. The associated reconstruction algorithms, including ℓ1-minimization, apply.
It remains to find suitable radio pulse sequences \(\mathbf{g} \in {\mathbb{C}}^{m}\) ensuring that \(\mathbf{x}\) can be recovered from \(\mathbf{y} = \mathbf{B}\mathbf{g}\). A popular choice of \(\mathbf{g}\) is the so-called Alltop vector, which is defined for prime m ≥ 5 as
$$\displaystyle{g_{\ell} = {e}^{2\pi {i\ell}^{3}/m },\quad \ell \in [m].}$$
We refer to  Chap. 5 for more details and to Fig. 1.8 for a numerical example.
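The measurement matrix \(\mathbf{A}_{\mathbf{g}}\) is easy to assemble explicitly. The following sketch (not the book's code) builds its m² columns \(\mathbf{T}_{k}\mathbf{M}_{\ell}\mathbf{g}\) from the Alltop vector:

```python
# Assembling the radar measurement matrix A_g whose m^2 columns are the
# translated and modulated copies T_k M_l g of the probe signal g.
# Sketch; g is taken to be the Alltop vector.
import numpy as np

def radar_matrix(g):
    m = len(g)
    cols = []
    for k in range(m):                    # delay: cyclic translation by k
        for ell in range(m):              # Doppler: modulation by frequency ell
            mod = np.exp(2j * np.pi * ell * np.arange(m) / m) * g
            cols.append(np.roll(mod, k))  # the column T_k M_ell g
    return np.column_stack(cols)          # shape (m, m^2)

m = 59                                            # a prime >= 5
g = np.exp(2j * np.pi * np.arange(m) ** 3 / m)    # Alltop vector
A_g = radar_matrix(g)                             # y = A_g x is the received signal
```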
Fig. 1.8

Top left: original 7-sparse coefficient vector (m = 59) in the translation–modulation (delay-Doppler) plane. Top right: reconstruction by ℓ1-minimization using the Alltop window. Bottom: for comparison, the reconstruction by traditional ℓ2-minimization

Although the Alltop window works well in practice, the theoretical guarantees currently available are somewhat limited due to the fact that \(\mathbf{g}\) is deterministic. As an alternative consistent with the general philosophy of compressive sensing, one can choose \(\mathbf{g} \in {\mathbb{C}}^{m}\) at random, for instance, as a Bernoulli vector with independent ±1 entries. In this case, it is known that an s-sparse vector \(\mathbf{x} \in {\mathbb{C}}^{{m}^{2} }\) can be recovered from \(\mathbf{y} = \mathbf{B}\mathbf{g} \in {\mathbb{C}}^{m}\) provided \(s \leq Cm/\ln m\). More information can be found in the Notes section of  Chap. 12.

1.2.4 Sampling Theory

Reconstructing a continuous-time signal from a discrete set of samples is an important task in many technological and scientific applications. Examples include image processing, sensor technology in general, and analog-to-digital conversion appearing, for instance, in audio entertainment systems or mobile communication devices. Currently, most sampling techniques rely on the Shannon sampling theorem, which states that a function of bandwidth B has to be sampled at the rate 2B in order to ensure reconstruction.

In mathematical terms, the Fourier transform of a continuous-time signal \(f \in {L}^{1}(\mathbb{R})\) (meaning that \(\int _{\mathbb{R}}\vert f(t)\vert dt < \infty \)) is defined by
$$\displaystyle{\hat{f}(\xi ) =\int _{\mathbb{R}}f(t){e}^{-2\pi it\xi }dt,\quad \xi \in \mathbb{R}.}$$
We say that f is bandlimited with bandwidth B if \(\hat{f}\) is supported in [ − B, B]. The Shannon sampling theorem states that such f can be reconstructed from its discrete set of samples \(\{f(k/(2B)),k \in \mathbb{Z}\}\) via the formula
$$\displaystyle{ f(t) =\sum _{k\in \mathbb{Z}}f\left ( \frac{k} {2B}\right )\mathrm{sinc}(2\pi Bt -\pi k), }$$
(1.5)
where the sinc function is given by
$$\displaystyle{\mathrm{sinc}(t) = \left \{\begin{array}{cc} \dfrac{\sin t} {t} & \mbox{ if }t\neq 0, \\ 1 &\mbox{ if }t = 0. \end{array} \right.}$$
To facilitate a comparison with compressive sensing, we also formulate the Shannon sampling theorem in a finite-dimensional setting. We consider trigonometric polynomials of maximal degree M, i.e., functions of the type
$$\displaystyle{ f(t) =\sum _{ k=-M}^{M}x_{ k}{e}^{2\pi ikt},\quad t \in [0,1]. }$$
(1.6)
The degree M serves as a substitute for the bandwidth B. Since the space of trigonometric polynomials of maximal degree M has dimension N = 2M + 1, it is expected that f can be reconstructed from N = 2M + 1 samples. Indeed, Theorem C.1 in the appendix states that
$$\displaystyle{f(t) = \frac{1} {2M + 1}\sum _{k=0}^{2M}f\left ( \frac{k} {2M + 1}\right )D_{M}\left (t - \frac{k} {2M + 1}\right ),\quad t \in [0,1],}$$
where the Dirichlet kernel D M is given by
$$\displaystyle{D_{M}(t) =\sum _{ k=-M}^{M}{e}^{2\pi ikt} = \left \{\begin{array}{cc} \dfrac{\sin (\pi (2M + 1)t)} {\sin (\pi t)} & \mbox{ if }t\neq 0, \\ 2M + 1 &\mbox{ if }t = 0. \end{array} \right.}$$
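This interpolation formula can be verified numerically; the short sketch below checks it for a random trigonometric polynomial of small degree.

```python
# Numerical check of the interpolation formula: a trigonometric polynomial of
# degree M is reproduced exactly from its 2M+1 equispaced samples. Sketch.
import numpy as np

rng = np.random.default_rng(8)
M = 6
k = np.arange(-M, M + 1)
x = rng.standard_normal(2 * M + 1) + 1j * rng.standard_normal(2 * M + 1)
f = lambda t: np.sum(x * np.exp(2j * np.pi * k * np.atleast_2d(t).T), axis=1)

def dirichlet(t, M):
    t = np.asarray(t, dtype=float)
    small = np.isclose(np.sin(np.pi * t), 0.0)            # t close to an integer
    safe = np.where(small, 1.0, np.sin(np.pi * t))
    return np.where(small, 2 * M + 1, np.sin(np.pi * (2 * M + 1) * t) / safe)

nodes = np.arange(2 * M + 1) / (2 * M + 1)
t = np.linspace(0, 1, 200)
f_rec = sum(f(nodes[j]) * dirichlet(t - nodes[j], M)
            for j in range(2 * M + 1)) / (2 * M + 1)
assert np.allclose(f_rec, f(t))                           # exact reproduction
```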
For dimensionality reasons, it is not possible to reconstruct trigonometric polynomials of maximal degree M from fewer than N = 2M + 1 samples. In practice, however, the required degree M may be large; hence, the number of samples must be large, too—sometimes significantly larger than realistic. So the question arises whether the required number of samples can be reduced by exploiting additional assumptions. Compressibility in the Fourier domain, for instance, is a reasonable assumption in many practical scenarios. In fact, if the vector \(\mathbf{x} \in {\mathbb{C}}^{N}\) of Fourier coefficients of f in (1.6) is sparse (or compressible), then few samples do suffice for exact (or approximate) reconstruction.
Precisely, given a set \(\{t_{1}, \ldots,t_{m}\} \subset [0,1]\) of m sampling points, we can write the vector \(\mathbf{y} = (f(t_{\ell}))_{\ell=1}^{m}\) as
$$\displaystyle{ \mathbf{y} = \mathbf{A}\mathbf{x} }$$
(1.7)
where \(\mathbf{A} \in {\mathbb{C}}^{m\times N}\) is a Fourier-type matrix with entries
$$\displaystyle{A_{\ell,k} = {e}^{2\pi ikt_{\ell}},\quad \ell = 1, \ldots,m,\quad k = -M, \ldots,M.}$$
The problem of recovering f from its vector \(\mathbf{y}\) of m samples reduces to finding the coefficient vector \(\mathbf{x}\). This amounts to solving the linear system (1.7), which is underdetermined when m < N. With the sparsity assumption, we arrive at the standard compressive sensing problem. A number of recovery algorithms, including ℓ1-minimization, can then be applied. A crucial question now concerns the choice of sampling points. As indicated before, randomness helps. In fact, we will see in  Chap. 12 that choosing the sampling points \(t_{1},\ldots,t_{m}\) independently and uniformly at random in [0, 1] allows one to reconstruct f with high probability from its m samples \(f(t_{1}), \ldots,f(t_{m})\) provided that \(m \geq Cs\ln (N)\). Thus, few samples suffice if s is small. An illustrating example was already displayed in Figs. 1.2 and 1.3.
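Setting up the Fourier-type matrix (1.7) for random sampling points takes only a few lines; the sketch below generates a sparse trigonometric polynomial, its random samples, and the matrix \(\mathbf{A}\). Recovering \(\mathbf{x}\) from this data requires a complex-capable ℓ1 solver, for instance the CVXPY formulation sketched earlier.

```python
# Random sampling of a sparse trigonometric polynomial: the samples
# y_l = f(t_l) depend linearly on the Fourier coefficients x through the
# matrix A_{l,k} = exp(2*pi*i*k*t_l) of (1.7). Sketch only.
import numpy as np

rng = np.random.default_rng(3)
M = 50
N = 2 * M + 1                       # dimension of the space of trigonometric polynomials
s, m = 5, 30                        # sparsity and number of random samples

k = np.arange(-M, M + 1)
x = np.zeros(N, dtype=complex)
support = rng.choice(N, size=s, replace=False)
x[support] = rng.standard_normal(s) + 1j * rng.standard_normal(s)

t = rng.random(m)                             # sampling points, uniform on [0, 1]
A = np.exp(2j * np.pi * np.outer(t, k))       # Fourier-type matrix of (1.7)
y = A @ x                                     # the m observed samples f(t_1), ..., f(t_m)
```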

1.2.5 Sparse Approximation

Compressive sensing builds on the empirical observation that many types of signals can be approximated by sparse ones. In this sense, compressive sensing can be seen as a subfield of sparse approximation. There is a specific problem in sparse approximation similar to the standard compressive sensing problem of recovering a sparse vector \(\mathbf{x} \in {\mathbb{C}}^{N}\) from the incomplete information \(\mathbf{y} = \mathbf{A}\mathbf{x} \in {\mathbb{C}}^{m}\) with m < N.

Suppose that a vector \(\mathbf{y} \in {\mathbb{C}}^{m}\) (usually a signal or an image in applications) is to be represented as a linear combination of prescribed elements \(\mathbf{a}_{1},\ldots,\mathbf{a}_{N} \in {\mathbb{C}}^{m}\) such that \(\mathrm{span}\{\mathbf{a}_{1},\ldots,\mathbf{a}_{N}\} = {\mathbb{C}}^{m}\). The system \((\mathbf{a}_{1},\ldots,\mathbf{a}_{N})\) is often called a dictionary. Note that this system may be linearly dependent (redundant) since we allow N > m. Redundancy may be desired when linear independence is too restrictive. For instance, in time–frequency analysis, bases of time–frequency shift elements are only possible if the generator has poor time–frequency concentration—this is the Balian–Low theorem. Unions of several bases are also of interest. In such situations, a representation \(\mathbf{y} =\sum _{ j=1}^{N}x_{j}\mathbf{a}_{j}\) is not unique. Traditionally, one removes this drawback by considering a representation with the smallest number of terms, i.e., a sparsest representation.

Let us now form the matrix \(\mathbf{A} \in {\mathbb{C}}^{m\times N}\) with columns \(\mathbf{a}_{1},\ldots,\mathbf{a}_{N}\). Finding the sparsest representation of \(\mathbf{y}\) amounts to solving
$$\displaystyle{\mathrm{minimize}\,\|\mathbf{z}\|_{0}\quad \mbox{ subject to }\mathbf{A}\mathbf{z} = \mathbf{y}.}$$
(P0)
If we tolerate a representation error η, then one considers the slightly modified optimization problem
$$\displaystyle{\mathrm{minimize}\,\|\mathbf{z}\|_{0}\quad \mbox{ subject to }\|\mathbf{A}\mathbf{z} -\mathbf{y}\|_{2} \leq \eta.}$$
(P0,η)
The problem (P0) is the same as the one encountered in the previous section. Both optimization problems (P0) and (P0,η) are NP-hard in general, but all the algorithmic approaches presented in this book for the standard compressive sensing problem, including ℓ1-minimization, may be applied in this context to overcome the computational bottleneck. The conditions on \(\mathbf{A}\) ensuring exact or approximate recovery of the sparsest vector \(\mathbf{x}\), which will be derived in Chaps. 4, 5, and 6, remain valid.

There are, however, some differences in philosophy compared to the compressive sensing problem. In the latter, one is often free to design the matrix \(\mathbf{A}\) with appropriate properties, while \(\mathbf{A}\) is usually prescribed in the context of sparse approximation. In particular, it is not realistic to rely on randomness as in compressive sensing. Since it is hard to verify the conditions ensuring sparse recovery in the optimal parameter regime (m linear in s up to logarithmic factors), the theoretical guarantees fall short of the ones encountered for random matrices. An exception to this rule of thumb will be covered in  Chap. 14 where recovery guarantees are obtained for randomly chosen signals.

The second difference between sparse approximation and compressive sensing appears in the targeted error estimates. In compressive sensing, one is interested in the error \(\|\mathbf{x} -{\mathbf{x}}^{\sharp }\|\) at the coefficient level, where \(\mathbf{x}\) and \({\mathbf{x}}^{\sharp }\) are the original and reconstructed coefficient vectors, respectively, while in sparse approximation, the goal is to approximate a given \(\mathbf{y}\) with a sparse expansion \({\mathbf{y}}^{\sharp } =\sum _{j}x_{j}^{\sharp }\mathbf{a}_{j}\), so one is rather interested in \(\|\mathbf{y} -{\mathbf{y}}^{\sharp }\|\). An estimate for \(\|\mathbf{x} -{\mathbf{x}}^{\sharp }\|\) often yields an estimate for \(\|\mathbf{y} -{\mathbf{y}}^{\sharp }\| =\| \mathbf{A}(\mathbf{x} -{\mathbf{x}}^{\sharp })\|\), but the converse is not generally true.

Finally, we briefly describe some signal and image processing applications of sparse approximation.
  • Compression. Suppose that we have found a sparse approximation \(\hat{\mathbf{y}} = \mathbf{A}\hat{\mathbf{x}}\) of a signal \(\mathbf{y}\) with a sparse vector \(\hat{\mathbf{x}}\). Then storing \(\hat{\mathbf{y}}\) amounts to storing only the nonzero coefficients of \(\hat{\mathbf{x}}\). Since \(\hat{\mathbf{x}}\) is sparse, significantly less memory is required than for storing the entries of the original signal \(\mathbf{y}\).

  • Denoising. Suppose that we observe a noisy version \(\tilde{\mathbf{y}} = \mathbf{y} + \mathbf{e}\) of a signal \(\mathbf{y}\), where \(\mathbf{e}\) represents a noise vector with \(\|\mathbf{e}\| \leq \eta\). The task is then to remove the noise and to recover a good approximation of the original signal \(\mathbf{y}\). In general, if nothing is known about \(\mathbf{y}\), this problem becomes ill-posed. However, assuming that \(\mathbf{y}\) can be well represented by a sparse expansion, a reasonable approach consists in taking a sparse approximation of \(\tilde{\mathbf{y}}\). More precisely, we ideally choose the solution \(\hat{\mathbf{x}}\) of the ℓ0-minimization problem (P0,η) with \(\mathbf{y}\) replaced by the known signal \(\tilde{\mathbf{y}}\). Then we form \(\hat{\mathbf{y}} = \mathbf{A}\hat{\mathbf{x}}\) as the denoised version of \(\mathbf{y}\). For a computationally tractable approach, one replaces the NP-hard problem (P0,η) by one of the compressive sensing (sparse approximation) algorithms, for instance, the ℓ1-minimization variant (1.4) which takes noise into account, or the so-called basis pursuit denoising problem
    $$\displaystyle{\mathrm{minimize\;}\lambda \|\mathbf{z}\|_{1} +\| \mathbf{A}\mathbf{z} -\mathbf{y}\|_{2}^{2}.}$$
  • Data Separation. Suppose that a vector \(\mathbf{y} \in {\mathbb{C}}^{m}\) is the composition of two (or more) components, say \(\mathbf{y} = \mathbf{y}_{1} + \mathbf{y}_{2}\). Given \(\mathbf{y}\), we wish to extract the unknown vectors \(\mathbf{y}_{1},\mathbf{y}_{2} \in {\mathbb{C}}^{m}\). This problem appears in several signal processing tasks. For instance, astronomers would like to separate point structures (stars, galaxy clusters) from filaments in their images. Similarly, an audio processing task consists in separating harmonic components (pure sinusoids) from short peaks.

    Without additional assumption, this separation problem is ill-posed. However, if both components \(\mathbf{y}_{1}\) and \(\mathbf{y}_{2}\) have sparse representations in dictionaries \((\mathbf{a}_{1},\ldots,\mathbf{a}_{N_{1}})\) and \((\mathbf{b}_{1},\ldots,\mathbf{b}_{N_{2}})\) of different nature (for instance, sinusoids and spikes), then the situation changes. We can then write
    $$\displaystyle{\mathbf{y} =\sum _{ j=1}^{N_{1} }x_{1,j}\mathbf{a}_{j} +\sum _{ j=1}^{N_{2} }x_{2,j}\mathbf{b}_{j} = \mathbf{A}\mathbf{x},}$$
    where the matrix \(\mathbf{A} \in {\mathbb{C}}^{m\times (N_{1}+N_{2})}\) has columns \(\mathbf{a}_{1},\ldots,\mathbf{a}_{N_{1}},\mathbf{b}_{1},\ldots,\mathbf{b}_{N_{2}}\) and the vector \(\mathbf{x} = {[x_{1,1},\ldots,x_{1,N_{1}},x_{2,1},\ldots,x_{2,N_{2}}]}^{\top }\) is sparse. The compressive sensing methodology then allows one—under certain conditions—to determine the coefficient vector \(\mathbf{x}\), hence to derive the two components \(\mathbf{y}_{1} =\sum _{ j=1}^{N_{1}}x_{1,j}\mathbf{a}_{j}\) and \(\mathbf{y}_{2} =\sum _{ j=1}^{N_{2}}x_{2,j}\mathbf{b}_{j}\).

1.2.6 Error Correction

In every realistic data transmission device, pieces of data are occasionally corrupted. To overcome this unavoidable issue, one designs schemes for the correction of such errors provided they do not occur too often.

Suppose that we have to transmit a vector \(\mathbf{z} \in {\mathbb{R}}^{n}\). A standard strategy is to encode it into a vector \(\mathbf{v} = \mathbf{B}\mathbf{z} \in {\mathbb{R}}^{N}\) of length N = n + m, where \(\mathbf{B} \in {\mathbb{R}}^{N\times n}\). Intuitively, the redundancy in \(\mathbf{B}\) (due to N > n) should help in identifying transmission errors. The number m reflects the amount of redundancy.

Assume that the receiver measures \(\mathbf{w} = \mathbf{v} + \mathbf{x} \in {\mathbb{R}}^{N}\), where \(\mathbf{x}\) represents transmission error. The assumption that transmission errors do not occur too often translates into the sparsity of \(\mathbf{x}\), say \(\|\mathbf{x}\|_{0} \leq s\). For decoding, we construct a matrix \(\mathbf{A} \in {\mathbb{R}}^{m\times N}\)—called generalized checksum matrix—such that \(\mathbf{A}\mathbf{B} = \mathbf{0}\), i.e., all rows of \(\mathbf{A}\) are orthogonal to all columns of \(\mathbf{B}\). We then form the generalized checksum
$$\displaystyle{\mathbf{y} = \mathbf{A}\mathbf{w} = \mathbf{A}(\mathbf{v} + \mathbf{x}) = \mathbf{A}\mathbf{B}\mathbf{z} + \mathbf{A}\mathbf{x} = \mathbf{A}\mathbf{x}.}$$
We arrived at the standard compressive sensing problem with the matrix \(\mathbf{A}\) and the sparse error vector \(\mathbf{x}\). Under suitable conditions, the methodology described in this book allows one to recover \(\mathbf{x}\) and in turn the original transmit vector \(\mathbf{v} = \mathbf{w} -\mathbf{x}\). Then one solves the overdetermined system \(\mathbf{v} = \mathbf{B}\mathbf{z}\) to derive the data vector \(\mathbf{z}\).

For concreteness of the scheme, we may choose a matrix \(\mathbf{A} \in {\mathbb{R}}^{m\times N}\) as a suitable compressive sensing matrix, for instance, a Gaussian random matrix. Then we select the matrix \(\mathbf{B} \in {\mathbb{R}}^{N\times n}\) with n + m = N in such a way that its columns span the orthogonal complement of the row space of \(\mathbf{A}\), thus guaranteeing that \(\mathbf{A}\mathbf{B} = \mathbf{0}\). With these choices, we are able to correct a number s of transmission errors as large as \(Cm/\ln (N/m)\).
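The whole scheme fits into a short simulation. The sketch below assumes the basis_pursuit routine sketched earlier and uses scipy.linalg.null_space to produce \(\mathbf{B}\) with \(\mathbf{A}\mathbf{B} = \mathbf{0}\).

```python
# Error-correcting scheme: encode z as v = B z, compute the generalized
# checksum y = A w of the corrupted vector w, recover the sparse corruption x
# by l1-minimization, and decode. Sketch assuming the basis_pursuit routine
# sketched earlier.
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(4)
n, m = 100, 80
N = n + m
A = rng.standard_normal((m, N))                      # generalized checksum matrix (Gaussian)
B = null_space(A)                                    # N x n matrix with A B = 0

z = rng.standard_normal(n)                           # data to transmit
v = B @ z                                            # encoded vector of length N
x = np.zeros(N)                                      # sparse corruption
x[rng.choice(N, size=5, replace=False)] = rng.standard_normal(5)
w = v + x                                            # received, corrupted vector

y = A @ w                                            # equals A x since A B z = 0
x_hat = basis_pursuit(A, y)                          # recover the corruption
v_hat = w - x_hat                                    # corrected codeword
z_hat = np.linalg.lstsq(B, v_hat, rcond=None)[0]     # solve the overdetermined v = B z
```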

1.2.7 Statistics and Machine Learning

The goal of statistical regression is to predict an outcome based on certain input data. It is common to choose the linear model
$$\displaystyle{\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{e},}$$
where \(\mathbf{A} \in {\mathbb{R}}^{m\times N}\)—often called design or predictor matrix in this context—collects the input data, \(\mathbf{y}\) collects the output data, and \(\mathbf{e}\) is a random noise vector. The vector \(\mathbf{x}\) is a parameter that has to be estimated from the data. In a statistical framework, the notation (n, p) is generally used instead of (m, N), but we keep the latter for consistency. In a clinical study, e.g., the entries \(A_{j,k}\) in the row associated with the jth patient may refer to blood pressure, weight, height, gene data, concentration of certain markers, etc. The corresponding output \(y_{j}\) would be another quantity of interest, for instance, the probability that the jth patient suffers from a certain disease. Having data for m patients, the regression task is to fit the model, i.e., to determine the parameter vector \(\mathbf{x}\).

In practice, the number N of parameters is often much larger than the number m of observations, so even without noise, the problem of fitting the parameter \(\mathbf{x}\) is ill-posed without further assumption. In many cases, however, only a small number of parameters contribute towards the effect to be predicted, but it is a priori unknown which of these parameters are influential. This leads to sparsity in the vector \(\mathbf{x}\), and again we arrive at the standard compressive sensing problem. In statistical terms, determining a sparse parameter vector \(\mathbf{x}\) corresponds to selecting the relevant explanatory variables, i.e., the support of \(\mathbf{x}\). One also speaks of model selection.

The methods described in this book can be applied in this context, too. Still, there is a slight deviation from our usual setup due to the randomness of the noise vector \(\mathbf{e}\). In particular, instead of the quadratically constrained 1-minimization problem (1.4), one commonly considers the so-called LASSO (least absolute shrinkage and selection operator)
$$\displaystyle{ \mathrm{minimize\;}\|\mathbf{A}\mathbf{z} -\mathbf{y}\|_{2}^{2}\quad \mbox{ subject to }\|\mathbf{z}\|_{ 1} \leq \tau }$$
(1.8)
for an appropriate regularization parameter τ depending on the variance of the noise. Further variants are the Dantzig selector
$$\displaystyle{ \mathrm{minimize\;}\|\mathbf{z}\|_{1}\quad \mbox{ subject to }\|{\mathbf{A}}^{{\ast}}(\mathbf{A}\mathbf{z} -\mathbf{y})\|_{ \infty }\leq \lambda, }$$
(1.9)
or the ℓ1-regularized problem (sometimes also called LASSO or basis pursuit denoising in the literature)
$$\displaystyle{\mathrm{minimize\;}\lambda \|\mathbf{z}\|_{1} +\| \mathbf{A}\mathbf{z} -\mathbf{y}\|_{2}^{2},}$$
again for appropriate choices of λ. We will not deal with the statistical context any further, but we simply mention that near-optimal statistical estimation properties can be shown for both the LASSO and the Dantzig selector under conditions on \(\mathbf{A}\) that are similar to the ones of the following chapters.
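To illustrate how the ℓ1-regularized problem is typically attacked, here is a minimal iterative soft thresholding (ISTA) sketch for the objective \(\lambda \|\mathbf{z}\|_{1} +\| \mathbf{A}\mathbf{z} -\mathbf{y}\|_{2}^{2}\); it is not the estimator of any particular statistics package.

```python
# Iterative soft thresholding (ISTA) for
#   minimize  lam * ||z||_1 + ||A z - y||_2^2.
# Minimal sketch of one standard solver for this objective.
import numpy as np

def ista(A, y, lam, n_iter=500):
    tau = 1.0 / (2 * np.linalg.norm(A, 2) ** 2)   # step size from the gradient's Lipschitz constant
    z = np.zeros(A.shape[1])
    for _ in range(n_iter):
        w = z - tau * 2 * A.T @ (A @ z - y)                       # gradient step on the quadratic term
        z = np.sign(w) * np.maximum(np.abs(w) - lam * tau, 0.0)   # soft thresholding
    return z
```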
A closely related regression problem arises in machine learning. Given random pairs of samples \((t_{j},y_{j})_{j=1}^{m}\), where \(t_{j}\) is some input parameter vector and \(y_{j}\) is a scalar output, one would like to predict the output y for a future input data t. The model relating the output y to the input t is
$$\displaystyle{\mathbf{y} = f(t) + \mathbf{e},}$$
where \(\mathbf{e}\) is random noise. The task is to learn the function f based on the training samples \((t_{j},y_{j})\). Without further hypotheses on f, this is an impossible task. Therefore, we assume that f has a sparse expansion in a given dictionary of functions \(\psi _{1},\ldots,\psi _{N}\), i.e., that f is written as
$$\displaystyle{f(t) =\sum _{ \ell=1}^{N}x_{\ell}\psi _{\ell}(t),}$$
where \(\mathbf{x}\) is a sparse vector. Introducing the matrix \(\mathbf{A} \in {\mathbb{R}}^{m\times N}\) with entries
$$\displaystyle{A_{j,k} =\psi _{k}(t_{j}),}$$
we arrive at the model
$$\displaystyle{\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{e},}$$
and the task is to estimate the sparse coefficient vector \(\mathbf{x}\). This has the same form as the problem described above, and the same estimation procedures including the LASSO and the Dantzig selector apply.
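A minimal version of this learning setup is sketched below. The cosine dictionary, the regularization parameter, and the use of scikit-learn's Lasso are illustrative assumptions, not part of the book's exposition.

```python
# Learning a function with a sparse expansion in a dictionary psi_1, ..., psi_N
# from noisy samples via the LASSO. The cosine dictionary and the use of
# scikit-learn are illustrative assumptions.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
m, N = 100, 400
t = rng.random(m)                                       # training inputs in [0, 1]
psi = lambda k, t: np.cos(np.pi * k * t)                # hypothetical dictionary of cosines

A = np.column_stack([psi(k, t) for k in range(N)])      # design matrix A_{j,k} = psi_k(t_j)
x_true = np.zeros(N)
x_true[[3, 17, 60]] = [1.5, -2.0, 0.7]                  # f has a 3-sparse expansion
y = A @ x_true + 0.05 * rng.standard_normal(m)          # noisy outputs

x_hat = Lasso(alpha=0.01, max_iter=10000).fit(A, y).coef_   # sparse coefficient estimate
f_hat = lambda tt: np.column_stack([psi(k, tt) for k in range(N)]) @ x_hat
```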

1.2.8 Low-Rank Matrix Recovery and Matrix Completion

Let us finally describe an extension of compressive sensing together with some of its applications. Rather than recovering a sparse vector \(\mathbf{x} \in {\mathbb{C}}^{N}\), we now aim at recovering a matrix \(\mathbf{X} \in {\mathbb{C}}^{n_{1}\times n_{2}}\) from incomplete information. Sparsity is replaced by the assumption that \(\mathbf{X}\) has low rank. Indeed, the small complexity of the set of matrices with a given low rank compared to the set of all matrices makes the recovery of such matrices plausible.

For a linear map \(\mathcal{A}: {\mathbb{C}}^{n_{1}\times n_{2}} \rightarrow {\mathbb{C}}^{m}\) with \(m < n_{1}n_{2}\), suppose that we are given the measurement vector
$$\displaystyle{\mathbf{y} = \mathcal{A}(\mathbf{X}) \in {\mathbb{C}}^{m}.}$$
The task is to reconstruct \(\mathbf{X}\) from \(\mathbf{y}\). To stand a chance of success, we assume that \(\mathbf{X}\) has rank at most \(r \ll \min \{ n_{1},n_{2}\}\). The naive approach of solving the optimization problem
$$\displaystyle{\mathrm{minimize\;}\mathrm{rank}(\mathbf{Z})\quad \mbox{ subject to }\mathcal{A}(\mathbf{Z}) = \mathbf{y}}$$
is NP-hard, but an analogy with the compressive sensing problem will help. To illustrate this analogy, we consider the singular value decomposition of \(\mathbf{X}\), i.e.,
$$\displaystyle{\mathbf{X} =\sum _{ \ell=1}^{n}\sigma _{ \ell}\mathbf{u}_{\ell}\mathbf{v}_{\ell}^{{\ast}}.}$$
Here, \(n =\min \{ n_{1},n_{2}\}\), \(\sigma _{1} \geq \sigma _{2} \geq \cdots \geq \sigma _{n} \geq 0\) are the singular values of \(\mathbf{X}\), and \(\mathbf{u}_{\ell} \in {\mathbb{C}}^{n_{1}}\), \(\mathbf{v}_{\ell} \in {\mathbb{C}}^{n_{2}}\) are the left and right singular vectors, respectively. We refer to Appendix A.2 for details. The matrix \(\mathbf{X}\) is of rank r if and only if the vector \(\boldsymbol{\sigma }=\boldsymbol{\sigma } (\mathbf{X})\) of singular values is r-sparse, i.e., \(\mathrm{rank}(\mathbf{X}) =\|\boldsymbol{\sigma } (\mathbf{X})\|_{0}\). Having the ℓ1-minimization approach for compressive sensing in mind, it is natural to introduce the so-called nuclear norm as the ℓ1-norm of the singular values, i.e.,
$$\displaystyle{\|\mathbf{X}\|_{{\ast}} =\|\boldsymbol{\sigma } (\mathbf{X})\|_{1} =\sum _{ \ell=1}^{n}\sigma _{ \ell}(\mathbf{X}).}$$
Then we consider the nuclear norm minimization problem
$$\displaystyle{ \mathrm{minimize\;}\|\mathbf{Z}\|_{{\ast}}\quad \mbox{ subject to }\mathcal{A}(\mathbf{Z}) = \mathbf{y}. }$$
(1.10)
This is a convex optimization problem which can be solved efficiently, for instance, after reformulation as a semidefinite program.
A theory very similar to the recovery of sparse vectors can be developed, and appropriate conditions on \(\mathcal{A}\) ensure exact or approximate recovery via nuclear norm minimization (and other algorithms). Again, random maps \(\mathcal{A}\) turn out to be optimal, and matrices \(\mathbf{X}\) of rank at most r can be recovered from m measurements with high probability provided
$$\displaystyle{m \geq Cr\max \{n_{1},n_{2}\}.}$$
This bound is optimal since the right-hand side corresponds to the number of degrees of freedom required to describe an \(n_{1} \times n_{2}\) matrix of rank r. In contrast to the vector case, there is remarkably no logarithmic factor involved.

As a popular special case, the matrix completion problem seeks to fill in missing entries of a low-rank matrix. Thus, the measurement map \(\mathcal{A}\) samples the entries \(\mathcal{A}(\mathbf{X})_{\ell} = X_{j,k}\) for some indices j, k depending on \(\ell\). This setup appears, for example, in consumer taste prediction. Assume that an (online) store sells products indexed by the rows of the matrix and consumers—indexed by the columns—are able to rate these products. Not every consumer will rate every product, so only a limited number of entries of this matrix are available. For purposes of individualized advertisement, the store is interested in predicting the whole matrix of consumer ratings. Often, if two customers both like some subset of products, then they will also both like or dislike other subsets of products (the “types” of customers are essentially limited). For this reason, it can be assumed that the matrix of ratings has (at least approximately) low rank, which is confirmed empirically. Therefore, methods from low-rank matrix recovery, including the nuclear norm minimization approach, apply in this setup.
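A small matrix completion instance can be solved directly via the nuclear norm relaxation (1.10). The sketch below assumes CVXPY is available; the matrix sizes and the random sampling of entries are illustrative only.

```python
# Matrix completion by nuclear norm minimization (1.10), stated with CVXPY.
# Sketch; sizes and the random sampling of entries are illustrative only.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(6)
n1, n2, r = 30, 30, 2
X = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))   # rank-r matrix

m = 450                                       # number of observed entries
idx = rng.choice(n1 * n2, size=m, replace=False)
M = np.zeros((n1, n2))
M[np.unravel_index(idx, (n1, n2))] = 1.0      # indicator of the observed entries

Z = cp.Variable((n1, n2))
constraints = [cp.multiply(M, Z) == M * X]    # agree with X on the observed entries
cp.Problem(cp.Minimize(cp.normNuc(Z)), constraints).solve()
X_hat = Z.value                               # typically close to X up to solver accuracy
```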

Although certainly interesting, we will not treat low-rank recovery extensively in this book. Nevertheless, due to the close analogy with sparse recovery, the main results are covered in exercises, and the reader is invited to work through them.

1.3 Overview of the Book

Before studying the standard compressive sensing problem on a technical level, it is beneficial to draw a road map of the basic results and solving strategies presented in this book.

As previously revealed, the notions of sparsity and compressibility are at the core of compressive sensing. A vector \(\mathbf{x} \in {\mathbb{C}}^{N}\) is called s-sparse if it has at most s nonzero entries, in other words, if \(\|\mathbf{x}\|_{0}:= \mathrm{card}(\{j: x_{j}\neq 0\})\) is smaller than or equal to s. The notation \(\|\mathbf{x}\|_{0}\) has become customary, even though it does not represent a norm. In practice, one encounters vectors that are not exactly s-sparse but compressible in the sense that they are well approximated by sparse ones. This is quantified by the error of best s-term approximation to \(\mathbf{x}\) given by
$$\displaystyle{\sigma _{s}(\mathbf{x})_{p}:=\inf _{\|\mathbf{z}\|_{0}\leq s}\|\mathbf{x} -\mathbf{z}\|_{p}.}$$
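As a small illustration (a NumPy sketch with an arbitrary example vector), the best s-term approximation simply keeps the s entries of largest magnitude, so \(\sigma_s(\mathbf{x})_p\) is the \(\ell_p\)-norm of the remaining tail:

```python
import numpy as np

def best_s_term_error(x, s, p=1.0):
    """sigma_s(x)_p: the l_p norm of x outside its s largest-magnitude entries."""
    order = np.argsort(np.abs(x))[::-1]          # indices sorted by decreasing magnitude
    tail = x[order[s:]]                          # the best s-term approximation keeps the head
    return np.sum(np.abs(tail) ** p) ** (1.0 / p)

x = np.array([5.0, -3.0, 0.1, 0.05, 0.0, -0.02])
print(np.count_nonzero(x))                       # ||x||_0
print(best_s_term_error(x, s=2, p=1.0))          # small value: x is compressible
```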
Chapter  2 introduces these notions formally, establishes relations to weak \(\ell_p\)-quasinorms, and shows elementary estimates for the error of best s-term approximation, including
$$\displaystyle{ \sigma _{s}(\mathbf{x})_{2} \leq \frac{1} {{s}^{1/p-1/2}}\|\mathbf{x}\|_{p},\quad p \leq 2. }$$
(1.11)
This suggests that unit balls in the \(\ell_p\)-quasinorm for small p ≤ 1 are good models for compressible vectors. We further study the problem of determining the minimal number m of measurements—namely, m = 2s—required (at least in principle) to recover all s-sparse vectors \(\mathbf{x}\) from \(\mathbf{y} = \mathbf{A}\mathbf{x}\) with a matrix \(\mathbf{A} \in {\mathbb{C}}^{m\times N}\). It is remarkable that the actual length N of the vectors \(\mathbf{x}\) does not play any role. The basic recovery procedure associated to this first recovery guarantee is the \(\ell_0\)-minimization, i.e.,
$$\displaystyle{\mathrm{minimize\;}\|\mathbf{z}\|_{0}\quad \mbox{ subject to }\mathbf{A}\mathbf{z} = \mathbf{y}.}$$
We will show in Sect.  2.3 that the \(\ell_0\)-minimization is NP-hard by relating it to the exact cover by 3-sets problem, which is known to be NP-complete. Thus, \(\ell_0\)-minimization is intractable in general, hence useless for practical purposes.

In order to circumvent the computational bottleneck of \(\ell_0\)-minimization, we introduce several tractable alternatives in  Chap. 3. Here, rather than a detailed analysis, we only present some intuitive justification and elementary results for these recovery algorithms. They can be subsumed under roughly three categories: optimization methods, greedy methods, and thresholding-based methods. The optimization approaches include the \(\ell_1\)-minimization (1.2) (also called basis pursuit) and the quadratically constrained \(\ell_1\)-minimization (1.4) (sometimes also called basis pursuit denoising in the literature), which takes potential measurement error into account. These minimization problems can be solved with various methods from convex optimization such as interior-point methods. We will also present specialized numerical methods for \(\ell_1\)-minimization later in  Chap. 15.

Orthogonal matching pursuit is a greedy method that builds up the support set of the reconstructed sparse vector iteratively by adding one index to the current support set at each iteration. The selection process is greedy because the index is chosen to minimize the residual at each iteration. Another greedy method is compressive sampling matching pursuit (CoSaMP). At each iteration, it selects several elements of the support set and then refines this selection.
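To fix ideas, here is a minimal NumPy sketch of orthogonal matching pursuit under the simplifying assumptions that \(\mathbf{A}\) is real and that the number of iterations equals the target sparsity s; the formal description and analysis appear later in the book.

```python
import numpy as np

def orthogonal_matching_pursuit(A, y, s):
    """A minimal sketch of OMP (real case): grow the support one index at a time."""
    m, N = A.shape
    support, residual = [], y.copy()
    x_S = np.zeros(0)
    for _ in range(s):
        j = int(np.argmax(np.abs(A.T @ residual)))               # index most correlated with residual
        support.append(j)
        x_S, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)  # project y onto span of chosen columns
        residual = y - A[:, support] @ x_S                       # update the residual
    x = np.zeros(N)
    x[support] = x_S
    return x
```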

The simple recovery procedure known as basic thresholding determines the support set in one step by choosing the s indices maximizing the correlations \(\vert \langle \mathbf{y},\mathbf{a}_{j}\rangle \vert \) of the measurement vector \(\mathbf{y}\) with the columns of \(\mathbf{A}\). The reconstructed vector is obtained after an orthogonal projection on the span of the corresponding columns. Although this method is very fast, its performance is limited. A more powerful method is iterative hard thresholding. Starting with \({\mathbf{x}}^{0} = \mathbf{0}\), say, it iteratively computes
$$\displaystyle{{\mathbf{x}}^{n+1} = H_{ s}\left ({\mathbf{x}}^{n} +{ \mathbf{A}}^{{\ast}}(\mathbf{y} -\mathbf{A}{\mathbf{x}}^{n})\right ),}$$
where \(H_{s}\) denotes the hard thresholding operator that keeps the s largest absolute entries of a vector and sets the other entries to zero. In the absence of the operator \(H_{s}\), this is well known in the area of inverse problems as the Landweber iteration. Applying \(H_{s}\) ensures sparsity of \({\mathbf{x}}^{n}\) at each iteration. We will finally present the hard thresholding pursuit algorithm, which combines iterative hard thresholding with an orthogonal projection step.
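The iteration above translates directly into code. The following NumPy sketch assumes a fixed number of iterations and a measurement matrix scaled so that the iteration behaves well (roughly, \(\|\mathbf{A}\|_{2\to 2}\) not much larger than 1); this is a simplification of the analysis given later.

```python
import numpy as np

def hard_threshold(z, s):
    """H_s: keep the s largest-magnitude entries of z and set the others to zero."""
    out = np.zeros_like(z)
    keep = np.argsort(np.abs(z))[::-1][:s]
    out[keep] = z[keep]
    return out

def iterative_hard_thresholding(A, y, s, n_iter=200):
    """Iterate x^{n+1} = H_s(x^n + A^*(y - A x^n)), starting from x^0 = 0."""
    x = np.zeros(A.shape[1], dtype=A.dtype)
    for _ in range(n_iter):
        x = hard_threshold(x + A.conj().T @ (y - A @ x), s)
    return x
```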

Chapter  4 is devoted to the analysis of basis pursuit (\(\ell_1\)-minimization). First, we derive conditions for the exact recovery of sparse vectors. The null space property of order s is a necessary and sufficient condition (on the matrix \(\mathbf{A}\)) for the success of exact recovery of all s-sparse vectors \(\mathbf{x}\) from \(\mathbf{y} = \mathbf{A}\mathbf{x}\) via \(\ell_1\)-minimization. It basically requires that every vector in the null space of \(\mathbf{A}\) is far from being sparse. This is natural, since a nonzero vector \(\mathbf{x} \in \ker \mathbf{A}\) cannot be distinguished from the zero vector using \(\mathbf{y} = \mathbf{A}\mathbf{x} = \mathbf{0}\). Next, we refine the null space property—introducing the stable null space property and the robust null space property—to ensure that \(\ell_1\)-recovery is stable under sparsity defect and robust under measurement error. We also derive conditions that ensure the \(\ell_1\)-recovery of an individual sparse vector. These conditions (on the vector \(\mathbf{x}\) and the matrix \(\mathbf{A}\)) are useful in later chapters to establish so-called nonuniform recovery results for randomly chosen measurement matrices. The chapter is brought to an end with two small detours. The first one is a geometric interpretation of conditions for exact recovery. The second one considers low-rank recovery and the nuclear norm minimization (1.10). The success of the latter is shown to be equivalent to a suitable adaptation of the null space property. Further results concerning low-rank recovery are treated in exercises spread throughout the book.

The null space property is not easily verifiable by a direct computation. The coherence, introduced in  Chap. 5, is a much simpler concept to assess the quality of a measurement matrix. For \(\mathbf{A} \in {\mathbb{C}}^{m\times N}\) with \(\ell_2\)-normalized columns \(\mathbf{a}_{1},\ldots,\mathbf{a}_{N}\), it is defined as
$$\displaystyle{\mu:=\max _{j\neq k}\vert \langle \mathbf{a}_{j},\mathbf{a}_{k}\rangle \vert.}$$
We also introduce the \(\ell_1\)-coherence function \(\mu_1\) as a slight refinement of the coherence. Ideally, the coherence μ of a measurement matrix should be small. A fundamental lower bound on μ (a related bound on \(\mu_1\) holds, too) is
$$\displaystyle{\mu \geq \sqrt{ \frac{N - m} {m(N - 1)}}.}$$
For large N, the right-hand side scales like \(1/\sqrt{m}\). The matrices achieving this lower bound are equiangular tight frames. We investigate conditions on m and N for the existence of equiangular tight frames and provide an explicit example of an \(m \times m^2\) matrix (m being prime) with near-minimal coherence. Finally, based on the coherence, we analyze several recovery algorithms, in particular \(\ell_1\)-minimization and orthogonal matching pursuit. For both of them, we obtain a verifiable sufficient condition for the recovery of all s-sparse vectors \(\mathbf{x}\) from \(\mathbf{y} = \mathbf{A}\mathbf{x}\), namely,
$$\displaystyle{(2s - 1)\mu < 1.}$$
Consequently, for a small enough sparsity, the algorithms are able to recover sparse vectors from incomplete information. Choosing a matrix \(\mathbf{A} \in {\mathbb{C}}^{m\times N}\) with near-minimal coherence of order \(c/\sqrt{m}\) (which imposes some mild conditions on N), s-sparse recovery is achievable with m of order \(s^2\). In particular, s-sparse recovery is achievable from incomplete information (\(m \ll N\)) when s is small (\(s \ll \sqrt{N}\)). As already outlined, this can be significantly improved. In fact, we will see in later chapters that the optimal order for m is \(s\ln(N/s)\). But for now the lower bound \(\mu \geq c/\sqrt{m}\) implies that the coherence-based approach relying on \((2s - 1)\mu < 1\) necessitates
$$\displaystyle{ m \geq C{s}^{2}. }$$
(1.12)
This yields a number of measurements that scales quadratically in the sparsity rather than linearly (up to logarithmic factors). However, the coherence-based approach has the advantage of simplicity (the analysis of various recovery algorithms is relatively short) and of the availability of explicit (deterministic) constructions of measurement matrices.
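Since the coherence is so simple, it can be evaluated directly; the sketch below (NumPy, with an arbitrary random example matrix) computes \(\mu\) and the sparsity level covered by the condition \((2s-1)\mu < 1\).

```python
import numpy as np

def coherence(A):
    """mu = max_{j != k} |<a_j, a_k>| for the l2-normalized columns of A."""
    B = A / np.linalg.norm(A, axis=0)       # normalize the columns
    G = np.abs(B.conj().T @ B)              # Gram matrix in absolute value
    np.fill_diagonal(G, 0.0)                # ignore the diagonal (j = k)
    return G.max()

rng = np.random.default_rng(1)
A = rng.standard_normal((64, 256))          # arbitrary example matrix
mu = coherence(A)
s_max = int((1.0 / mu + 1.0) // 2)          # largest s with (2s - 1) mu < 1 (boundary cases aside)
print(mu, s_max)
```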
The concept of restricted isometry property (RIP) proves very powerful in overcoming the quadratic bottleneck (1.12). The restricted isometry constant \(\delta_s\) of a matrix \(\mathbf{A} \in {\mathbb{C}}^{m\times N}\) is defined as the smallest δ ≥ 0 such that
$$\displaystyle{(1-\delta )\|\mathbf{x}\|_{2}^{2} \leq \|\mathbf{A}\mathbf{x}\|_{ 2}^{2} \leq (1+\delta )\|\mathbf{x}\|_{ 2}^{2}\quad \mbox{ for all }s\mbox{ -sparse }\mathbf{x}.}$$
Informally, the matrix \(\mathbf{A}\) is said to possess the RIP if δ s is small for sufficiently large s. The RIP requires all submatrices formed by s columns of \(\mathbf{A}\) to be well conditioned, since \(\mathbf{A}\mathbf{x} = \mathbf{A}_{S}\mathbf{x}_{S}\) whenever \(\mathbf{x} \in {\mathbb{C}}^{N}\) is supported on a set S of size s. Here, \(\mathbf{A}_{S} \in {\mathbb{C}}^{m\times s}\) denotes the submatrix formed with columns of \(\mathbf{A}\) indexed by S and \(\mathbf{x}_{S} \in {\mathbb{C}}^{s}\) denotes the restriction of \(\mathbf{x}\) to S.

Chapter  6 starts with basic results on the restricted isometry constants. For instance, there is the relation \(\delta_2 = \mu\) with the coherence when the columns of \(\mathbf{A}\) are \(\ell_2\)-normalized. In this sense, restricted isometry constants generalize the coherence by considering all s-tuples rather than all pairs of columns. Other relations include the simple (and quite pessimistic) bound \(\delta_s \leq (s - 1)\mu\), which can be derived directly from Gershgorin's disk theorem.
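In contrast to the coherence, computing \(\delta_s\) exactly requires examining all column submatrices, which is only feasible for very small instances; the following brute-force NumPy sketch is meant purely as an illustration of the definition.

```python
import numpy as np
from itertools import combinations

def restricted_isometry_constant(A, s):
    """Brute-force delta_s: max over supports S of ||A_S^* A_S - Id||_{2->2} (tiny N only)."""
    delta = 0.0
    for S in combinations(range(A.shape[1]), s):
        cols = list(S)
        G = A[:, cols].conj().T @ A[:, cols]
        ev = np.linalg.eigvalsh(G)                       # extreme eigenvalues of the Gram matrix
        delta = max(delta, abs(ev[0] - 1.0), abs(ev[-1] - 1.0))
    return delta

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 30)) / np.sqrt(20)          # columns have expected unit norm
print(restricted_isometry_constant(A, s=2))
```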

We then turn to the analysis of the various recovery algorithms based on the restricted isometry property of \(\mathbf{A}\). Typically, under conditions of the type
$$\displaystyle{ \delta _{\kappa s} \leq \delta _{{\ast}} }$$
(1.13)
for some integer κ and some threshold \(\delta_* < 1\) (both depending only on the algorithm), every s-sparse vector \(\mathbf{x}\) is recoverable from \(\mathbf{y} = \mathbf{A}\mathbf{x}\). The table below summarizes the sufficient conditions for basis pursuit, iterative hard thresholding, hard thresholding pursuit, orthogonal matching pursuit, and compressive sampling matching pursuit.
Moreover, the reconstructions are stable when sparsity is replaced by compressibility and robust when measurement error occurs. More precisely, denoting by \({\mathbf{x}}^{\sharp }\) the output of the above algorithms run with \(\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{e}\) and \(\|\mathbf{e}\|_{2} \leq \eta\), the error estimates
$$\displaystyle\begin{array}{rcl} \|\mathbf{x} -{\mathbf{x}}^{\sharp }\|_{ 2}& \leq C\frac{\sigma _{s}(\mathbf{x})_{1}} {\sqrt{s}} + D\eta,&{}\end{array}$$
(1.14)
$$\displaystyle\begin{array}{rcl} \|\mathbf{x} -{\mathbf{x}}^{\sharp }\|_{ 1}& \leq C\sigma _{s}(\mathbf{x})_{1} + D\sqrt{s}\eta,&{}\end{array}$$
(1.15)
hold for all \(\mathbf{x} \in {\mathbb{C}}^{N}\) with absolute constants C, D > 0.

BP:     \(\delta_{2s} < 0.6248\)
IHT:    \(\delta_{3s} < 0.5773\)
HTP:    \(\delta_{3s} < 0.5773\)
OMP:    \(\delta_{13s} < 0.1666\)
CoSaMP: \(\delta_{4s} < 0.4782\)

At the time of writing, finding explicit (deterministic) constructions of matrices satisfying (1.13) in the regime where m scales linearly in s up to logarithmic factors is an open problem. The reason lies in the fact that usual tools (such as Gershgorin's theorem) to estimate condition numbers essentially involve the coherence (or \(\ell_1\)-coherence function), as in \(\delta_{\kappa s} \leq (\kappa s - 1)\mu\). Bounding the latter by a fixed δ still faces the quadratic bottleneck (1.12).

We resolve this issue by passing to random matrices. Then a whole new set of tools from probability theory becomes available. When the matrix \(\mathbf{A}\) is drawn at random, these tools enable one to show that the restricted isometry property or other conditions ensuring recovery hold with high probability provided \(m \geq Cs\ln(N/s)\). Chapters  7 and  8 introduce all the necessary background on probability theory.

We start in  Chap. 7 by recalling basic concepts such as expectation, moments, Gaussian random variables and vectors, and Jensen’s inequality. Next, we treat the relation between the moments of a random variable and its tails. Bounds on the tails of sums of independent random variables will be essential later, and Cramér’s theorem provides general estimates involving the moment generating functions of the random variables. Hoeffding’s inequality specializes to the sum of independent bounded mean-zero random variables. Gaussian and Rademacher/Bernoulli variables (the latter taking the values + 1 or − 1 with equal probability) fall into the larger class of subgaussian random variables, for which we also present basic results. Finally, Bernstein inequalities refine Hoeffding’s inequality by taking into account the variance of the random variables. Furthermore, they extend to possibly unbounded subexponential random variables.

For many compressive sensing results with Gaussian or Bernoulli random matrices—that is, for large parts of Chaps.  9 and  11, including bounds for the restricted isometry constants—the relatively simple tools of  Chap. 7 are already sufficient. Several topics in compressive sensing, however, notably the analysis of random partial Fourier matrices, build on more advanced tools from probability theory. Chapter  8 presents the required material. For instance, we cover Rademacher sums of the form \(\sum _{j}\epsilon _{j}a_{j}\) where the ε j = ± 1 are independent Rademacher variables and the symmetrization technique leading to such sums. Khintchine inequalities bound the moments of Rademacher sums. The noncommutative Bernstein inequality provides a tail bound for the operator norm of independent mean-zero random matrices. Dudley’s inequality bounds the expected supremum over a family of random variables by a geometric quantity of the set indexing the family. Concentration of measure describes the high-dimensional phenomenon which sees functions of random vectors concentrating around their means. Such a result is presented for Lipschitz functions of Gaussian random vectors.

With the probabilistic tools at hand, we are prepared to study Gaussian, Bernoulli, and more generally subgaussian random matrices in  Chap. 9. A crucial ingredient for the proof of the restricted isometry property is the concentration inequality
$$\displaystyle{ \mathbb{P}(\vert \|\mathbf{A}\mathbf{x}\|_{2}^{2} -\|\mathbf{x}\|_{ 2}^{2}\vert \geq t\|\mathbf{x}\|_{ 2}^{2}) \leq 2\exp (-cm{t}^{2}), }$$
(1.16)
valid for any fixed \(\mathbf{x} \in {\mathbb{R}}^{N}\) and t ∈ (0, 1) with a random draw of a properly scaled m ×N subgaussian random matrix \(\mathbf{A}\). Using covering arguments—in particular, exploiting bounds on covering numbers from Appendix C.2—we deduce that the restricted isometry constants satisfy \(\delta_s \leq \delta\) with high probability provided
$$\displaystyle{ m \geq {C\delta }^{-2}s\ln (eN/s). }$$
(1.17)
The invariance of the concentration inequality under orthogonal transformations implies that subgaussian random matrices are universal in the sense that they allow for the recovery of vectors that are sparse not only in the canonical basis but also in an arbitrary (but fixed) orthonormal basis.
In the special case of Gaussian random matrices, one can exploit refined methods not available in the subgaussian case, such as Gordon’s lemma and concentration of measure. We will deduce good explicit constants in the nonuniform setting where we only target recovery of a fixed s-sparse vector using a random draw of an m ×N Gaussian matrix. For large dimensions, we roughly obtain that
$$\displaystyle{m > 2s\ln (N/s)}$$
is sufficient to recover an s-sparse vector using \(\ell_1\)-minimization; see  Chap. 9 for precise statements. This is the general rule of thumb reflecting the outcome of empirical tests, even for non-Gaussian random matrices—although the proof applies only to the Gaussian case.
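This rule of thumb is easy to probe numerically. The sketch below (NumPy/SciPy, with illustrative sizes) recovers a random s-sparse vector from Gaussian measurements by rewriting the equality-constrained \(\ell_1\)-minimization as a linear program in the real case.

```python
import numpy as np
from scipy.optimize import linprog   # assumption: SciPy is available

def basis_pursuit(A, y):
    """Equality-constrained l1-minimization as a linear program in the variables (x, u)."""
    m, N = A.shape
    c = np.concatenate([np.zeros(N), np.ones(N)])      # minimize sum(u) with |x_j| <= u_j
    A_eq = np.hstack([A, np.zeros((m, N))])            # equality constraint A x = y
    A_ub = np.block([[ np.eye(N), -np.eye(N)],         #  x - u <= 0
                     [-np.eye(N), -np.eye(N)]])        # -x - u <= 0
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * N), A_eq=A_eq, b_eq=y,
                  bounds=[(None, None)] * N + [(0, None)] * N)
    return res.x[:N]

rng = np.random.default_rng(0)
m, N, s = 80, 200, 10                                  # roughly m > 2 s ln(N/s) ~ 60
A = rng.standard_normal((m, N)) / np.sqrt(m)
x = np.zeros(N)
x[rng.choice(N, s, replace=False)] = rng.standard_normal(s)
print(np.linalg.norm(basis_pursuit(A, A @ x) - x))     # near zero when recovery succeeds
```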

We close  Chap. 9 with a detour to the Johnson–Lindenstrauss lemma, which states that a finite set of points in a high-dimensional space can be mapped to a significantly lower-dimensional space while almost preserving all mutual distances (no sparsity assumption is involved here). This is somewhat equivalent to the concentration inequality (1.16). In this sense, the Johnson–Lindenstrauss lemma implies the RIP. We will conversely show that if a matrix satisfies the RIP, then randomizing the signs of its columns yields a Johnson–Lindenstrauss embedding with high probability.
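A quick numerical illustration of the Johnson–Lindenstrauss phenomenon (a sketch assuming SciPy for the pairwise distances; the point set and sizes are arbitrary):

```python
import numpy as np
from scipy.spatial.distance import pdist   # assumption: SciPy is available

rng = np.random.default_rng(0)
n_points, N, m = 50, 10_000, 500
X = rng.standard_normal((n_points, N))            # a finite point set in high dimension
A = rng.standard_normal((m, N)) / np.sqrt(m)      # scaled Gaussian projection
ratios = pdist(X @ A.T) / pdist(X)                # pairwise distances after / before
print(ratios.min(), ratios.max())                 # both close to 1: distances almost preserved
```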

In  Chap. 10, we show that the number of measurements (1.3) for sparse recovery using subgaussian random matrices is optimal. This is done by relating the standard compressive sensing problem to Gelfand widths of \(\ell_1\)-balls. More precisely, for a subset K of a normed space \(X = ({\mathbb{R}}^{N},\|\cdot \|)\) and for m < N, we introduce the quantity
$$\displaystyle{{E}^{m}(K,X):=\inf \left \{\sup _{\mathbf{ x}\in K}\|\mathbf{x} - \Delta (\mathbf{A}\mathbf{x})\|,\;\mathbf{A} \in {\mathbb{R}}^{m\times N},\;\Delta : {\mathbb{R}}^{m} \rightarrow {\mathbb{R}}^{N}\right \}.}$$
It quantifies the worst-case reconstruction error over K of optimal measurement/reconstruction schemes in compressive sensing. The Gelfand width of K is defined as
$$\displaystyle{{d}^{m}(K,X):=\inf \left \{\sup _{\mathbf{ x}\in K\cap \ker \mathbf{A}}\|\mathbf{x}\|,\;\mathbf{A} \in {\mathbb{R}}^{m\times N}\right \}.}$$
If K = − K and \(K + K \subset aK\) for some constant a, as is the case with a = 2 for the unit ball of some norm, then
$$\displaystyle{{d}^{m}(K,X) \leq {E}^{m}(K,X) \leq a{d}^{m}(K,X).}$$
Since by (1.11) unit balls \(K = B_{p}^{N}\) in the N-dimensional \(\ell_p\)-space, p ≤ 1, are good models for compressible vectors, we are led to study their Gelfand widths. For ease of exposition, we only cover the case p = 1. An upper bound for \({E}^{m}(B_{1}^{N},\ell_{2}^{N})\), and thereby for \({d}^{m}(B_{1}^{N},\ell_{2}^{N})\), can be easily derived from the error estimate (1.14) combined with the number of measurements that ensure the RIP for subgaussian random matrices. This gives
$$\displaystyle{{d}^{m}(B_{ 1}^{N},\ell_{ 2}^{N}) \leq C\min {\left \{1, \frac{\ln (\mathit{eN}/m)} {m} \right \}}^{1/2}.}$$
We derive the matching lower bound
$$\displaystyle{{d}^{m}(B_{ 1}^{N},\ell_{ 2}^{N}) \geq c\min {\left \{1, \frac{\ln (\mathit{eN}/m)} {m} \right \}}^{1/2},}$$
and we deduce that the bound (1.17) is necessary to guarantee the existence of a stable scheme for s-sparse recovery. An intermediate step in the proof of this lower bound is of independent interest. It states that a necessary condition on the number of measurements to guarantee that every s-sparse vector \(\mathbf{x}\) is recoverable from \(\mathbf{y} = \mathbf{A}\mathbf{x}\) via \(\ell_1\)-minimization (stability is not required) is
$$\displaystyle{ m \geq Cs\ln (\mathit{eN}/s). }$$
(1.18)
The error bound (1.14) includes the term \(\sigma _{s}(\mathbf{x})_{1}/\sqrt{s}\), although the error is measured in the \(\ell_2\)-norm. This raises the question of the possibility of an error bound with the term \(\sigma _{s}(\mathbf{x})_{2}\) on the right-hand side. Chapter  11 investigates this question and the more general question of the existence of pairs of measurement matrix \(\mathbf{A} \in {\mathbb{R}}^{m\times N}\) and reconstruction map \(\Delta : {\mathbb{R}}^{m} \rightarrow {\mathbb{R}}^{N}\) satisfying
$$\displaystyle{\|\mathbf{x} - \Delta (\mathbf{A}\mathbf{x})\|_{q} \leq \frac{C} {{s}^{1/p-1/q}}\sigma _{s}(\mathbf{x})_{p}\quad \mbox{ for all }\mathbf{x} \in {\mathbb{R}}^{N}.}$$
This bound is referred to as mixed \((\ell_q,\ell_p)\)-instance optimality and simply as \(\ell_p\)-instance optimality when q = p. The \(\ell_1\)-instance optimality implies the familiar bound \(m \geq Cs\ln(eN/s)\). However, \(\ell_2\)-instance optimality necessarily leads to
$$\displaystyle{m \geq cN.}$$
This regime of parameters is not interesting in compressive sensing. However, we may ask for less, namely, that the error bound in \(\ell_2\) holds in a nonuniform setting, i.e., for fixed \(\mathbf{x}\) with high probability on a draw of a subgaussian random matrix \(\mathbf{A}\). As it turns out, with \(\Delta_1\) denoting the \(\ell_1\)-minimization map, the error bound
$$\displaystyle{\|\mathbf{x} - \Delta _{1}(\mathbf{A}\mathbf{x})\|_{2} \leq C\sigma _{s}(\mathbf{x})_{2}}$$
does hold with high probability under the condition \(m \geq Cs\ln(eN/s)\). The analysis necessitates the notion of \(\ell_1\)-quotient property. It is proved for Gaussian random matrices, and a slight variation is proved for subgaussian random matrices.
In addition,  Chap. 11 investigates a question about measurement error. When it is present, one may use the quadratically constrained \(\ell_1\)-minimization
$$\displaystyle{\mathrm{minimize\;}\|\mathbf{z}\|_{1}\quad \mbox{ subject to }\|\mathbf{A}\mathbf{z} -\mathbf{y}\|_{2} \leq \eta,}$$
yet this requires an estimation of the noise level η (other algorithms do not require an estimation of η, but they require an estimation of the sparsity level s instead). Only then are the error bounds (1.14) and (1.15) valid under RIP. We will establish that, somewhat unexpectedly, the equality-constrained \(\ell_1\)-minimization (1.2) can also be performed in the presence of measurement error using Gaussian measurement matrices. Indeed, the \(\ell_1\)-quotient property implies the same reconstruction bounds (1.14) and (1.15) even without knowledge of the noise level η.

Subgaussian random matrices are of limited practical use, because specific applications may impose a structure on the measurement matrix that totally random matrices lack. As mentioned earlier, deterministic measurement matrices providing provable recovery guarantees are missing from the current theory. This motivates the study of structured random matrices. In  Chap. 12, we investigate a particular class of structured random matrices arising in sampling problems. This includes random partial Fourier matrices.

Let \((\psi _{1},\ldots,\psi _{N})\) be a system of complex-valued functions which are orthonormal with respect to some probability measure ν on a set \(\mathcal{D}\), i.e.,
$$\displaystyle{\int _{\mathcal{D}}\psi _{j}(t)\overline{\psi _{k}(t)}d\nu (t) =\delta _{j,k}.}$$
We call this system a bounded orthonormal system if there exists a constant K ≥ 1 (ideally independent of N) such that
$$\displaystyle{\sup _{1\leq j\leq N}\sup _{t\in \mathcal{D}}\vert \psi _{j}(t)\vert \leq K.}$$
A particular example is the trigonometric system where \(\psi _{j}(t) = {e}^{2\pi ijt}\) for \(j \in \Gamma \subset \mathbb{Z}\) with card(Γ) = N, in which case K = 1. We consider functions in the span of a bounded orthonormal system, i.e.,
$$\displaystyle{f(t) =\sum _{ j=1}^{N}x_{ j}\psi _{j}(t),}$$
and we assume that the coefficient vector \(\mathbf{x}\,\in \,{\mathbb{C}}^{N}\) is sparse. The task is to reconstruct f (or equivalently \(\mathbf{x}\)) from sample values at locations t 1, …, t m , namely,
$$\displaystyle{y_{k} = f(t_{k}) =\sum _{ j=1}^{N}x_{ j}\psi _{j}(t_{k}).}$$
Introducing the sampling matrix \(\mathbf{A} \in {\mathbb{C}}^{m\times N}\) with entries
$$\displaystyle{ A_{k,j} =\psi _{j}(t_{k}), }$$
(1.19)
the vector of samples is given by \(\mathbf{y} = \mathbf{A}\mathbf{x}\). We are back to the standard compressive sensing problem with a matrix \(\mathbf{A}\) taking this particular form. Randomness enters the picture by way of the sampling locations \(t_1,\ldots,t_m\), which are chosen independently at random according to the probability measure ν. This makes \(\mathbf{A}\) a structured random matrix. Before studying its performance, we relate this sampling setup to discrete uncertainty principles and establish performance limitations. In the context of the Hadamard transform, in slight contrast to (1.18), we show that now at least \(Cs\ln N\) measurements are necessary.
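As a concrete instance of (1.19), the sketch below builds the random sampling matrix for the trigonometric system with \(\Gamma = \{0,\ldots,N-1\}\), so that K = 1; the sizes and the sparse coefficient vector are illustrative, and recovery from \(\mathbf{y}\) would then proceed, e.g., via \(\ell_1\)-minimization.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, s = 128, 40, 5
t = rng.random(m)                                   # t_1, ..., t_m drawn uniformly from [0, 1)
A = np.exp(2j * np.pi * np.outer(t, np.arange(N)))  # A_{k,j} = psi_j(t_k) = e^{2 pi i j t_k}

x = np.zeros(N, dtype=complex)                      # a sparse coefficient vector
x[rng.choice(N, s, replace=False)] = rng.standard_normal(s)
y = A @ x                                           # the m samples f(t_1), ..., f(t_m)
```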

Deriving recovery guarantees for the random sampling matrix \(\mathbf{A}\) in (1.19) is more involved than for subgaussian random matrices where all the entries are independent. In fact, the matrix \(\mathbf{A}\) has mN entries, but it is generated only by m independent random variables. We proceed by increasing level of difficulty and start by showing nonuniform sparse recovery guarantees for \(\ell_1\)-minimization. The number of samples allowing one to recover a fixed s-sparse coefficient vector \(\mathbf{x}\) with high probability is then \(m \geq CK^2 s\ln N\).

The bound for the restricted isometry constants of the random sampling matrix \(\mathbf{A}\) in (1.19) is a highlight of the theory of compressive sensing. It states that δ s δ with high probability provided
$$\displaystyle{m \geq CK^{2}\delta ^{-2}s\ln ^{4}(N).}$$
We close  Chap. 12 by illustrating some connections to the \(\Lambda_1\)-problem from harmonic analysis.
A further type of measurement matrix used in compressive sensing is considered in  Chap. 13. It arises as the adjacency matrix of certain bipartite graphs called lossless expanders. Hence, its entries take only the values 0 and 1. The existence of lossless expanders with optimal parameters is shown via probabilistic (combinatorial) arguments. We then show that the m ×N adjacency matrix of a lossless expander allows for uniform recovery of all s-sparse vectors via \(\ell_1\)-minimization provided that
$$\displaystyle{m \geq Cs\ln (N/s).}$$
Moreover, we present two iterative reconstruction algorithms. One of them has the remarkable feature that its runtime is sublinear in the signal length N; more precisely, its execution requires \(\mathcal{O}(s^{2}\ln ^{3}N)\) operations. Since only the locations and the values of s nonzero entries need to be identified, such superfast algorithms are not implausible. In fact, sublinear algorithms are possible in other contexts, too, but they are always designed together with the measurement matrix \(\mathbf{A}\).
In  Chap. 14, we follow a different approach to sparse recovery guarantees by considering a fixed (deterministic) matrix \(\mathbf{A}\) and choosing the s-sparse vector \(\mathbf{x}\) at random. More precisely, we select its support set S uniformly at random among all subsets of [N] = { 1, 2, …, N} with cardinality s. The signs of the nonzero coefficients of \(\mathbf{x}\) are chosen at random as well, but their magnitudes are kept arbitrary. Under a very mild condition on the coherence μ of \(\mathbf{A} \in {\mathbb{C}}^{m\times N}\), namely,
$$\displaystyle{ \mu \leq \frac{c} {\ln N}, }$$
(1.20)
and under the condition
$$\displaystyle{ \frac{s\|\mathbf{A}\|_{2\rightarrow 2}^{2}} {N} \leq \frac{c} {\ln N}, }$$
(1.21)
the vector \(\mathbf{x}\) is recoverable from \(\mathbf{y} = \mathbf{A}\mathbf{x}\) via \(\ell_1\)-minimization with high probability. The (deterministic or random) matrices \(\mathbf{A}\) usually used in compressive sensing and signal processing, for instance, tight frames, obey (1.21) provided
$$\displaystyle{ m \geq Cs\ln N. }$$
(1.22)
Since (1.20) is also satisfied for these matrices, we again obtain sparse recovery in the familiar parameter regime (1.22). The analysis relies on the crucial fact that a random column submatrix of \(\mathbf{A}\) is well conditioned under (1.20) and (1.21). We note that this random signal model may not always reflect the type of signals encountered in practice, so the theory for random matrices remains important. Nevertheless, the result for random signals explains the outcome of numerical experiments where the signals are often constructed at random.
The \(\ell_1\)-minimization principle (basis pursuit) is one of the most powerful sparse recovery methods—as should have become clear by now. Chapter  15 presents a selection of efficient algorithms to perform this optimization task in practice (the selection is nonexhaustive, and the algorithms have been chosen not only for their efficiency but also for their simplicity and diversity). First, the homotopy method applies to the real-valued case \(\mathbf{A} \in {\mathbb{R}}^{m\times N}\), \(\mathbf{y} \in {\mathbb{R}}^{m}\). For a parameter λ > 0, we consider the functional
$$\displaystyle{F_{\lambda }(\mathbf{x}) = \frac{1} {2}\|\mathbf{A}\mathbf{x} -\mathbf{y}\|_{2}^{2} +\lambda \| \mathbf{x}\|_{ 1}.}$$
Its minimizer \(\mathbf{x}_{\lambda }\) converges to the minimizer \({\mathbf{x}}^{\sharp }\) of the equality-constrained \(\ell_1\)-minimization problem (1.2). The map \(\lambda \mapsto \mathbf{x}_{\lambda }\) turns out to be piecewise linear. The homotopy method starts with a sufficiently large λ, for which \(\mathbf{x}_{\lambda } = \mathbf{0}\), and traces the endpoints of the linear pieces until \(\lambda = 0^{+}\), for which \(\mathbf{x}_{\lambda } ={ \mathbf{x}}^{\sharp }\). At each step of the algorithm, an element is added to or removed from the support set of the current minimizer. Since one mostly adds elements to the support, this algorithm is usually very efficient for small sparsity.

As a second method, we treat Chambolle and Pock’s primal–dual algorithm. This algorithm applies to a large class of optimization problems including 1-minimization. It consists of a simple iterative procedure which updates a primal, a dual, and an auxiliary variable at each step. All of the computations are easy to perform. We show convergence of the sequence of primal variables generated by the algorithm to the minimizer of the given functional and outline its specific form for three types of 1-minimization problems. In contrast to the homotopy method, it applies also in the complex-valued case.
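For orientation, here is what such an iteration can look like when specialized to the equality-constrained \(\ell_1\)-minimization (1.2) in the real case; this is a hedged sketch with fixed step sizes and a fixed iteration count, not the precise form derived in Chap. 15.

```python
import numpy as np

def chambolle_pock_basis_pursuit(A, y, n_iter=5000):
    """A sketch of a primal-dual iteration for min ||x||_1 subject to Ax = y (real case)."""
    m, N = A.shape
    L = np.linalg.norm(A, 2)                     # operator norm of A
    tau = sigma = 0.99 / L                       # step sizes with tau * sigma * L^2 < 1

    def soft(z, t):                              # proximal map of t * ||.||_1
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    x = np.zeros(N); x_bar = np.zeros(N); xi = np.zeros(m)
    for _ in range(n_iter):
        xi = xi + sigma * (A @ x_bar - y)        # dual update
        x_new = soft(x - tau * (A.T @ xi), tau)  # primal update (soft thresholding)
        x_bar = 2 * x_new - x                    # extrapolation step
        x = x_new
    return x
```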

Finally, we discuss a method that iteratively solves weighted \(\ell_2\)-minimization problems. The weights are suitably updated in each iteration based on the solution of the previous iteration. Since weighted \(\ell_2\)-minimization can be performed efficiently (in fact, this is a linear problem), each step of the algorithm can be computed quickly. Although this algorithm is strongly motivated by \(\ell_1\)-minimization, its convergence to the \(\ell_1\)-minimizer is not guaranteed. Nevertheless, under the null space property of the matrix \(\mathbf{A}\) (equivalent to sparse recovery via \(\ell_1\)-minimization), we show that the iteratively reweighted least squares algorithm recovers every s-sparse vector from \(\mathbf{y} = \mathbf{A}\mathbf{x}\). Recovery is stable when passing to compressible vectors. Moreover, we give an estimate of the convergence rate in the exactly sparse case.
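A minimal sketch of this idea in NumPy; the weight update and the schedule for the smoothing parameter eps are simplifications of the algorithm analyzed in Chap. 15.

```python
import numpy as np

def iteratively_reweighted_least_squares(A, y, n_iter=50, eps=1.0):
    """A sketch of IRLS for Ax = y; weights come from the previous iterate."""
    x = np.linalg.lstsq(A, y, rcond=None)[0]         # start from the least-squares solution
    for _ in range(n_iter):
        D = np.diag(np.abs(x) + eps)                 # inverse weights 1/w_j = |x_j| + eps
        # closed form of  argmin sum_j w_j x_j^2  subject to  A x = y
        x = D @ A.T @ np.linalg.solve(A @ D @ A.T, y)
        eps = max(eps / 10.0, 1e-9)                  # shrink the smoothing parameter
    return x
```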

The book is concluded with three appendices. Appendix A covers background material from linear algebra and matrix analysis, including vector and matrix norms, eigenvalues and singular values, and matrix functions. Basic concepts and results from convex analysis and convex optimization are presented in Appendix B. We also treat matrix convexity and present a proof of Lieb’s theorem on the concavity of the matrix function \(\mathbf{X}\mapsto \mathrm{tr\,}\exp (\mathbf{H} +\ln \mathbf{X})\) on the set of positive definite matrices. Appendix C presents miscellaneous material including covering numbers, Fourier transforms, elementary estimates on binomial coefficients, the Gamma function and Stirling’s formula, smoothing of Lipschitz functions via convolution, distributional derivatives, and differential inequalities.

Notation is usually introduced when it first appears. Additionally, a collection of symbols used in the text can be found on pp. 589. All the constants in this book are universal unless stated otherwise. This means that they do not depend on any other quantity. Often, the value of a constant is given explicitly or it can be deduced from the proof.

1.4 Notes

The field of compressive sensing was initiated with the papers [94] by Candès, Romberg, and Tao and [152] by Donoho, who coined the term compressed sensing. Even though there have been predecessors on various aspects of the field, these papers seem to be the first ones to combine the ideas of \(\ell_1\)-minimization with a random choice of measurement matrix and to realize the effectiveness of this combination for solving underdetermined systems of equations. Also, they emphasized the potential of compressive sensing for many signal processing tasks.

We now list some of the highlights from preceding works and earlier developments connected to compressive sensing. Details and references on the advances of compressive sensing itself will be given in the Notes sections at the end of each subsequent chapter. References [29, 84, 100, 182, 204, 411, 427] provide overview articles on compressive sensing.

Arguably, the first contribution connected to sparse recovery was made by de Prony [402] as far back as 1795. He developed a method for identifying the frequencies \(\omega _{j} \in \mathbb{R}\) and the amplitudes \(x_{j} \in \mathbb{C}\) in a nonharmonic trigonometric sum of the form \(f(t) =\sum _{ j=1}^{s}x_{j}{e}^{2\pi i\omega _{j}t}\). His method takes equidistant samples and solves an eigenvalue problem to compute the ω j . This method is related to Reed–Solomon decoding covered in the next chapter; see Theorem 2.15. For more information on the Prony method, we refer to [344, 401].

The use of \(\ell_1\)-minimization appeared in the 1965 Ph.D. thesis [332] of Logan in the context of sparse frequency estimation, and an early theoretical work on \(L_1\)-minimization is the paper [161] by Donoho and Logan. Geophysicists observed in the late 1970s that \(\ell_1\)-minimization can be successfully used to compute a sparse reflection function indicating changes between subsurface layers [469, 441]. The use of total-variation minimization, which is closely connected to \(\ell_1\)-minimization, appeared in the 1990s in the work on image processing by Rudin, Osher, and Fatemi [436]. The use of \(\ell_1\)-minimization and related greedy methods in statistics was greatly popularized by the work of Tibshirani [473] on the LASSO (Least Absolute Shrinkage and Selection Operator).

The theory of sparse approximation and associated algorithms began in the 1990s with the papers [342, 359, 114]. The theoretical understanding of conditions allowing greedy methods and \(\ell_1\)-minimization to recover the sparsest solution developed with the work in [158, 181, 155, 239, 224, 215, 476, 479].

Compressive sensing has connections with the area of information-based complexity which considers the general question of how well functions f from a class \(\mathcal{F}\) can be approximated from m sample values or more generally from the evaluation of m linear or nonlinear functionals applied to f; see [474]. The optimal recovery error defined as the maximal reconstruction error for the best sampling and recovery methods over all functions in the class \(\mathcal{F}\) is closely related to the so-called Gelfand width of \(\mathcal{F}\) [370]; see also  Chap. 10. Of particular interest in compressive sensing is the \(\ell_1\)-ball \(B_{1}^{N}\) in \({\mathbb{R}}^{N}\). Famous results due to Kashin [299] and Gluskin and Garnaev [219, 227] sharply bound the Gelfand widths of \(B_{1}^{N}\) from above and below; see also  Chap. 10. Although the original interest of Kashin was to estimate m-widths of Sobolev classes, these results give precise performance bounds on how well any method may recover (approximately) sparse vectors from linear measurements. It is remarkable that [299, 219] already employed Bernoulli and Gaussian random matrices in ways similar to their use in compressive sensing (see  Chap. 9).

In computer science, too, sparsity appeared before the advent of compressive sensing through the area of sketching. Here, one is not only interested in recovering huge data sets (such as data streams on the Internet) from vastly undersampled data, but one requires in addition that the associated algorithms have sublinear runtime in the signal length. There is no a priori contradiction in this desideratum because one only needs to report locations and values of nonzero entries. Such algorithms often use ideas from group testing [173], which dates back to World War II, when Dorfman [171] devised an efficient method for detecting draftees with syphilis. One usually designs the matrix and the fast algorithm simultaneously [131, 225] in this setup. Lossless expanders as studied in  Chap. 13 play a key role in some of the constructions [41]. Quite remarkably, sublinear algorithms are also available for sparse Fourier transforms [223, 519, 287, 288, 262, 261].

Applications of Compressive Sensing. We next provide comments and references on the applications and motivations described in Sect. 1.2.

Single-pixel camera. The single-pixel camera was developed by Baraniuk and coworkers [174] as an elegant proof of concept that the ideas of compressive sensing can be implemented in hardware.

Magnetic resonance imaging. The initial paper [94] on compressive sensing was motivated by medical imaging—although Candès et al. have in fact treated the very similar problem of computerized tomography. The application of compressive sensing techniques to magnetic resonance imaging (MRI) was investigated in [338, 255, 497, 358]. Background on the theoretical foundations of MRI can be found, for instance, in [252, 267, 512]. Applications of compressive sensing to the related problem of nuclear magnetic resonance spectroscopy are contained in [278, 447]. Background on the methods related to Fig. 1.6 is described in the work of Lustig, Vasanawala and coworkers [497, 358].

Radar. The particular radar application outlined in Sect. 1.2 is described in more detail in [268]. The same mathematical model appears also in sonar and in the channel estimation problem of wireless communications [384, 412, 385]. The application of compressive sensing to other radar scenarios can be found, for instance, in [185, 189, 397, 455, 283].

Sampling theory. The classical sampling theorem (1.5) can be associated with the names of Shannon, Nyquist, Whittaker, and Kotelnikov. Sampling theory is a broad and well-developed area. We refer to [39, 195, 271, 272, 294] for further information on the classical aspects. The use of sparse recovery techniques in sampling problems appeared early in the development of the compressive sensing theory [94, 97, 408, 409, 411, 416]. In fact, the alternative name compressive sampling indicates that compressive sensing can be viewed as a part of sampling theory—although it draws from quite different mathematical tools than classical sampling theory itself.

Sparse approximation. The theory of compressive sensing can also be viewed as a part of sparse approximation with roots in signal processing, harmonic analysis [170], and numerical analysis [122]. General sources for background on sparse approximation and its applications are the books [179, 451, 472] as well as the survey paper [73].

The principle of representing a signal by a small number of terms in a suitable basis in order to achieve compression is realized, for instance, in the ubiquitous compression standards JPEG, MPEG, and MP3. Wavelets [137] are known to provide a good basis for images, and the analysis of the best (nonlinear) approximation reaches into the area of function spaces, more precisely Besov spaces [508]. Similarly, Gabor expansions [244] may compress audio signals. Since good Gabor systems are always redundant systems (frames) and never bases, computational tools to compute the sparsest representation of a signal are essential. It was realized in [359, 342] that this problem is in general NP-hard. The greedy approach via orthogonal matching pursuit was then introduced in [342] (although it had appeared earlier in different contexts), while basis pursuit (\(\ell_1\)-minimization) was introduced in [114].

The use of the uncertainty principle for deducing a positive statement on the data separation problem with respect to the Fourier and canonical bases appeared in [164, 163]. For further information on the separation problem, we refer the reader to [181, 92, 158, 160, 238, 331, 482]. Background on denoising via sparse representations can be found in [180, 450, 105, 150, 159, 407].

The analysis of conditions allowing algorithms such as \(\ell_1\)-minimization or orthogonal matching pursuit to recover the sparsest representation started with the contributions [155, 157, 158, 156, 224, 476, 479], and these early results are the basis for the advances in compressive sensing.

Error correction. The idealized setup of error correction and the compressive sensing approach described in Sect. 1.2 appeared in [96, 167, 431]. For more background on error correction, we refer to [282].

Statistics and machine learning. Sparsity has a long history in statistics and in linear regression models in particular. The corresponding area is sometimes referred to as high-dimensional statistics or model selection because the support set of the coefficient vector \(\mathbf{x}\) determines the relevant explanatory variables and thereby selects a model. Stepwise forward regression methods are closely related to greedy algorithms such as (orthogonal) matching pursuit. The LASSO, i.e., the minimization problem (1.8), was introduced by Tibshirani in [473]. Candès and Tao have introduced the Dantzig selector (1.9) in [98] and realized that methods of compressive sensing (the restricted isometry property) are useful for the analysis of sparse regression methods. We refer to [48] and the monograph [76] for details. For more information on machine learning, we direct the reader to [18, 133, 134, 444]. Connections between sparsity and machine learning can be found, for instance, in [23, 147, 513].

Low-rank matrix recovery. The extension of compressive sensing to the recovery of low-rank matrices from incomplete information emerged with the papers [90, 99, 418]. The idea of replacing the rank minimization problem by the nuclear norm minimization appeared in the Ph.D. thesis of Fazel [190]. The matrix completion problem is treated in [90, 417, 99] and the more general problem of quantum state tomography in [246, 245, 330].

Let us briefly mention further applications and relations to other fields.

In inverse problems, sparsity has also become an important concept for regularization methods. Instead of Tikhonov regularization with a Hilbert space norm [186], one uses an \(\ell_1\)-norm regularization approach [138, 406]. In many practical applications, this improves the recovered solutions. Ill-posed inverse problems appear, for instance, in geophysics where \(\ell_1\)-norm regularization was already used in [469, 441] but without rigorous mathematical theory at that time. We refer to the survey papers [269, 270] dealing with compressive sensing in seismic exploration.

Total-variation minimization is a classical and successful approach for image denoising and other tasks in image processing [106, 436, 104]. Since the total variation is the \(\ell_1\)-norm of the gradient, the minimization problem is closely related to basis pursuit. In fact, the motivating example for the first contribution [94] of Candès, Romberg, and Tao to compressive sensing came from total-variation minimization in computerized tomography. The restricted isometry property can be used to analyze image recovery via total-variation minimization [364]. The primal–dual algorithm of Chambolle and Pock to be presented in  Chap. 15 was originally motivated by total-variation minimization as well [107].

Further applications of compressive sensing and sparsity in general include imaging (tomography, ultrasound, photoacoustic imaging, hyperspectral imaging, etc.), analog-to-digital conversion [488, 353], DNA microarray processing, astronomy [507], and wireless communications [27, 468].

Topics not Covered in this Book. It is impossible to give a detailed account of all the directions that have so far cropped up around compressive sensing. This book certainly makes a selection, but we believe that we cover the most important aspects and mathematical techniques. With this basis, the reader should be well equipped to read the original references on further directions, generalizations, and applications. Let us only give a brief account of additional topics together with the relevant references. Again, no claim about completeness of the list is made.

Structured sparsity models. One often has more a priori knowledge than just pure sparsity, in the sense that the support set of the sparse vector to be recovered possesses a certain structure, i.e., only specific support sets are allowed. Let us briefly describe the joint-sparsity and block-sparsity models.

Suppose that we take measurements not only of a single signal but of a collection of signals that are somewhat coupled. Rather than assuming that each signal is sparse (or compressible) on its own, we assume that the unknown support set is the same for all signals in the collection. In this case, we speak of joint sparsity. A motivating example is color images where each signal corresponds to a color channel of the image, say red, green, and blue. Since edges usually appear at the same location for all channels, the gradient features some joint sparsity. Instead of the usual \(\ell_1\)-minimization problem, one considers mixed \(\ell_1/\ell_2\)-norm minimization or greedy algorithms exploiting the joint-sparsity structure. A similar setup is described by the block-sparsity (or group-sparsity) model, where certain indices of the sparse vector are grouped together. Then a signal is block sparse if most groups (blocks) of coefficients are zero. In other words, nonzero coefficients appear in groups. Recovery algorithms may exploit this prior knowledge to improve the recovery performance. A theory can be developed along similar lines as usual sparsity [143, 183, 184, 203, 487, 241, 478]. The so-called model-based compressive sensing [30] provides a further, very general structured sparsity setup.

Sublinear algorithms. This type of algorithm has been developed in computer science for a long time. The fact that only the locations and values of nonzero entries of a sparse vector have to be reported enables one to design recovery algorithms whose runtime is sublinear in the vector length. Such recovery methods are also called streaming algorithms or heavy hitters. We will only cover a toy sublinear algorithm in  Chap. 13, and we refer to [41, 131, 223, 289, 285, 222, 225, 261] for more information.

Connection with the geometry of random polytopes. Donoho and Tanner [166, 165, 154, 167] approached the analysis of sparse recovery via \(\ell_1\)-minimization through polytope geometry. In fact, the recovery of s-sparse vectors via \(\ell_1\)-minimization is equivalent to a geometric property—called neighborliness—of the projected \(\ell_1\)-ball under the action of the measurement matrix; see also Corollary 4.39. When the measurement matrix is a Gaussian random matrix, Donoho and Tanner give a precise analysis of so-called phase transitions that predict in which ranges of (s, m, N) sparse recovery is successful and unsuccessful with high probability. In particular, their analysis provides the value of the optimal constant C such that \(m \geq Cs\ln(N/s)\) allows for s-sparse recovery via \(\ell_1\)-minimization. We only give a brief account of their work in the Notes of  Chap. 9.

Compressive sensing and quantization. If compressive sensing is used for signal acquisition, then a realistic sensor must quantize the measured data. This means that only a finite number of values for the measurements y are possible. For instance, 8 bits provide \(2^8 = 256\) values for an approximation of y to be stored. If the quantization is coarse, then this additional source of error cannot be ignored and a revised theoretical analysis becomes necessary. We refer to [316, 520, 249] for background information. We also mention the extreme case of 1-bit compressed sensing where only the signs of the measurements are available via \(\mathbf{y} =\mathrm{ sgn}(\mathbf{A}\mathbf{x})\) [290, 393, 394].

Dictionary learning. Sparsity usually occurs in a specific basis or redundant dictionary. In certain applications, it may not be immediately clear which dictionary is suitable to sparsify the signals of interest. Dictionary learning tries to identify a good dictionary using training signals. Algorithmic approaches include the K-SVD algorithm [5, 429] and optimization methods [242]. Optimizing over both the dictionary and the coefficients in the expansions results in a nonconvex program, even when using \(\ell_1\)-minimization. Therefore, it is notoriously hard to establish a rigorous mathematical theory of dictionary learning despite the fact that the algorithms perform well in practice. Nevertheless, there are a few interesting mathematical results available in the spirit of compressive sensing [221, 242].

Recovery of functions of many variables. Techniques from compressive sensing can be exploited for the reconstruction of functions on a high-dimensional space from point samples. Traditional approaches suffer the curse of dimensionality, which predicts that the number of samples required to achieve a certain reconstruction accuracy scales exponentially with the spatial dimension even for classes of infinitely differentiable functions [371, 474]. It is often a reasonable assumption in practice that the function to be reconstructed depends only on a small number of (a priori unknown) variables. This model is investigated in [125, 149], and ideas of compressive sensing allow one to dramatically reduce the number of required samples. A more general model considers functions of the form \(f(\mathbf{x}) = g(\mathbf{A}\mathbf{x})\), where \(\mathbf{x}\) belongs to a subset \(\mathcal{D}\subset {\mathbb{R}}^{N}\) with N being large, \(\mathbf{A} \in {\mathbb{R}}^{m\times N}\) with mN, and g is a smooth function on an m-dimensional domain. Both g and \(\mathbf{A}\) are unknown a priori and are to be reconstructed from suitable samples of f. Again, under suitable assumptions on g and on \(\mathbf{A}\), one can build on methods from compressive sensing to recover f from a relatively small number of samples. We refer to [266, 124, 206] for details.

Hints for Preparing a Course. This book can be used for a course on compressive sensing at the graduate level. Although the whole material exceeds what can be reasonably covered in a one-semester class, properly selected topics do convert into self-contained components. We suggest the following possibilities:
  • For a comprehensive treatment of the deterministic issues, Chaps.  2–6 complemented by  Chap. 10 are appropriate. If a proof of the restricted isometry property for random matrices is desired, one can add the simple arguments of Sect.  9.1, which only rely on a few tools from  Chap. 7. In a class lasting only one quarter rather than one semester, one can remove Sect.  4.5 and mention only briefly the stability and robustness results of Chaps.  4 and  6. One can also concentrate only on \(\ell_1\)-minimization and discard  Chap. 3 as well as Sects.  5.3, 5.5, 6.3, and 6.4 if the variety of algorithms is not a priority.

  • On the other hand, for a course focusing on algorithmic aspects, Chaps.  2–6 as well as (parts of)  Chap. 15 are appropriate, possibly replacing  Chap. 5 by  Chap. 13 and including (parts of) Appendix B.

  • For a course focusing on probabilistic issues, we recommend Chaps.  7–9 and Chaps. 11, 12, and 14. This can represent a second one-semester class. However, if this material has to be delivered as a first course,  Chap. 4 (especially Sects.  4.1 and  4.4) and  Chap. 6 (especially Sects.  6.1 and  6.2) need to be included.

Of course, parts of particular chapters may also be dropped depending on the desired emphasis.

We will be happy to receive feedback on these suggestions from instructors using this book in their class. They may also contact us to obtain typed-out solutions for some of the exercises.

References

  5. M. Aharon, M. Elad, A. Bruckstein, The K-SVD: An algorithm for designing of overcomplete dictionaries for sparse representation. IEEE Trans. Signal Process. 54(11), 4311–4322 (2006)
  18. M. Anthony, P. Bartlett, Neural Network Learning: Theoretical Foundations (Cambridge University Press, Cambridge, 1999)
  23. F. Bach, R. Jenatton, J. Mairal, G. Obozinski, Optimization with sparsity-inducing penalties. Found. Trends Mach. Learn. 4(1), 1–106 (2012)
  27. W. Bajwa, J. Haupt, A.M. Sayeed, R. Nowak, Compressed channel sensing: a new approach to estimating sparse multipath channels. Proc. IEEE 98(6), 1058–1076 (2010)
  29. R.G. Baraniuk, Compressive sensing. IEEE Signal Process. Mag. 24(4), 118–121 (2007)
  30. R.G. Baraniuk, V. Cevher, M. Duarte, C. Hedge, Model-based compressive sensing. IEEE Trans. Inform. Theor. 56, 1982–2001 (2010)
  39. J.J. Benedetto, P.J.S.G. Ferreira (eds.), Modern Sampling Theory: Mathematics and Applications. Applied and Numerical Harmonic Analysis (Birkhäuser, Boston, MA, 2001)
  41. R. Berinde, A. Gilbert, P. Indyk, H. Karloff, M. Strauss, Combining geometry and combinatorics: A unified approach to sparse signal recovery. In Proc. of 46th Annual Allerton Conference on Communication, Control, and Computing, pp. 798–805, 2008
  48. P. Bickel, Y. Ritov, A. Tsybakov, Simultaneous analysis of lasso and Dantzig selector. Ann. Stat. 37(4), 1705–1732 (2009)
  73. A. Bruckstein, D.L. Donoho, M. Elad, From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Rev. 51(1), 34–81 (2009)
  76. P. Bühlmann, S. van de Geer, Statistics for High-dimensional Data. Springer Series in Statistics (Springer, Berlin, 2011)
  84. E.J. Candès, Compressive sampling. In Proceedings of the International Congress of Mathematicians, Madrid, Spain, 2006
  90. E.J. Candès, B. Recht, Exact matrix completion via convex optimization. Found. Comput. Math. 9, 717–772 (2009)
  92. E.J. Candès, J. Romberg, Quantitative robust uncertainty principles and optimally sparse decompositions. Found. Comput. Math. 6(2), 227–254 (2006)
  94. E.J. Candès, J. Romberg, T. Tao, Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inform. Theor. 52(2), 489–509 (2006)
  96. E.J. Candès, T. Tao, Decoding by linear programming. IEEE Trans. Inform. Theor. 51(12), 4203–4215 (2005)
  97. E.J. Candès, T. Tao, Near optimal signal recovery from random projections: universal encoding strategies? IEEE Trans. Inform. Theor. 52(12), 5406–5425 (2006)
  98. E.J. Candès, T. Tao, The Dantzig selector: statistical estimation when p is much larger than n. Ann. Stat. 35(6), 2313–2351 (2007)
  99. E.J. Candès, T. Tao, The power of convex relaxation: near-optimal matrix completion. IEEE Trans. Inform. Theor. 56(5), 2053–2080 (2010)
  100. E.J. Candès, M. Wakin, An introduction to compressive sampling. IEEE Signal Process. Mag. 25(2), 21–30 (2008)
  104. A. Chambolle, V. Caselles, D. Cremers, M. Novaga, T. Pock, An introduction to total variation for image analysis. In Theoretical Foundations and Numerical Methods for Sparse Recovery, ed. by M. Fornasier. Radon Series on Computational and Applied Mathematics, vol. 9 (de Gruyter, Berlin, 2010), pp. 263–340
  105. A. Chambolle, R.A. DeVore, N.-Y. Lee, B.J. Lucier, Nonlinear wavelet image processing: variational problems, compression, and noise removal through wavelet shrinkage. IEEE Trans. Image Process. 7(3), 319–335 (1998)
  106. A. Chambolle, P.-L. Lions, Image recovery via total variation minimization and related problems. Numer. Math. 76(2), 167–188 (1997)
  107. A. Chambolle, T. Pock, A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imag. Vis. 40, 120–145 (2011)
  114. S.S. Chen, D.L. Donoho, M.A. Saunders, Atomic decomposition by Basis Pursuit. SIAM J. Sci. Comput. 20(1), 33–61 (1999)
  122. A. Cohen, Numerical Analysis of Wavelet Methods (North-Holland, Amsterdam, 2003)
  124. A. Cohen, I. Daubechies, R. DeVore, G. Kerkyacharian, D. Picard, Capturing Ridge Functions in High Dimensions from Point Queries. Constr. Approx. 35, 225–243 (2012)
  125. A. Cohen, R. DeVore, S. Foucart, H. Rauhut, Recovery of functions of many variables via compressive sensing. In Proc. SampTA 2011, Singapore, 2011
  131. G. Cormode, S. Muthukrishnan, Combinatorial algorithms for compressed sensing. In CISS, Princeton, 2006
  133. F. Cucker, S. Smale, On the mathematical foundations of learning. Bull. Am. Math. Soc., New Ser. 39(1), 1–49 (2002)
  134. F. Cucker, D.-X. Zhou, Learning Theory: An Approximation Theory Viewpoint. Cambridge Monographs on Applied and Computational Mathematics (Cambridge University Press, Cambridge, 2007)
  137. I. Daubechies, Ten Lectures on Wavelets, CBMS-NSF Regional Conference Series in Applied Mathematics, vol. 61 (SIAM, Philadelphia, 1992)
  138. I. Daubechies, M. Defrise, C. De Mol, An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm. Pure Appl. Math. 57(11), 1413–1457 (2004)
  143. M. Davies, Y. Eldar, Rank awareness in joint sparse recovery. IEEE Trans. Inform. Theor. 58(2), 1135–1146 (2012)
  147. C. De Mol, E. De Vito, L. Rosasco, Elastic-net regularization in learning theory. J. Complex. 25(2), 201–230 (2009)
  149. R.A. DeVore, G. Petrova, P. Wojtaszczyk, Approximation of functions of few variables in high dimensions. Constr. Approx. 33(1), 125–143 (2011)
  150. D.L. Donoho, De-noising by soft-thresholding. IEEE Trans. Inform. Theor. 41(3), 613–627 (1995)
  152. D.L. Donoho, Compressed sensing. IEEE Trans. Inform. Theor. 52(4), 1289–1306 (2006)
  154. D.L. Donoho, High-dimensional centrally symmetric polytopes with neighborliness proportional to dimension. Discrete Comput. Geom. 35(4), 617–652 (2006)
  155. D.L. Donoho, M. Elad, Optimally sparse representations in general (non-orthogonal) dictionaries via \(\ell_1\) minimization. Proc. Nat. Acad. Sci. 100(5), 2197–2202 (2003)
  156. D.L. Donoho, M. Elad, On the stability of the basis pursuit in the presence of noise. Signal Process. 86(3), 511–532 (2006)
  157. D.L. Donoho, M. Elad, V.N. Temlyakov, Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Trans. Inform. Theor. 52(1), 6–18 (2006)
  158. D.L. Donoho, X. Huo, Uncertainty principles and ideal atomic decompositions. IEEE Trans. Inform. Theor. 47(7), 2845–2862 (2001)
  159. D.L. Donoho, I.M. Johnstone, Minimax estimation via wavelet shrinkage. Ann. Stat. 26(3), 879–921 (1998)
  160. D.L. Donoho, G. Kutyniok, Microlocal analysis of the geometric separation problem. Comm. Pure Appl. Math. 66(1), 1–47 (2013)
  161. D.L. Donoho, B. Logan, Signal recovery and the large sieve. SIAM J. Appl. Math. 52(2), 577–591 (1992)
  163. D.L. Donoho, P. Stark, Recovery of a sparse signal when the low frequency information is missing. Technical report, Department of Statistics, University of California, Berkeley, June 1989
  164. D.L. Donoho, P. Stark, Uncertainty principles and signal recovery. SIAM J. Appl. Math. 48(3), 906–931 (1989)
  165. D.L. Donoho, J. Tanner, Neighborliness of randomly projected simplices in high dimensions. Proc. Natl. Acad. Sci. USA 102(27), 9452–9457 (2005)
  166. D.L. Donoho, J. Tanner, Sparse nonnegative solutions of underdetermined linear equations by linear programming. Proc. Natl. Acad. Sci. 102(27), 9446–9451 (2005)
  167. D.L. Donoho, J. Tanner, Counting faces of randomly-projected polytopes when the projection radically lowers dimension. J. Am. Math. Soc. 22(1), 1–53 (2009)
  52. 170.
    D.L. Donoho, M. Vetterli, R.A. DeVore, I. Daubechies, Data compression and harmonic analysis. IEEE Trans. Inform. Theor. 44(6), 2435–2476 (1998)MathSciNetMATHCrossRefGoogle Scholar
  53. 171.
    R. Dorfman, The detection of defective members of large populations. Ann. Stat. 14, 436–440 (1943)CrossRefGoogle Scholar
  54. 173.
    D.-Z. Du, F. Hwang, Combinatorial Group Testing and Its Applications (World Scientific, Singapore, 1993)MATHGoogle Scholar
  55. 174.
    M. Duarte, M. Davenport, D. Takhar, J. Laska, S. Ting, K. Kelly, R.G. Baraniuk, Single-Pixel Imaging via Compressive Sampling. IEEE Signal Process. Mag. 25(2), 83–91 (2008)CrossRefGoogle Scholar
  56. 179.
    M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing (Springer, New York, 2010)CrossRefGoogle Scholar
  57. 180.
    M. Elad, M. Aharon, Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans. Image Process. 15(12), 3736 –3745 (2006)MathSciNetCrossRefGoogle Scholar
  58. 181.
    M. Elad, A.M. Bruckstein, A generalized uncertainty principle and sparse representation in pairs of bases. IEEE Trans. Inform. Theor. 48(9), 2558–2567 (2002)MathSciNetMATHCrossRefGoogle Scholar
  59. 182.
    Y. Eldar, G. Kutyniok (eds.), Compressed Sensing: Theory and Applications (Cambridge University Press, New York, 2012)Google Scholar
  60. 183.
    Y. Eldar, M. Mishali, Robust recovery of signals from a structured union of subspaces. IEEE Trans. Inform. Theor. 55(11), 5302–5316 (2009)MathSciNetCrossRefGoogle Scholar
  61. 184.
    Y. Eldar, H. Rauhut, Average case analysis of multichannel sparse recovery using convex relaxation. IEEE Trans. Inform. Theor. 56(1), 505–519 (2010)MathSciNetCrossRefGoogle Scholar
  62. 185.
    J. Ender, On compressive sensing applied to radar. Signal Process. 90(5), 1402–1414 (2010)MATHCrossRefGoogle Scholar
  63. 186.
    H.W. Engl, M. Hanke, A. Neubauer, Regularization of Inverse Problems (Springer, New York, 1996)MATHCrossRefGoogle Scholar
  64. 189.
    A. Fannjiang, P. Yan, T. Strohmer, Compressed remote sensing of sparse objects. SIAM J. Imag. Sci. 3(3), 596–618 (2010)MathSciNetCrossRefGoogle Scholar
  65. 190.
    M. Fazel, Matrix Rank Minimization with Applications. PhD thesis, 2002Google Scholar
  66. 195.
    P.J.S.G. Ferreira, J.R. Higgins, The establishment of sampling as a scientific principle—a striking case of multiple discovery. Not. AMS 58(10), 1446–1450 (2011)MathSciNetMATHGoogle Scholar
  67. 203.
    M. Fornasier, H. Rauhut, Recovery algorithms for vector valued data with joint sparsity constraints. SIAM J. Numer. Anal. 46(2), 577–613 (2008)MathSciNetMATHCrossRefGoogle Scholar
  68. 204.
    M. Fornasier, H. Rauhut, Compressive sensing. In Handbook of Mathematical Methods in Imaging, ed. by O. Scherzer (Springer, New York, 2011), pp. 187–228CrossRefGoogle Scholar
  69. 206.
    M. Fornasier, K. Schnass, J. Vybiral, Learning Functions of Few Arbitrary Linear Parameters in High Dimensions. Found. Comput. Math. 12, 229–262 (2012)MathSciNetMATHCrossRefGoogle Scholar
  70. 215.
    J.J. Fuchs, On sparse representations in arbitrary redundant bases. IEEE Trans. Inform. Theor. 50(6), 1341–1344 (2004)CrossRefGoogle Scholar
  71. 219.
    A. Garnaev, E. Gluskin, On widths of the Euclidean ball. Sov. Math. Dokl. 30, 200–204 (1984)MATHGoogle Scholar
  72. 221.
    Q. Geng, J. Wright, On the local correctness of 1-minimization for dictionary learning. Preprint (2011)Google Scholar
  73. 222.
    A. Gilbert, M. Strauss, Analysis of data streams. Technometrics 49(3), 346–356 (2007)MathSciNetCrossRefGoogle Scholar
  74. 223.
    A.C. Gilbert, S. Muthukrishnan, S. Guha, P. Indyk, M. Strauss, Near-Optimal Sparse Fourier Representations via Sampling. In Proceedings of the Thiry-fourth Annual ACM Symposium on Theory of Computing, STOC ’02, pp. 152–161, ACM, New York, NY, USA, 2002Google Scholar
  75. 224.
    A.C. Gilbert, S. Muthukrishnan, M.J. Strauss, Approximation of functions over redundant dictionaries using coherence. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’03, pp. 243–252. SIAM, Philadelphia, PA, 2003Google Scholar
  76. 225.
    A.C. Gilbert, M. Strauss, J.A. Tropp, R. Vershynin, One sketch for all: fast algorithms for compressed sensing. In Proceedings of the Thirty-ninth Annual ACM Symposium on Theory of Computing, STOC ’07, pp. 237–246, ACM, New York, NY, USA, 2007Google Scholar
  77. 227.
    E. Gluskin, Norms of random matrices and widths of finite-dimensional sets. Math. USSR-Sb. 48, 173–182 (1984)MATHCrossRefGoogle Scholar
  78. 238.
    R. Gribonval, Sparse decomposition of stereo signals with matching pursuit and application to blind separation of more than two sources from a stereo mixture. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP’02), vol. 3, pp. 3057–3060, 2002Google Scholar
  79. 239.
    R. Gribonval, M. Nielsen, Sparse representations in unions of bases. IEEE Trans. Inform. Theor. 49(12), 3320–3325 (2003)MathSciNetCrossRefGoogle Scholar
  80. 241.
    R. Gribonval, H. Rauhut, K. Schnass, P. Vandergheynst, Atoms of all channels, unite! Average case analysis of multi-channel sparse recovery using greedy algorithms. J. Fourier Anal. Appl. 14(5), 655–687 (2008)MathSciNetMATHGoogle Scholar
  81. 242.
    R. Gribonval, K. Schnass, Dictionary identification—sparse matrix-factorisation via l 1-minimisation. IEEE Trans. Inform. Theor. 56(7), 3523–3539 (2010)MathSciNetCrossRefGoogle Scholar
  82. 244.
    K. Gröchenig, Foundations of Time-Frequency Analysis. Applied and Numerical Harmonic Analysis (Birkhäuser, Boston, MA, 2001)Google Scholar
  83. 245.
    D. Gross, Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. Inform. Theor. 57(3), 1548–1566 (2011)CrossRefGoogle Scholar
  84. 246.
    D. Gross, Y.-K. Liu, S.T. Flammia, S. Becker, J. Eisert, Quantum state tomography via compressed sensing. Phys. Rev. Lett. 105, 150401 (2010)CrossRefGoogle Scholar
  85. 249.
    C. Güntürk, M. Lammers, A. Powell, R. Saab, Ö. Yilmaz, Sobolev duals for random frames and ΣΔ quantization of compressed sensing measurements. Found. Comput. Math. 13(1), 1–36, Springer-Verlag (2013)Google Scholar
  86. 252.
    M. Haacke, R. Brown, M. Thompson, R. Venkatesan, Magnetic Resonance Imaging: Physical Principles and Sequence Design (Wiley-Liss, New York, 1999)Google Scholar
  87. 255.
    J. Haldar, D. Hernando, Z. Liang, Compressed-sensing MRI with random encoding. IEEE Trans. Med. Imag. 30(4), 893–903 (2011)CrossRefGoogle Scholar
  88. 261.
    H. Hassanieh, P. Indyk, D. Katabi, E. Price, Nearly optimal sparse Fourier transform. In Proceedings of the 44th Symposium on Theory of Computing, STOC ’12, pp. 563–578, ACM, New York, NY, USA, 2012Google Scholar
  89. 262.
    H. Hassanieh, P. Indyk, D. Katabi, E. Price, Simple and practical algorithm for sparse Fourier transform. In Proceedings of the Twenty-third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’12, pp. 1183–1194. SIAM, 2012Google Scholar
  90. 266.
    T. Hemant, V. Cevher, Learning non-parametric basis independent models from point queries via low-rank methods. Preprint (2012)Google Scholar
  91. 267.
    W. Hendee, C. Morgan, Magnetic resonance imaging Part I—Physical principles. West J. Med. 141(4), 491–500 (1984)Google Scholar
  92. 268.
    M. Herman, T. Strohmer, High-resolution radar via compressed sensing. IEEE Trans. Signal Process. 57(6), 2275–2284 (2009)MathSciNetCrossRefGoogle Scholar
  93. 269.
    F. Herrmann, M. Friedlander, O. Yilmaz, Fighting the curse of dimensionality: compressive sensing in exploration seismology. Signal Process. Mag. IEEE 29(3), 88–100 (2012)CrossRefGoogle Scholar
  94. 270.
    F. Herrmann, H. Wason, T. Lin, Compressive sensing in seismic exploration: an outlook on a new paradigm. CSEG Recorder 36(4), 19–33 (2011)Google Scholar
  95. 271.
    J.R. Higgins, Sampling Theory in Fourier and Signal Analysis: Foundations, vol. 1 (Clarendon Press, Oxford, 1996)MATHGoogle Scholar
  96. 272.
    J.R. Higgins, R.L. Stens, Sampling Theory in Fourier and Signal Analysis: Advanced Topics, vol. 2 (Oxford University Press, Oxford, 1999)MATHGoogle Scholar
  97. 278.
    D. Holland, M. Bostock, L. Gladden, D. Nietlispach, Fast multidimensional NMR spectroscopy using compressed sensing. Angew. Chem. Int. Ed. 50(29), 6548–6551 (2011)CrossRefGoogle Scholar
  98. 282.
    W. Huffman, V. Pless, Fundamentals of Error-correcting Codes (Cambridge University Press, Cambridge, 2003)MATHCrossRefGoogle Scholar
  99. 283.
    M. Hügel, H. Rauhut, T. Strohmer, Remote sensing via 1-minimization. Found. Comput. Math., to appear. (2012)Google Scholar
  100. 285.
    P. Indyk, A. Gilbert, Sparse recovery using sparse matrices. Proc. IEEE 98(6), 937–947 (2010)CrossRefGoogle Scholar
  101. 287.
    M. Iwen, Combinatorial sublinear-time Fourier algorithms. Found. Comput. Math. 10(3), 303–338 (2010)MathSciNetMATHCrossRefGoogle Scholar
  102. 288.
    M. Iwen, Improved approximation guarantees for sublinear-time Fourier algorithms. Appl. Comput. Harmon. Anal. 34(1), 57–82 (2013)MathSciNetMATHCrossRefGoogle Scholar
  103. 289.
    M. Iwen, A. Gilbert, M. Strauss, Empirical evaluation of a sub-linear time sparse DFT algorithm. Commun. Math. Sci. 5(4), 981–998 (2007)MathSciNetMATHGoogle Scholar
  104. 290.
    L. Jacques, J. Laska, P. Boufounos, R. Baraniuk, Robust 1-bit compressive sensing via binary stable embeddings of sparse vectors. IEEE Trans. Inform. Theor. 59(4), 2082–2102 (2013)MathSciNetCrossRefGoogle Scholar
  105. 294.
    A.J. Jerri, The Shannon sampling theorem—its various extensions and applications: A tutorial review. Proc. IEEE. 65(11), 1565–1596 (1977)MATHCrossRefGoogle Scholar
  106. 299.
    B. Kashin, Diameters of some finite-dimensional sets and classes of smooth functions. Math. USSR, Izv. 11, 317–333 (1977)Google Scholar
  107. 316.
    J. Laska, P. Boufounos, M. Davenport, R. Baraniuk, Democracy in action: quantization, saturation, and compressive sensing. Appl. Comput. Harmon. Anal. 31(3), 429–443 (2011)MathSciNetMATHCrossRefGoogle Scholar
  108. 330.
    Y. Liu, Universal low-rank matrix recovery from Pauli measurements. In NIPS, pp. 1638–1646, 2011Google Scholar
  109. 331.
    A. Llagostera Casanovas, G. Monaci, P. Vandergheynst, R. Gribonval, Blind audiovisual source separation based on sparse redundant representations. IEEE Trans. Multimed. 12(5), 358–371 (August 2010)CrossRefGoogle Scholar
  110. 332.
    B. Logan, Properties of High-Pass Signals. PhD thesis, Columbia University, New York, 1965Google Scholar
  111. 338.
    M. Lustig, D.L. Donoho, J. Pauly, Sparse MRI: The application of compressed sensing for rapid MR imaging. Magn. Reson. Med. 58(6), 1182–1195 (2007)CrossRefGoogle Scholar
  112. 342.
    S. Mallat, Z. Zhang, Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Process. 41(12), 3397–3415 (1993)MATHCrossRefGoogle Scholar
  113. 344.
    S. Marple, Digital Spectral Analysis with Applications (Prentice-Hall, Englewood Cliffs, 1987)Google Scholar
  114. 353.
    M. Mishali, Y.C. Eldar, From theory to practice: Sub-nyquist sampling of sparse wideband analog signals. IEEE J. Sel. Top. Signal Process. 4(2), 375–391 (April 2010)CrossRefGoogle Scholar
  115. 358.
    M. Murphy, M. Alley, J. Demmel, K. Keutzer, S. Vasanawala, M. Lustig, Fast 1-SPIRiT Compressed Sensing Parallel Imaging MRI: Scalable Parallel Implementation and Clinically Feasible Runtime. IEEE Trans. Med. Imag. 31(6), 1250–1262 (2012)CrossRefGoogle Scholar
  116. 359.
    B.K. Natarajan, Sparse approximate solutions to linear systems. SIAM J. Comput. 24, 227–234 (1995)MathSciNetMATHCrossRefGoogle Scholar
  117. 364.
    D. Needell, R. Ward, Stable image reconstruction using total variation minimization. Preprint (2012)Google Scholar
  118. 370.
    E. Novak, Optimal recovery and n-widths for convex classes of functions. J. Approx. Theor. 80(3), 390–408 (1995)MATHCrossRefGoogle Scholar
  119. 371.
    E. Novak, H. Woźniakowski, Tractability of Multivariate Problems. Vol. 1: Linear Information. EMS Tracts in Mathematics, vol. 6 (European Mathematical Society (EMS), Zürich, 2008)Google Scholar
  120. 384.
    G. Pfander, H. Rauhut, J. Tanner, Identification of matrices having a sparse representation. IEEE Trans. Signal Process. 56(11), 5376–5388 (2008)MathSciNetCrossRefGoogle Scholar
  121. 385.
    G. Pfander, H. Rauhut, J. Tropp, The restricted isometry property for time-frequency structured random matrices. Prob. Theor. Relat. Field. to appearGoogle Scholar
  122. 393.
    Y. Plan, R. Vershynin, One-bit compressed sensing by linear programming. Comm. Pure Appl. Math. 66(8), 1275–1297 (2013)MATHCrossRefGoogle Scholar
  123. 394.
    Y. Plan, R. Vershynin, Robust 1-bit compressed sensing and sparse logistic regression: a convex programming approach. IEEE Trans. Inform. Theor. 59(1), 482–494 (2013)MathSciNetCrossRefGoogle Scholar
  124. 397.
    L. Potter, E. Ertin, J. Parker, M. Cetin, Sparsity and compressed sensing in radar imaging. Proc. IEEE 98(6), 1006–1020 (2010)CrossRefGoogle Scholar
  125. 401.
    D. Potts, M. Tasche, Parameter estimation for exponential sums by approximate Prony method. Signal Process. 90(5), 1631–1642 (2010)MATHCrossRefGoogle Scholar
  126. 402.
    R. Prony, Essai expérimental et analytique sur les lois de la Dilatabilité des fluides élastiques et sur celles de la Force expansive de la vapeur de l’eau et de la vapeur de l’alkool, à différentes températures. J. École Polytechnique 1, 24–76 (1795)Google Scholar
  127. 406.
    R. Ramlau, G. Teschke, Sparse recovery in inverse problems. In Theoretical Foundations and Numerical Methods for Sparse Recovery, ed. by M. Fornasier. Radon Series on Computational and Applied Mathematics, vol. 9 (de Gruyter, Berlin, 2010), pp. 201–262Google Scholar
  128. 407.
    M. Raphan, E. Simoncelli, Optimal denoising in redundant representation. IEEE Trans. Image Process. 17(8), 1342–1352 (2008)MathSciNetCrossRefGoogle Scholar
  129. 408.
    H. Rauhut, Random sampling of sparse trigonometric polynomials. Appl. Comput. Harmon. Anal. 22(1), 16–42 (2007)MathSciNetMATHCrossRefGoogle Scholar
  130. 409.
    H. Rauhut, On the impossibility of uniform sparse reconstruction using greedy methods. Sampl. Theor. Signal Image Process. 7(2), 197–215 (2008)MathSciNetMATHGoogle Scholar
  131. 411.
    H. Rauhut, Compressive sensing and structured random matrices. In Theoretical Foundations and Numerical Methods for Sparse Recovery, ed. by M. Fornasier. Radon Series on Computational and Applied Mathematics, vol. 9 (de Gruyter, Berlin, 2010), pp. 1–92Google Scholar
  132. 412.
    H. Rauhut, G.E. Pfander, Sparsity in time-frequency representations. J. Fourier Anal. Appl. 16(2), 233–260 (2010)MathSciNetMATHCrossRefGoogle Scholar
  133. 416.
    H. Rauhut, R. Ward, Sparse Legendre expansions via 1-minimization. J. Approx. Theor. 164(5), 517–533 (2012)MathSciNetMATHCrossRefGoogle Scholar
  134. 417.
    B. Recht, A simpler approach to matrix completion. J. Mach. Learn. Res. 12, 3413–3430 (2011)MathSciNetGoogle Scholar
  135. 418.
    B. Recht, M. Fazel, P. Parrilo, Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 52(3), 471–501 (2010)MathSciNetMATHCrossRefGoogle Scholar
  136. 427.
    J.K. Romberg, Imaging via compressive sampling. IEEE Signal Process. Mag. 25(2), 14–20 (March, 2008)CrossRefGoogle Scholar
  137. 429.
    R. Rubinstein, M. Zibulevsky, M. Elad, Double sparsity: learning sparse dictionaries for sparse signal approximation. IEEE Trans. Signal Process. 58(3, part 2), 1553–1564 (2010)Google Scholar
  138. 431.
    M. Rudelson, R. Vershynin, Geometric approach to error-correcting codes and reconstruction of signals. Int. Math. Res. Not. 64, 4019–4041 (2005)MathSciNetCrossRefGoogle Scholar
  139. 436.
    L. Rudin, S. Osher, E. Fatemi, Nonlinear total variation based noise removal algorithms. Physica D 60(1–4), 259–268 (1992)MATHCrossRefGoogle Scholar
  140. 441.
    F. Santosa, W. Symes, Linear inversion of band-limited reflection seismograms. SIAM J. Sci. Stat. Comput. 7(4), 1307–1330 (1986)MathSciNetMATHCrossRefGoogle Scholar
  141. 444.
    B. Schölkopf, A. Smola, Learning with Kernels (MIT Press, Cambridge, 2002)Google Scholar
  142. 447.
    Y. Shrot, L. Frydman, Compressed sensing and the reconstruction of ultrafast 2D NMR data: Principles and biomolecular applications. J. Magn. Reson. 209(2), 352–358 (2011)CrossRefGoogle Scholar
  143. 450.
    J.-L. Starck, E.J. Candès, D.L. Donoho, The curvelet transform for image denoising. IEEE Trans. Image Process. 11(6), 670–684 (2002)MathSciNetCrossRefGoogle Scholar
  144. 451.
    J.-L. Starck, F. Murtagh, J. Fadili, Sparse Image and Signal Processing: Wavelets, Curvelets, Morphological Diversity (Cambridge University Press, Cambridge, 2010)CrossRefGoogle Scholar
  145. 455.
    T. Strohmer, B. Friedlander, Analysis of sparse MIMO radar. Preprint (2012)Google Scholar
  146. 468.
    G. Tauböck, F. Hlawatsch, D. Eiwen, H. Rauhut, Compressive estimation of doubly selective channels in multicarrier systems: leakage effects and sparsity-enhancing processing. IEEE J. Sel. Top. Sig. Process. 4(2), 255–271 (2010)CrossRefGoogle Scholar
  147. 469.
    H. Taylor, S. Banks, J. McCoy, Deconvolution with the 1-norm. Geophysics 44(1), 39–52 (1979)CrossRefGoogle Scholar
  148. 472.
    V. Temlyakov, Greedy Approximation. Cambridge Monographs on Applied and Computational Mathematics, vol. 20 (Cambridge University Press, Cambridge, 2011)Google Scholar
  149. 473.
    R. Tibshirani, Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc. B 58(1), 267–288 (1996)MathSciNetMATHGoogle Scholar
  150. 474.
    J. Traub, G. Wasilkowski, H. Woźniakowski, Information-based Complexity. Computer Science and Scientific Computing (Academic Press Inc., Boston, MA, 1988) With contributions by A.G.Werschulz, T. Boult.Google Scholar
  151. 476.
    J.A. Tropp, Greed is good: Algorithmic results for sparse approximation. IEEE Trans. Inform. Theor. 50(10), 2231–2242 (2004)MathSciNetCrossRefGoogle Scholar
  152. 478.
    J.A. Tropp, Algorithms for simultaneous sparse approximation. Part II: Convex relaxation. Signal Process. 86(3), 589–602 (2006)MathSciNetMATHCrossRefGoogle Scholar
  153. 479.
    J.A. Tropp, Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Trans. Inform. Theor. 51(3), 1030–1051 (2006)MathSciNetCrossRefGoogle Scholar
  154. 482.
    J.A. Tropp, On the linear independence of spikes and sines. J. Fourier Anal. Appl. 14(5–6), 838–858 (2008)MathSciNetMATHCrossRefGoogle Scholar
  155. 487.
    J.A. Tropp, A.C. Gilbert, M.J. Strauss, Algorithms for simultaneous sparse approximation. Part I: Greedy pursuit. Signal Process. 86(3), 572–588 (2006)MATHCrossRefGoogle Scholar
  156. 488.
    J.A. Tropp, J.N. Laska, M.F. Duarte, J.K. Romberg, R.G. Baraniuk, Beyond Nyquist: Efficient sampling of sparse bandlimited signals. IEEE Trans. Inform. Theor. 56(1), 520–544 (2010)MathSciNetCrossRefGoogle Scholar
  157. 497.
    S. Vasanawala, M. Alley, B. Hargreaves, R. Barth, J. Pauly, M. Lustig, Improved pediatric MR imaging with compressed sensing. Radiology 256(2), 607–616 (2010)CrossRefGoogle Scholar
  158. 507.
    Y. Wiaux, L. Jacques, G. Puy, A. Scaife, P. Vandergheynst, Compressed sensing imaging techniques for radio interferometry. Mon. Not. Roy. Astron. Soc. 395(3), 1733–1742 (2009)CrossRefGoogle Scholar
  159. 508.
    P. Wojtaszczyk, A Mathematical Introduction to Wavelets (Cambridge University Press, Cambridge, 1997)MATHCrossRefGoogle Scholar
  160. 512.
    G. Wright, Magnetic resonance imaging. IEEE Signal Process. Mag. 14(1), 56–66 (1997)CrossRefGoogle Scholar
  161. 513.
    J. Wright, A. Yang, A. Ganesh, S. Sastry, Y. Ma, Robust Face Recognition via Sparse Representation. IEEE Trans. Pattern Anal. Mach. Intell. 31(2), 210–227 (2009)CrossRefGoogle Scholar
  162. 519.
    J. Zou, A.C. Gilbert, M. Strauss, I. Daubechies, Theoretical and experimental analysis of a randomized algorithm for sparse Fourier transform analysis. J. Comput. Phys. 211, 572–595 (2005)MathSciNetCrossRefGoogle Scholar
  163. 520.
    A. Zymnis, S. Boyd, E.J. Candès, Compressed sensing with quantized measurements. IEEE Signal Process. Lett. 17(2), 149–152 (2010)CrossRefGoogle Scholar

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Simon Foucart, Department of Mathematics, Drexel University, Philadelphia, USA
  • Holger Rauhut, Lehrstuhl C für Mathematik (Analysis), RWTH Aachen University, Aachen, Germany
