Coresets-Methods and History: A Theoreticians Design Pattern for Approximation and Streaming Algorithms

We present a technical survey on the state of the art approaches in data reduction and the coreset framework. These include geometric decompositions, gradient methods, random sampling, sketching and random projections. We further outline their importance for the design of streaming algorithms and give a brief overview on lower bounding techniques.


Introduction
More is more is one of the central tenets associated with Big Data. More data means more information from which we hope to gain a better understanding of an underlying truth. The artificial intelligence and machine learning communities have focused on modeling and learning increasingly elaborate statistical models for obtaining new knowledge from the data. However, in the era of Big Data, scalability has become essential for any learning algorithm. In light of this, research has begun to focus on aggregation tools that provide tradeoffs between information and space requirement. Note, that these are also extremely valuable for practical applications. The design of a data aggregation can be decoupled from the design of the actual learning algorithm. This is commonly referred to as the sketch-and-solve paradigm. The main idea is to reduce the size of the data first, to have a sketch whose size has only little or even no dependency on the initial size 1 : "Big Data turns into tiny data!" Then a learning algorithm is applied to the little sketch only. Its complexity measures like running time, memory, and communication are thus significantly less dependent on the size of the initial data. In most cases, the learning algorithm remains unchanged. For a given data aggregation, the main challenge is to bound the trade-off between its size and its loss of information.
What constitutes relevant information is, of course, application dependent. Coresets are an algorithmic framework towards quantifying such trade-offs for various objective functions of interest. Generally speaking, we are given a data set A, a set of candidate solutions C, and some optimization function f. Our aim is to find a significantly smaller data set S such that for all candidate solutions c ∈ C , f(S, c) approximates f(A, c).
Coresets turn out be extremely useful in the context of both, approximation and streaming algorithms. Given that a problem admits a coreset whose size is independent of the input, a polynomial time approximation scheme (PTAS) can often be derived via brute force enumeration. While such a PTAS will not necessarily be of practical relevance, this example illustrates how coresets can make computationally expensive algorithms more viable. In a streaming setting, we are given the further restriction that we do not have a random access to the data, but must process the data points one by one. It turns out that if a problem admits a coreset, there exists a black box reduction to streaming algorithms commonly referred to as the merge-and-reduce principle, cf. [49]. It is thus not necessary to develop a special streaming algorithm for that problem.
Coresets have been studied for several base problems that arise in many applications of artificial intelligence and machine learning. Some examples include clustering [4,17,42,50,75], classification [56,59,89], regression [34,40,41,54], and the smallest enclosing ball problem [6,15,16,48]; we refer to [84] for a recent extensive literature overview. If one of these base problems arises in some application, a coreset construction can often be used in a black box manner to improve the performance of the learning algorithm. This survey, in contrast, is intended to give the reader an overview over existing methods to design and analyze new coreset constructions for individual learning tasks of interest. We believe such an methodological introduction to coresets, and their benefits in the design of fast approximation-and streaming algorithms can be very valuable to the artificial intelligence community.
This survey is now organized into the following parts. In Sect. 2, we describe some basic facts and notations and give a formal definition of the strong coreset guarantee. In Sect. 3 For each technique we will consider an example problem and a coreset construction. We further give an overview on results in this line of research. In Sect. 4, we show how coresets may be used in the context of streaming algorithms. In Sect. 5, we discuss lower bounds on coreset sizes, and provide example problems for which coresets do not exist. We conclude with some open problems.

Preliminaries
We assume 0 < , ≤ 1 2 for all approximation resp. failure parameters in this paper. We first give the definition of the strong coreset guarantee.

Definition 1 (Strong Coresets) Let
A be a subset of points of some universe U and let C be a set of candidate solutions. Let f ∶ U × C → ℝ ≥0 be a non-negative measurable function. Then a set S ⊂ U is an -coreset, if for all candidate solutions c ∈ C we have This first paper to formalize the coreset framework is that of Agarwal et al. [4]. In many cases, the construction of a coreset is a randomized process. Here, we account for the probability of failure via the parameter , which in turn influences the size of S. We are interested in coresets S whose size is significantly smaller than |A|. Specifically, we aim to find an S such that |S| ∈ polylog(|A|) 2 or even such that |S| has no dependency on |A|. The size of |S| will almost always depend on , and often on properties of the universe U (e.g. a dependency on d if the points lie d-dimensional Euclidean space) or on properties of f (e.g. a dependency on k, if when we are dealing with the k-means objective function). If there exists no constant c > 0 such that |S| ∈ o(|A| 1−c ) , we say that no coreset exists.
There exists another notion of coreset whose use is not as clear cut as the strong coreset. Roughly speaking, we only require that a solution computed on S is a (1 + ) approximation to the optimum solution of A. In this case, we say that S is a weak coreset, though we emphasize that there exist no unifying definition on the guarantees of a weak coreset and there are many variations encountered in literature.
We note that while S often is a subset of A, it does not necessarily have to be and we will see examples for this. In most cases the universe U is a Euclidean space, where the We will frequently use a standard result on ball-covers for Euclidean spaces. An -ball-cover of the unit sphere is a set of points B, such that for any point p in the unit sphere the distance to any point of B is at most . We note that while this bound is sharp, there exists no known efficient method of constructing such a ball-cover. In most cases, we require only the existence of a small ballcover for the analysis. If an algorithm uses an actual ballcover, there are multiple options to construct one using −O(d) points, see Chazelle for an extensive overview [26].

Construction Techniques
We have identified four main approaches used to obtain coresets. This is not meant to be an absolute classification of algorithms, but rather an overview of what techniques are currently available and an illustration of how we may apply them.

Geometric Decompositions
Assume that we are given a point set A in Euclidean space. A simple way to reduce the size of A is to compute a discretization S of the space, snap each point of A to its nearest neighbor in S, and use the (weighted) set S to approximate the target function. Many of these decompositions are based on packing arguments and range spaces. We will illustrate one approach via the k-means problem, based on the papers by Har-Peled and Mazumdar [58] and Fichtenberger et al. [51].
Definition 2 (k-Means Clustering) Let A be set of n points in Euclidean space and let C be a set of points with |C| = k . The k-means objective asks for a set C minimizing We first take a closer look at the objective function for Euclidean k-means. Let OPT be the value of the optimum solution. We first compute a 10-approximation 3 C ′ to the k-means problem, which can be done in polynomial time [68].
Consider now a sequence of balls with exponentially increasing radius centered around each point of C ′ , starting at radius 1 n ⋅ OPT and ending at 10 ⋅ OPT . Our discretization will consist of a suitably scaled -ball-cover of each ball. To prove correctness, we require the following generalization of the triangle inequality.

Lemma 2 (Generalized Triangle Inequality [71])
Let a, b, c be points in Euclidean space. Then for any ∈ (0, 1) we have We now show that the cost of using the discretization is bounded.

Lemma 3
Let A be a set of points, let B i be the ball with radius r i ∶= 2 i n ⋅ ∑ x∈A ‖x‖ 2 centered at the origin and let Proof Denote by A close the points with squared Euclidean norm at most 1 n ⋅ ∑ x∈A ‖x‖ 2 and by A far the remaining points.
For the points We can reapply this analysis for each center of the 10-approximation C ′ . Observe that the cost of points A c assigned to a center c ∈ C � is ∑ x∈A c ‖x − c‖ 2 , which is equal to ∑ x∈A ‖x‖ 2 if we treat c as the origin. Combining this lemma with Lemma 2 and rescaling shows that S is a coreset. Proof For each of the k centers from the original 10-factor approximation we have a total of log 10n balls with varying radii. For each such ball of radius r, we compute an 16 ⋅ r ball-cover. For any point x ∈ A , let B(x) be the nearest point in the union of all ball-covers. By Lemma 3, we have OPT . Now consider an arbitrary set of centers C. We have where the first inequality follows from Lemma 2, the second inequality follows follows from Lemma 1, and the last inequality follows from OPT ≤ ∑ x∈A min c∈C ‖x − c‖ 2 for any set of centers C.
Rescaling by a factor of 1 / 4 gives the proof. The space bound now follows from Lemma 1 and the fact that we compute a ( ∕64)-ball-cover k ⋅ log(10n) times. □ Bibliographic Remarks Algorithms based on a discretization and then snapping points to the closest point of the discretization are particular common for extent approximation problems such as maintaining -kernels, the directional width, the diameter, and the minimum enclosing ball 3 Any constant factor would do. We only fix the constant for sake of exposition.

3
problem. In fact, the notion behind coresets was introduced in the context of these problems in the seminal paper by Agarwal et al. [4]. Since then, a number of papers have progressively reduced the space bound required to maintain coresets [8,23,24,99] with Arya and Chan giving an algorithm that stores O( −(d−1)∕2 ) points [14] in d-dimensional Euclidean space. We will briefly outline in Sect. 5 that this space bound is indeed optimal. The geometric approach was also popular when coresets were introduced to k-median and k-means clustering and generalizations, see for instance [44,51,52,57,58]. Due to the exponential dependency inherent in all known constructions, the focus later shifted to sampling. Nevertheless, geometric arguments for correctness proofs are necessary for almost all algorithms, and are present in all the other examples presented in this survey. For an extensive and more complete overview of purely geometric algorithms for coreset construction, we recommend the survey by Agarwal et al. [5]

Gradient Descent
A quite popular and often only implicitly used technique to build coresets is derived from convex optimization. We begin with an example where a simple application of the well-known sub-gradient method yields the desired coreset. Consider the problem of finding the smallest enclosing ball of a set of points, which is equivalent to the 1-center problem in Euclidean space.

Definition 3
Given P ⊂ ℝ d , the smallest enclosing ball problem (SEB) consist in finding a center c * ∈ ℝ d that minimizes the cost function A standard approach for minimizing the convex function f is to start at an arbitrary point c 0 and to perform iterative updates of the form c i+1 = c i − s ⋅ g(c i ) where s ∈ ℝ >0 is an appropriately chosen step size and g(c i ) ∈ f (c i ) is a subgradient of f at the current point c i . We will refer to this as the sub-gradient method. This method has been formally developed by several mathematicians in the sixties. See [83,90,92] for details and historical remarks. Now, consider the following theorem regarding the convergence behavior of the sub-gradient method. From Theorem 2 we learn, that if our function has a small Lipschitz constant and our initial starting point is not too far from the optimal solution, then the best center among a small number of iterations of the sub-gradient algorithm will have a radius that is close to the optimal radius of the smallest enclosing ball. To assess the parameters more closely we begin with finding a sub-gradient at each step. It is easy to see that for any p max ∈ argmax p∈P ‖c − p‖ attaining the maximum distance g(c) = c−p max ‖c−p max ‖ is a sub-gradient of f at c, i.e. g(c) ∈ f (c) . Also note that c − p max ≠ 0 unless P is a singleton, in which case the problem is trivial since the only input point is the optimal center. Thus, in all interesting cases, g(c) is well-defined. Also note that g(c) is by definition a normalized vector, i.e., ‖g(c)‖ = 1 which implies that f is L-Lipschitz with L = 1 , since by definition of a subgradient and applying the Cauchy-Schwarz inequality (CSI) we have This leads to the following corollary. Proof Note that if we pick an arbitrary input point as our starting point c 0 = p 0 ∈ P and another point p 1 ∈ P that maximizes the distance to p 0 then we can deduce that So, in the light of Theorem 2, R is a good approximation for determining an appropriate step size, and at the same time acts as a good parameter for the upper bound on the error. Now plugging Now we would like to argue that the set of points S, that the sub-gradient algorithm selects in each iteration as furthest point from the current center, forms a (weak) coreset for the smallest enclosing ball of P. We will need the following well-known fact for our further analysis.

Lemma 4 ([17])
Let B(c, r) be the smallest enclosing ball of a point set P ⊂ ℝ d , then any closed halfspace that contains c must also contain at least one point from P that is at distance r from c.
Using this lemma we can argue that any (1 + )-approximate center must be within √ f * distance to the optimal center.
Proof Let c − c be normal to the hyperplane H passing through c and let H − be the halfspace whose border is H containing c but not containing c . By Lemma 4 we know that there must be some p ∈ P ∩ H − such that ‖c − p‖ = r . By the law of cosines we have Taking the square root yields the claim. □ Now we are ready to prove the main result, namely that running O( 1 4 ) iterations, S is indeed a weak -coreset for the smallest enclosing ball of P.

Theorem 3 Let S be the set of points that the sub-gradient method chooses as furthest points from the current center in each of O( 1 4 ) iterations. Then S is an -coreset for the smallest enclosing ball of P.
Proof Note, that starting with the same point c 0 , the decisions made by the algorithm on input S is the same as on input P. Therefore Corollary 1 applies to both of the sets S and P. (Ties in furthest point queries are assumed to be resolved lexicographically.) Now consider the optimal center c * for P and the optimal center c S for S as well as the best solution c found by the sub-gradient method. We can run the gradient algorithm for a number of l = O( 1 4 ) iterations to get a center c which is a (1 + 2 )-approximation for both sets. By Lemma 5 we have that ‖c * −c‖ ≤ Rescaling by a factor of 1 2 √ 3 and again leveraging the triangle inequality

Bibliographic Remarks
The presented result is rather weak compared to the optimal coreset size of ⌈ 1 ⌉ , cf. [16]. However, we chose to present the method in this way since it shows that the gradient method in its basic form can already yield constant size coresets that do neither depend on the number of input points nor on their dimension. Therefore we see this as a generic method for dimensionality reduction and the design of coresets for computational problems. Putting more effort into geometrically analyzing every single step of the gradient algorithm, i.e., leveraging more problem specific structure in the proof of Nesterov's theorem (Thm. 2, cf. [83]), can lead to even better results which we want to survey here. Badoiu and Clarkson present in their short and elegant seminal work [15] that their gradient algorithm converges to within f * error in O( 1 2 ) iterations similar to Corollary 1. Their analysis is stronger in the sense that it implicitly shows that the points selected by the algorithm form an -coreset of size O( 1 2 ) . Such a result was obtained before in [17] by picking a point in each iteration which is far enough away from the center of the smallest enclosing ball of the current coreset. Taking the furthest point instead, yields the near-optimal bound of 2 from [15]. The complexity was settled with matching upper and lower bounds of ⌈ 1 ⌉ in [16]. The smallest enclosing ball problem is by far not the only one for which looking at gradient methods can help. For instance Har-Peled et al. [60] have used similar construction methods to derive coresets for support vector machine training. Clarkson [29] has generalized the method by unifying and strengthening the aforementioned results (among others) into one framework for coresets, sparse approximation and convex optimization. A probabilistic sampling argument was already used in [17] as a dimensionality reduction technique for the 1-median problem which led to the first linear time approximation algorithms for the k-median problem in Euclidean and more general metric spaces [3,69]. The method itself is closely related to stochastic sub-gradient methods where uniform sub-sampling is used for obtaining an unbiased estimator for the actual sub-gradient in each step [92]. Similar approaches are currently under study to reduce the exponential dependency on the dimension for the probabilistic smallest enclosing ball problem [48].

Random Sampling
The arguably most straightforward way to reduce the size of dataset is to simply pick as many points as permissible uniformly at random. Though it is rare for an algorithm to produce a coreset in this straightforward fashion, we will see that for outlier-resistant problems this approach can already provide reasonable results. In the following we will examine the geometric median, which is one of the few problems for which uniform sampling gives any form of guarantee. Thereafter, we will show how to obtain strong coresets using more involved sampling distributions.

The geometric median m(A) minimizes
We will show that uniform sampling is sufficient to obtain a weak-coreset guarantee, The exposition for uniform sampling follows Thorup [94] and Ackerman et al. [3].

Lemma 6 Let A be a set of points in ℝ d and let S be a uniform sample of A. Then for any
Due to the triangle inequality, each summand is in between 0 and 1. Furthermore, Using Eq. 1, we have Applying the Chernoff bound, we have □ We wish to use Lemma 6 to apply a union bound on all candidate points. In a finite metric consisting of n points, we can set |S| ≥ 144 −2 log n and are done. In the continuous setting, this is not as straightforward. We will use a suitably scaled ball-cover and argue that this already suffices to prove the claim for all points.

Theorem 4 Let A be a set of n points in ℝ d and let S be a uniform sample of A with |S| ∈ (d −2 log d ) . Let m(A) and m(S) be the geometric medians of A and S, respectively. Then with constant probability, we have
By Markov's inequality and a union bound over S, all points in S will be contained in the ball B of radius 4|S| OPT n centered around m(A) with probability at least 1 / 4. Obviously, this means that the optimal median m(S) of |S| will be contained in B. Let C be a 5 OPT n -ball-cover of B. Setting |S| ≥ k ⋅ d −2 log d for a sufficiently large absolute constant k ensures that Lemma 6 holds for all points in C with probability at least 1 / 4. Now let c ∈ C be the closest point in the ball-cover to m(S). We have Obviously, the uniform sampling approach can not yield a strong coreset even for the geometric median problem. The reason is that the cost can rely on a small constant number of input points even in one dimension. Given any input P of size n − 2 , we can place two additional points p min < min P and p max > max P arbitrarily far from the points of P without affecting the median. Clearly, any strong coreset must contain these points to preserve the cost for all possible centers, but the probability of sampling one of them is only 2 n . This implies that a strong coreset based on uniform sampling must have linear size. To overcome this limitation, Langberg and Schulman [71] introduced the notion of sensitivity for families of functions. For simplicity of presentation, we consider only the special case of the geometric median problem. For a much more general view on sensitivity sampling for numerous other functions the reader is referred to the original literature [45,71] Definition 5 (Sensitivity) We define the sensitivity of a point x ∈ P as We define the total sensitivity as S(P) = ∑ x∈P s(x).
Informally, the sensitivity of a point measures how important it is for preserving the cost over all possible solutions. Note, that from the definition follows for all x ∈ P and all centers c ∈ ℝ d . The total sensitivity will turn out to be a crucial measure for the complexity of sampling from the distribution given by the sensitivities of the input points. We begin with describing the sampling scheme. The following importance or weighted sampling approach yields an unbiased estimator for the cost of any fixed center f (P, c) = ∑ x∈P ‖x − c‖ . We sample a point x ∈ P from the distribution q(x) = s(x) S(P) and set T = ‖x−c‖ q(x) . A straightforward calculation of its expected value Next we bound its variance which is a crucial parameter in deriving concentration results for our sampling procedure.
The inequality follows from Eq. (2). □ In our next lemma we will see why the total sensitivity plays such an important role for the random sampling proportional to the sensitivities. . We bound the exponent by ] ≤ which implies our claim by taking a union bound over the set C of size |C| ≤ . □ Lemma 8 shows that the number of samples that we need, depends linearly on the total sensitivity as well as logarithmically on the number of centers for which we need a guarantee. First we want to bound the total sensitivity for the geometric median problem. If for every input point, there is some center such that the points' contribution to the cost is large, say bounded below by a constant, then we are lost since in that case the total sensitivity sums up to (n) . The next lemma shows that this cannot happen. Actually it turns out that the total sensitivity can be bounded by a constant.
Proof Let c * be t he optimal center and let = ∑ x∈P ‖x − c * ‖ > 0 denote the optimal cost. Let c ∈ ℝ d be any center, let = ∑ x∈P ‖x − c‖ ≥ denote its cost and let = ‖c − c * ‖ be its distance to the optimal center. Now consider a ball B centered at c * with radius 2 n . By the Markov inequality we have that the number of points inside the ball is at least |P ∩ B| ≥ n 2 . Therefore we have ≥ n 2 max(0, − 2 n ) . Combining both lower bounds we G, restricted to ≥ 2 n is clearly monotonously decreasing in and is thus maximized at G( 2 n ) = 4 n + 2‖x−c * ‖ . We can conclude that S(P) is bounded by Now we know that the total sensitivity contributes only a constant factor to our sampling complexity. Before we move to the main result of this section, we will need another technical lemma. It is not immediately clear that we can use the triangle inequality for our samples, due to reweighting. We will therefore establish a relaxation of the triangle inequality for the entire weighted subsample.

Lemma 10
Let > 0. Let R be a random sample of size m ≥ 18 2 ln( 2 ) drawn i.i.d. from P proportional to the distribution q and reweighted by w(x) = 1 mq(x) . Then with probability at least 1 − for any choice of two centers c, c � ∈ ℝ d we have Proof The tricky part is to bound the total weight of the sample 1 m ∑ x∈R 1 q(x) . In the light of Lemmas 7 and 8, we define the random variables T i = 1 q(x) . Their expectation is To bound their variance, we first need an upper bound on 1 s(x) . To this end, consider a ball B ⊃ P containing all input points, centered at the optimal center c * . Let be the diameter of B and let b ∈ B be the closest point to c located on the surface of B. Note that ∀p ∈ P ∶ ‖p − b‖ ≤ . Then

Now, [T i ] ≤ (S − 1) [T i ] follows via very similar calculations as in Lemma 7. Inequality (2) is simply replaced by Inequality (3) in the derivation.
Following the arguments from Lemma 8 closely, we can conclude that with high probability � 1 Back to the main claims, the standard triangle inequality yields T h e s e c o n d c l a i m fo l l ows s i m i l a rly u s i n g ‖x − c‖ ≥ ‖c − c � ‖ − ‖x − c � ‖ in the numerator. □ It remains to bound the number of centers for which we will need a guarantee. This seems impossible at first glance, since the strong coreset property asks to hold for every center c ∈ ℝ d which has infinite cardinality. In the following we will see, that again, it is sufficient to consider a ball of appropriate radius centered at the optimum and decompose it by an -ball-cover. Lemma 8 is applied to these points only. Now, for every center inside the ball, there is a point of the cover close to it for which the coreset guarantee holds. For the other centers we can argue that they are so far from the optimum center that their distance dominates the cost for both, the original point set as well as for the coreset. This will establish the coreset property for every possible center. . Then R is a strong coreset for the geometric median of P with probability at least 1 − .
Proof Let c * be the optimal center for P and let OPT = cost(P, c * ) denote the optimal cost. Consider the closed ball B of radius r = OPT n centered at c * . Let C be the set of center points of a OPT n -ball-cover of B. For technical reasons we add c * to C. Recall from Lemma 1 that . We apply Lemma 8 to C with m = 18 2 ln( 2 ) = ( d 2 ln 1 ) . Thus we can assume that with probability 1 − for all c ∈ C simultaneously and in particular for c * we have |cost(R,c) − cost(P,c)| ≤ cost(P,c) . Also we have that the two triangle inequalities from Lemma 10 hold with probability 1 − . Now let c ∈ ℝ d be any center. We distinguish between the cases c ∈ B and c ∉ B . In the former case let c ∈ C be the closest center to c in our cover, i.e., ‖c −c‖ ≤ OPT n . Using Lemma 10 we have The lower bound of (1 − 4 ) cost(P, c) can be derived similarly and is omitted for brevity of presentation. We are left to deal with the case c ∉ B . The inequality to keep in mind is ‖c − c * ‖ > OPT n or, equivalently cost(P, c * ) = OPT < n‖c − c * ‖ . From and it follows that cost(R,c) cost(P,c) ≤ 1+3 1− ≤ 1 + 8 .
The lower bound of 1 − 3 can be verified very similarly and is again omitted from the presentation. Rescaling and by absolute constants concludes the proof. □

Bibliographic Remarks
The arguably first paper to use random sampling for coreset construction is due to Chen [27]. Like the result on k-means described in Sect. 3.1, he considered balls of exponential increasing radii. Instead of using a ball-cover, he showed that a uniform random sample suffices to approximate the cost and thereby gave the first coresets for k-means and k-median with polynomial dependency on k, d, , and log n . The sensitivity sampling approach was introduced by Langberg and Schulman [71] and further improved by Feldman and Langberg [45] and Braverman, Feldman and Lang [22].
The technique is now one of the crown jewels of coreset construction, giving the best currently available parameters for a wide range of problems including p regression, k-means and k-median clustering, and low rank subspace approximation. It has defined a unified framework for several importance sampling approaches from the early literature of sampling based sublinear approximation algorithms and coresets. One of the first such works is due to Clarkson [28] on 1 regression. After a series of studies on sampling based randomized linear algebra [36][37][38], the approach could also be adapted for 2 regression [40,41] and was later generalized in [34] to p regression for p ∈ [1, ∞) . In accordance with the statistical meaning of the sampling weights, they were later called (statistical) leverage scores. It was an open problem proposed in [40] to even approximate the 2 leverage scores in less time than needed to solve the regression problem exactly. This problem was resolved in [39] via the oblivious random projection approach we are going to detail in the next section. Low rank subspace approximation was often treated implicitly using the same approaches leading to weak coresets, cf. [91] for a technical review. The first strong coresets of polynomial size for p subspace approximation are due to Feldman et al. [47] again achieved via weighted sampling techniques related to the notion of sensitivity. More recently, the sampling based methods from randomized linear algebra [41] were leveraged to develop coresets in [79] for dependency networks [61] and Poisson dependency networks [55].

Sketches and Projections
It is often useful to view a dataset as a matrix, where the ith row corresponds to the ith point. In the following, let us assume that we have n points in d-dimensional Euclidean space, i.e. we are given a matrix A ∈ ℝ n×d .
A sketch of A is a linear projection obtained by multiplying A with a sketching matrix S ∈ ℝ m×n for some m ≪ n . Our goal is to design a sketching matrix such that SA retains the key properties of A. Many of the previous coreset constructions can also be viewed in terms of sketching matrices. For instance, a sampling algorithm can be viewed as choosing a diagonal matrix S ∈ ℝ n×n where the diagonal entries are 1 (or some weight) if the point was picked and 0 otherwise in which case the row can as well be deleted.
We are more interested in oblivious sketching matrices, that is the sketching matrix S can be constructed ahead of time without viewing the data. Though it might seem surprising that this yields any results, sketching is now regarded as one of the central tools for streaming. We will illustrate the power of oblivious sketching in the context of subspace embedding and then show how it may be applied to linear regression.

Definition 6
Let U ∈ ℝ n×d be a matrix with orthogonal unit columns and let > 0 . An -subspace embedding of U is a matrix SU ∈ ℝ m×d for m ≪ n such that for any vector x ∈ ℝ d , we have The central ingredient is based around the seminal Johnson-Lindenstrauss lemma [64].

Lemma 11 (Distributional Johnson-Lindenstrauss Lemma) There exists a distribution D over m × n matrices with m ∈ O( −2 log(1∕ )) such that for a matrix drawn from D and any fixed vector x ∈ ℝ n we have
The classic variant of the distributional Johnson-Lindenstrauss Lemma further states that an adequate distribution independently draws each entry of the matrix as a Gaussian with mean 0 and variance 1. Any distribution with mean 0 and variance 1 may be used, and nowadays, it is more common to draw entries from the Rademacher distribution, where each entry is with probability 1 / 2 either 1 or −1 . We will briefly discuss advantages of various sketching matrices at the end of this section.
Lemma 11 now gives us a powerful dimension reduction tool for Euclidean spaces. Consider, for instance, the case where we are given points in an arbitrary number of dimensions. Each distance between two points can be represented by a vector and there are at most 2 such vectors. Using Lemma 11, we can achieve an (1 ± )-approximate embedding into O( −2 log ) dimensions with probability at least 2 / 3 by setting m ≥ c ⋅ −2 log( 2 ) , where c is some absolute constant.
For subspaces, the application of the union bound is not as straightforward as there are infinitely many vectors, even in a 1-dimensional subspace. We first observe that any vector x may be rewritten as ‖x‖ ⋅ x ‖x‖ and hence it is sufficient to consider vectors with unit norm. This solves the subspace approximation problem in a single dimension, but even in 2 dimensions we have infinitely many unitary vectors. Instead, we show that applying the union bound on ball-covers is sufficient. Recall that there exists an -ball-cover of size (1 + 2∕ ) d (c.f. Lemma 1). Applying the union bound now gives us m ≥ . A slightly more detailed analysis will allow us to remove the log −1 factor. The main argument is that we can write every vector as linear combination of the vectors of a 1 / 2-cover. To see this, consider a 1 / 2-cover {x 0 , … x 5 d } and an arbitrary unit vector y. There exists some x i such that ‖y − x i ‖ ≤ 1∕2 . We then consider the vector ‖y − x i ‖x j closest to (y − x i ) .
In each subsequent step, we halve the length of the next vector added and by iterating this sequence into infinity, the length of the remaining vector lim y − ∑ i x i is 0. Hence, there exists a subspace approximation of size O(d∕ 2 ) . Now let us consider linear regression, where we aim to minimize ‖Ax − b‖ with A ∈ ℝ n×d and b ∈ ℝ n . Since all possible vectors Ax − b lie in a (d + 1)-dimensional subspace spanned by the columns of A and b, an oblivious subspace embedding is a coreset for linear regression.

Theorem 6
Let A ∈ ℝ n×d and b ∈ ℝ n . Choose S ∈ ℝ m×n to be a Rademacher matrix, where m ∈ O(d −2 ). Then with constant probability, for all x ∈ ℝ d , we have Bibliographic Remarks Oblivious subspace embeddings were first introduced by Sarlos [91] both as a means for solving regression and as means for computing low rank approximations. The upper bound of O(d∕ 2 ) on the target dimension for oblivious sketches was (implicitly) given in the work by Clarkson and Woodruff [31], and this bound was later proved to be optimal by Nelson and Nguyen [82], though even smaller bounds are possible if the sketch is not oblivious, see for instance [33,49]. It is worth noting that if we are only interested in approximating the optimum using the sketch for linear regression, somewhat akin to a weak coreset guarantee, a target dimension of O(d∕ ) is sufficient and necessary [31].
Projections and sketches also play an important role for coresets in Euclidean k-means [33,49]. Cohen et al. [33] showed that an oblivious sketch of target dimension O(k∕ 2 ) is cost-preserving. As a corollary, this implies that for any coreset construction, the dependency on d may be replaced by a dependency on k∕ 2 .
The computation time for all sketching approaches can be generally regarded as acceptable. The smallest target dimension is achievable by computing the SVD or a sufficiently good low rank approximation [33,49]. While this can be done in polynomial time, it is more expensive than the oblivious sketching methods we describe in the following. Conceptually, there are two basic approaches to multiply sketching matrices faster. The first one is to improve the sketching dimension. This, however, is known to be impossible for various regimes, see [12,63,65,72] and very recently for any embedding method by Larsen and Nelson [73]. The other direction is to make the sketching matrix sparse.
Sparse matrices tend to distort sparse vectors. Clearly, the Johnson-Lindenstrauss guarantee cannot hold in such cases. To remedy this problem, Ailon and Chazelle [9] proposed to first increase the density of the input vectors by preforming a randomized Fourier Transform before multiplying with a very sparse embedding matrix. This approach, called the Fast Johnson Lindenstrauss Transform, was later improved and refined, see for instance [10,11,21,95] and references therein.
The second approach for sparse Johnson Lindenstrauss transforms is based around sophisticated ways of combining Count Sketch estimators. The Count Sketch estimator [25] originally proposed for finding heavy hitters is essentially a single row of a Rademacher matrix. Instead of naively repeating the Count Sketch and averaging the results as in a standard Rademacher matrix, we aim to partition the entries of the vectors that are to be embedded, apply a Count Sketch for each partition and thereafter aggregate the results. The degree of sparsity that may be achieved using various partitioning schemes have been studied in [2,32,35,66,81]. We want to highlight the work by Clarkson and Woodruff [32] who can achieve an embedding in essentially input sparsity time, however at the cost of slightly larger target dimension d 2 2 . The squared dependency on d was also shown to be necessary by Nelson and Nguyen [80].
Another related direction is generalizing to p subspace embeddings. A first step was done by Woodruff and Sohler [93] who designed the first subspace embedding for 1 via Cauchy random variables. The method is in principle generalizable to using p-stable distributions and was improved in [30,77]. The idea is that the sum of such random variables forms again a random variable from the same type of distribution leading to concentration results for the p norm under study. At the same time it is inherently limited to 1 ≤ p ≤ 2 as in [77], since no such distributions exist for p > 2 , cf. [96]. The first attempts to generalize to p > 2 [30] had nearly linear size, namely n∕poly(d) , which clearly was not satisfying. A remedy came with a manuscript of Andoni [13], who discovered the max stability of inverse exponential random variables as a means to embed p , p > 2 with little distortion into ∞ . Combining this with the work on 2 embeddings of Clarkson and Woodruff [32] culminated in oblivious subspace embeddings for all p (into 2 resp. ∞ ) and only poly(d) distortion [96]. Note, that the embedding dimension for p > 2 is n 1−2∕p poly(d) which improved upon the previous n∕poly(d) and is close to optimal given the lower bound of (n 1−2∕p ) [86]. The desirable (1 ± ) distortion can be achieved using the embeddings for preconditioning and sampling proportional to the p leverage scores [30,34,96].
Current research has moved beyond strict algorithmic and optimization related characteristics to the study of statistical properties [76,88]. In particular, not only maximum likelihood estimators are approximated under random projections. Geppert et al. [54] showed that in important classes of Bayesian regression models, the whole structure of the posterior distribution is preserved. This yields much faster algorithms for the widely applicable and flexible, but at the same time computationally demanding Bayesian machinery.
Recently a series of optimization software based on random projections and sketches have appeared. A parallel least squares regression solver LSRN was developed in [78,97]. An implementation of some of the presented sketching techniques named RaProR was made available for the statistics programming language R [53,54,87].

Streaming
Streaming algorithms process the data point by point (or entry by entry) and aim to (approximately) answer queries to the data using as little space as possible. Though we describe the coreset constructions in the last chapters with no streaming implementation in mind, it turns out that if we can find a strong coreset for a problem, there also exists a streaming algorithm with little overhead in terms of space requirement. Once a coreset construction is known for a problem, it is often not necessary to develop a specific streaming algorithm, since one can rely on the following black box reduction: For most functions 4 , coresets have the very useful property of being closed under union, that is, for two point sets, the union of coresets for both point sets is a coreset for the entire point set. To get an idea of how the reduction works, assume that we partition the input sequence into n∕ log n batches of size log n . These batches form the leaves of a binary tree of height h ≤ log n . Whenever we process an entire batch, we compute a coreset. Whenever we have computed a coreset for two children of a node, we aggregate them by recomputing a coreset of the union of the children 1 3 and storing it in the parent node. The children can be deleted at this point. Thus, we only have to store at most two coresets at each level bounding the number of coresets in memory to O(log n).
So, this framework comes not for free, but its blow up remains bounded. If we apply the merging step as a black box, the coreset contained at the root node, i.e. the final output of the algorithm will have an approximation guarantee of (1 + ) log n . Rescaling by 2 log n , we have the desired 1 + 2 log n log n ≤ exp 2 ≤ (1 + ) approximation ratio. Finally, many known constructions require randomization and have some adjustable failure probability . To limit the overall failure probability when processing a stream, is rescaled by the number of coreset constructions. Since the space dependency on is typically log 1 , we incur another factor of O(log n) . In total, the space dependency on log n is increased by a factor of log c+p n , where c is the exponent of in the offline coreset construction and p = 1 if the construction is deterministic and p = 2 if it is randomized.
Heinrich et al. have extended this streaming construction of coresets to an asymptotic error guarantee of → 0 as n → ∞ while the memory remains bounded by polylog(n) . This has led to the notion of asymptotically exact streaming algorithms [62].
The framework, called merge and reduce, was originally introduced by Bentley and Saxe [19] and first applied to streaming by Agarwal, Har-Peled and Varadarajan in their seminal paper [4]. Nowadays, many papers have given more efficient streaming implementations for their coresets. For extent approximation algorithms and k-center clustering, we now have constructions with no dependency on n, see [8,23,24,99], with the currently best algorithms storing O( −(d−1)∕2 ) points [14] for -kernels and O(k −d ) points for k-center [98]. The dependency on the dimension d is exponential for all these algorithms and Agarwal and Sharathkumar [7] showed that no algorithm with polynomial dependency on d can exist, see also Sect. 5, unless one is willing to drop the (1 + ) guarantee for a weaker fixed constant.
k-median and k-means clustering thus far are more reliant on the merge and reduce technique. Certain geometric decompositions avoid this, see [51,52], but have a much larger offline space dependency compared to other constructions. See Table 1 for an overview.
The oblivious sketching algorithms avoid the merge and reduce framework entirely and immediately translate to the streaming algorithm. In fact, they can operate even when processing deletions and entry-wise modifications to an input matrix. Li, Nelson and Woodruff [74] showed that that essentially any such streaming algorithm may be reformulated in terms of linear sketches.

Lower Bounds
Lower bounds for coresets come in two flavors: (1) space complexity in terms of points or bits and (2) impossibility results. We will sketch examples for both. First, let us consider the extent approximation problem.

Definition 7 Let
A be a set of points in ℝ d and let > 0 . An extent approximation is a subset S of A such that for any unit vector w, we have Arya and Chan gave an algorithm storing O( −(d−1)∕2 ) points. We will briefly outline why this is indeed optimal (see Agarwal et al. [5] for details). Consider the unit sphere and the spherical cap with angular radius √ . The height of this cap is 1 − cos( √ ) ≤ 2 . Thus, the extent approximation must contain at least one point from every cap. The radius of each such cap is sin( √ ) = ( √ ) . We know that the bound of Lemma 1 is asymptotically tight, i.e., we require ( √ −d ) space for a √ -ball-cover in d-dimensional space. Combining this with the fact that the unit sphere is a d − 1-dimensional space, we know that ( √ −(d−1) ) = ( −(d−1)∕2 ) points are necessary.
Note, that we assume that the label y i ∈ {−1, 1} of point x i is already folded into x i . Let us first recall the reduction of streaming algorithms to the construction of strong coresets from the last section. Taking it the opposite direction, a lower bound for streaming algorithms gives us a lower bound for strong coresets. Thus, a wide variety of tools from communication complexity becomes available to us. A complete review of all techniques is well out of scope, for further reading we refer to the book by Kushilevitz and Nisan [70]. We will focus on the following communication problem.
The indexing problem is a two party communication game, where the first player Alice has a binary bit string x ∈ {0, 1} n and the second player Bob has an index k ∈ {1, … , n} . Alice is allowed to send one message to Bob, whereupon Bob has to output the kth bit. The number of bits of the transmitted message required by any randomized protocol succeeding with probability at least 2 / 3 over the random choices of the players is in (n) , see [1].
We will use the indexing problem to show that no strong coresets for logistic regression exist.
We first show a reduction to the convex hull membership problem. Here, the stream consist of sequence of points A and at any given time we want to be able to answer whether a given point x lies in the convex hull C(A) or not.

Lemma 12
Let P be a set of n points in ℝ 2 . Let A be a subset of P arriving one after the other in a stream. Then any single pass randomized algorithm deciding with probability 2 / 3 whether some point b ∈ P lies in C(A) requires at least (n) space.
Proof Let x ∈ {0, 1} n be Alice's bit string and let k be Bob's index. For each i ∈ {1, … , n} , define the point p i = (sin(i∕n), cos(i∕n)) . By construction, all points lie on the unit sphere and therefore p k is in the convex hull of any point set ⋃ i∈I p i with I ⊆ {1, … , n} if and only if k ∈ I . For each entry x i = 1 , Alice constructs p i . She then runs a streaming algorithm on all generated points and sends the memory of the streaming algorithm to Bob. Bob then checks whether p k is in the convex hull generated by Alice. Since this solves the indexing problem, the communication complexity of indexing is a lower bound to the space complexity of convex hull membership. □ Corollary 2 Let A and B be two sets of a total n points in 2-dimensional space arriving in a data stream. Then any single pass randomized algorithm deciding with ∑ x i ∈A ln 1 + exp(−w T x i ) . probability 2 / 3 whether A and B are linearly separable requires at least (n) space.
We now return to logistic regression. We have the following theorem.
Theorem 7 For any > 0 and any integer n there exists a set of n points C = A ∪ B, such that any strong -coreset of C for logistic regression must consist of (n 1− ) points.
Proof Let A and B be linearly separable, we have at least one misclassified point. The cost of this point is lower bounded by ln(1 + exp(0)) = ln (2) . Otherwise, let w be a separating hyperplane. Then lim ‖w‖→∞ ∑ x i ∈A ln(1 + exp(−w T x i )) = 0 . Given a single pass randomized algorithm for logistic regression we can distinguish between these two cases. Corollary 2 implies that the space complexity must therefore be (n) . Since the merge and reduce framework incurs a polylog(n) blowup, this implies a lower bound of (n∕polylog(n)) ⊂ (n 1− ) for the space complexity of strong coresets for logistic regression. □ A very similar impossibility result was derived recently for Poisson regression by reduction from communication complexity problems [79]. The paper discusses and demonstrates how coresets can be useful in practice anyway, going beyond the worst-case perspective.

Conclusion and Open Problems
We have outlined techniques and limitations of coreset construction methods. To summarize their benefits in a short statement: Coresets are arguably the state of the art technique to turn 5 "Big Data into tiny data!" Their design and analysis should be considered whenever a new statistical model or learning task is designed. They make the algorithmic assessment more efficient saving time, space, communication, and energy, to tackle the most common resource restrictions associated with Big Data.
There are further lines of research employing coresets, concerning topics like privacy issues [43] or distributed computation [18,67,97], that we have not covered here, but encourage the interested reader to investigate. In the following, we would like to conclude with three important open problems.

Problem 1 Let
A be a set of n points in d-dimensional Euclidean space. Is it possible to deterministically compute a coreset S for the k-median or k-means objective in polynomial time, such that |S| ∈ poly(k, −1 , d, log n)?
All known coreset constructions for k-median and k-means clustering with polynomial dependency on k, −1 , d, and log n are randomized. The best known deterministic constructions are either exponential in d, or exponential in k and −1 . Since we know that small coresets exist, it is possible to compute them by brute force enumeration, which is deterministic but also clearly infeasible.

Problem 2 Let
A be a set of n points in d-dimensional Euclidean space. Is it possible to compute a coreset S for the geometric median objective such that |S| is independent of d and n?
It is a simple exercise to show that coresets with no dependency on n, d, or for that matter exist for the centroid or mean of a point set. Indeed, there exist coresets for the more general Euclidean k-means problem of size poly(k∕ ) [33,49]. The algorithms and proofs are heavily reliant on connections to linear algebra and in particular the fact that Euclidean k-means is a constrained form of lowrank approximation with respect to the Frobenius norm. The unconstrained low-rank approximation of a matrix can be determined via singular value decomposition (SVD). Computational and mathematical aspects of the SVD are well understood. In contrast, far less is known by the k-median analogue of a low-rank approximation and its computation is invariably harder than the SVD. Currently, we neither know of a lower bound stating that a dependency of d is required for a geometric median coreset, nor of a coreset construction consisting of even exp( −1 ) many points. It is known that for the minimum enclosing ball (i.e. 1-center clustering), an exponential dependency on d is necessary.

Problem 3 Let
A be a subset of points of some universe U and let C be a set of candidate solutions. Let f ∶ U × C → ℝ ≥0 be a non-negative measurable function. Let S be the smallest possible -coreset with respect to A and f. Is it possible to always compute a -coreset S ′ , such that |S ′ | ≤ ⋅ |S| for some approximation factor ?
Though we have algorithms that are optimal in the worst case for some problems, the worst case complexity may nevertheless be unappealing, such as is the case for extent problems. It would be nice to have algorithms with instance optimal running times and output sizes. To be more specific, consider the minimum enclosing ball problem. Is it possible to compute a coreset S ′ in time O(|A| + ⋅ |S|) such that |S ′ | ≤ ⋅ |S| , where S is the optimal coreset and a small (ideally constant) factor?