Group-invariant max filtering

Given a real inner product space $V$ and a group $G$ of linear isometries, we construct a family of $G$-invariant real-valued functions on $V$ that we call max filters. In the case where $V=\mathbb{R}^d$ and $G$ is finite, a suitable max filter bank separates orbits, and is even bilipschitz in the quotient metric. In the case where $V=L^2(\mathbb{R}^d)$ and $G$ is the group of translation operators, a max filter exhibits stability to diffeomorphic distortion like that of the scattering transform introduced by Mallat. We establish that max filters are well suited for various classification tasks, both in theory and in practice.


Introduction
Modern machine learning has been extraordinarily successful in domains where large volumes of labeled data are available [48,65]. Indeed, highly expressive models can generalize once they fit an appropriately large training set. Unfortunately, many important domains are plagued by a scarcity of data or by expensive labels (or both). One way to bridge this gap is by augmenting the given dataset with the help of a large family of innocuous distortions. In many cases, the distortions correspond to the action of a group, meaning the ground truth exhibits known symmetries. Augmenting the training set by applying the group action encourages the model to learn these symmetries. While this approach has been successful [66,28,48,26], it is extremely inefficient to train a large, symmetry-agnostic model to find a highly symmetric function. One wonders: Why not use a model that already accounts for known symmetries?
This motivates invariant machine learning (e.g., [76,78,41,69,6]), where the model is invariant to underlying symmetries in the data. To illustrate, suppose an object is represented by a point $x$ in a set $V$, but there is a group $G$ acting on $V$ such that the same object is also represented by $gx \in V$ for every $g \in G$. This ambiguity emerges, for example, when using a matrix to represent a point cloud or a graph, since the representation depends on the labeling of the points or vertices. If we apply a $G$-invariant feature map $\Phi : V \to F$, then the learning task can be performed in the feature domain $F$ without having to worry about symmetries in the problem. Furthermore, if $\Phi$ separates the $G$-orbits in $V$, then no information is lost by passing to the feature domain. In practice, $V$ and $F$ tend to be vector spaces out of convenience, and $G$ is frequently a linear group.
While our interest in invariants stems from modern machine learning, maps like $\Phi$ have been studied since Cayley established invariant theory in the nineteenth century [25]. Here, we take $V = \mathbb{C}^d$ and $G \leq \mathrm{GL}(V)$, and the maps of interest consist of the $G$-invariant polynomials $\mathbb{C}[V]^G$. In 1890, Hilbert [42] proved that $\mathbb{C}[V]^G$ is finitely generated as a $\mathbb{C}$-algebra in the special case where $G$ is the image of a representation of $\mathrm{SL}_k(\mathbb{C})$, meaning one may take the feature domain $F$ to be finite dimensional (one dimension for each generator). Since $G$ is not a compact subset of $\mathbb{C}^{d \times d}$ in such cases, there may exist distinct $G$-orbits whose closures intersect, meaning no continuous $G$-invariant function can separate them; this subtlety plays an important role in Mumford's more general geometric invariant theory [55]. In general, the generating set of $\mathbb{C}[V]^G$ is often extraordinarily large [36], making it impractical for machine learning applications. To alleviate this issue, there has been some work to construct separating sets of polynomials [32,33,34], i.e., sets that separate as well as $\mathbb{C}[V]^G$ does without necessarily generating all of $\mathbb{C}[V]^G$. For every reductive group $G$, there exists a separating set of $2d+1$ invariant polynomials [36,19], but the complexity of evaluating these polynomials is still quite large. Furthermore, these polynomials tend to have high degree, and so they are numerically unstable in practice. One also desires a quantitative notion of separating so that distant orbits are not sent to nearby points in the feature space, and this behavior is not always afforded by a separating set of polynomials [19]. Despite these shortcomings, polynomial invariants are popular in the data science literature due in part to their rich algebraic theory, e.g., [8,58,19,11,6].
In this paper, we focus on the case where $V$ is a real inner product space and $G$ is a group of linear isometries of $V$. We introduce a family of non-polynomial invariants that we call max filters. In Section 2, we define max filters, we identify some basic properties, and we highlight a few familiar examples. In Section 3, we use ideas from [37] to establish that $2d$ generic max filters separate all $G$-orbits when $G$ is finite (see Corollary 13), and then we describe various settings in which max filtering is computationally efficient. In Section 4, we show that when $G$ is finite, a sufficiently large random max filter bank is bilipschitz with high probability; see Theorem 18. This is the first known construction of invariant maps for a general class of groups that enjoy a lower Lipschitz bound, meaning they separate orbits in a quantitative sense. In the same section, we later show that when $V = L^2(\mathbb{R}^d)$ and $G$ is the group of translations, certain max filters exhibit stability to diffeomorphic distortion akin to what Mallat established for his scattering transform in [52]; see Theorem 22. In Section 5, we explain how to select max filters for classification in a couple of different settings, we determine the subgradient of max filters to enable training, and we characterize how random max filters behave for the symmetric group. In Section 6, we use max filtering to process real-world datasets. Specifically, we visualize the shape space of voting districts, we use electrocardiogram data to classify whether patients had a heart attack, and we classify a multitude of textures. Surprisingly, we find that even in cases where the data do not appear to exhibit symmetries in a group $G$, max filtering with respect to $G$ can still reveal salient features. We conclude in Section 7 with a discussion of opportunities for follow-on work.

Preliminaries
Given a real inner product space $V$ and a group $G$ of linear isometries of $V$, consider the quotient space $V/G$ consisting of the $G$-orbits $[x] := G \cdot x$ for $x \in V$. This quotient space is equipped with the metric
$$d([x],[y]) := \inf_{p \in [x],\, q \in [y]} \|p - q\|.$$
(Indeed, $d$ satisfies the triangle inequality since $G$ is a group of isometries of $V$.) This paper is concerned with the following function.
Definition 1. The max filtering map $\langle\langle \cdot, \cdot \rangle\rangle : V/G \times V/G \to \mathbb{R}$ is defined by
$$\langle\langle [x],[y] \rangle\rangle := \sup_{p \in [x],\, q \in [y]} \langle p, q \rangle.$$
Sometimes, "max filtering map" refers to the related function $V \times V \to \mathbb{R}$ defined by $(x,y) \mapsto \langle\langle [x],[y] \rangle\rangle$. The intended domain should be clear from context.
We note that since $G$ consists of linear isometries, it is closed under adjoints, and so the max filtering map can be alternatively expressed as
$$\langle\langle [x],[y] \rangle\rangle = \sup_{g \in G} \langle x, gy \rangle.$$
The max filtering map satisfies several important properties, summarized below.
Lemma 2. Suppose $x, y, z \in V$. Then each of the following holds:
(a) $\langle\langle [x],[y] \rangle\rangle = \langle\langle [y],[x] \rangle\rangle$,
(b) $\langle\langle [x],[y] \rangle\rangle \leq \|x\| \|y\|$,
(c) $\langle\langle [x],[ry] \rangle\rangle = r \langle\langle [x],[y] \rangle\rangle$ for every $r \geq 0$,
(d) $\langle\langle [x],[\cdot] \rangle\rangle : V \to \mathbb{R}$ is convex,
(e) $|\langle\langle [x],[y] \rangle\rangle - \langle\langle [x],[z] \rangle\rangle| \leq \|x\| \cdot d([y],[z])$,
(f) $\langle\langle [x],[y] \rangle\rangle = \frac{1}{2}\big(\|x\|^2 + \|y\|^2 - d([x],[y])^2\big)$.

Proof. First, (a) and (b) are immediate, as is the $r = 0$ case of (c). For $r > 0$, observe that $q \in [ry]$ precisely when $q' := r^{-1} q \in [y]$, since each member of $G$ is a linear isometry. Thus,
$$\langle\langle [x],[ry] \rangle\rangle = \sup_{q \in [ry]} \langle x, q \rangle = \sup_{q' \in [y]} \langle x, r q' \rangle = r \langle\langle [x],[y] \rangle\rangle.$$
Next, (d) follows from the identity $\langle\langle [x],[y] \rangle\rangle = \sup_{p \in [x]} \langle p, y \rangle$, which expresses $\langle\langle [x],[\cdot] \rangle\rangle$ as a pointwise supremum of convex functions. For (e), we apply (d) and (c). Select any $p \in [y]$, $q \in [z]$, and $\epsilon > 0$, and take $g \in G$ such that $\langle x, gp \rangle > \langle\langle [x],[y] \rangle\rangle - \epsilon$. Then
$$\langle\langle [x],[z] \rangle\rangle \geq \langle x, gq \rangle = \langle x, gp \rangle - \langle x, g(p-q) \rangle > \langle\langle [x],[y] \rangle\rangle - \epsilon - \|x\| \|p - q\|.$$
Since $p$, $q$, and $\epsilon$ were arbitrary, the result follows. Finally, (f) follows by expanding
$$d([x],[y])^2 = \inf_{g \in G} \|x - gy\|^2 = \|x\|^2 + \|y\|^2 - 2 \sup_{g \in G} \langle x, gy \rangle. \qquad \square$$
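To make these properties concrete, the following brute-force sketch (Python with NumPy; the helper names `max_filter` and `quotient_dist` are ours, not the paper's) checks symmetry, positive homogeneity, and the relation to the quotient metric for the permutation group acting on $\mathbb{R}^3$:

```python
import itertools
import numpy as np

def max_filter(x, y, group):
    """Brute-force <<[x],[y]>> = max_{g in G} <x, g y> over a finite matrix group."""
    return max(float(x @ (g @ y)) for g in group)

def quotient_dist(x, y, group):
    """Quotient metric d([x],[y]) = min_{g in G} ||x - g y||."""
    return min(float(np.linalg.norm(x - g @ y)) for g in group)

# G = S_3 acting on R^3 by permutation matrices
group = [np.eye(3)[list(p)] for p in itertools.permutations(range(3))]
rng = np.random.default_rng(0)
x, y = rng.standard_normal(3), rng.standard_normal(3)

mf = max_filter(x, y, group)
# symmetry, positive homogeneity, and the relation to the quotient metric
assert np.isclose(mf, max_filter(y, x, group))
assert np.isclose(max_filter(x, 2.5 * y, group), 2.5 * mf)
d = quotient_dist(x, y, group)
assert np.isclose(mf, 0.5 * (x @ x + y @ y - d**2))
```

The last assertion is exactly the expansion of $\min_g \|x - gy\|^2$ used in the proof above.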
Definition 3. Given a template $z \in V$, we refer to $\langle\langle [z], \cdot \rangle\rangle : V/G \to \mathbb{R}$ as the corresponding max filter. Given a (possibly infinite) sequence $\{z_i\}_{i \in I}$ of templates in $V$, the corresponding max filter bank is $\Phi : V/G \to \mathbb{R}^I$ defined by $(\Phi([x]))_i := \langle\langle [z_i],[x] \rangle\rangle$. In what follows, we identify a few familiar examples of max filters.

Example 4 (Norms)
There are several norms that can be thought of as a max filter with some template. For example, take $G = \mathrm{O}(d)$; then for any unit-norm template $z$, it holds that $\langle\langle [z],[x] \rangle\rangle = \|x\|_2$. Similarly, the infinity norm is obtained by taking $G$ to be the group of signed permutation matrices and $z$ to be a standard basis element, while the 1-norm comes from taking $G$ to be the group of diagonal orthogonal matrices and $z$ to be the all-ones vector. We can also recover various matrix norms when $V = \mathbb{R}^{m \times n}$. For example, taking $G \cong \mathrm{O}(m) \times \mathrm{O}(n)$ to be the group of linear operators of the form $X \mapsto U X V^\top$ with $U \in \mathrm{O}(m)$ and $V \in \mathrm{O}(n)$, max filtering with any rank-1 matrix of unit Frobenius norm gives the spectral norm.
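As a quick numerical sanity check of this example (a sketch with brute force over the relevant finite groups in dimension 3):

```python
import itertools
import numpy as np

def max_filter(z, x, group):
    """Brute-force <<[z],[x]>> = max_{g in G} <z, g x> over a finite matrix group."""
    return max(float(z @ (g @ x)) for g in group)

# Signed permutation matrices in dimension 3: 6 permutations x 8 sign patterns
perms = [np.eye(3)[list(p)] for p in itertools.permutations(range(3))]
signed_perms = [np.diag(s) @ P for P in perms
                for s in itertools.product([1.0, -1.0], repeat=3)]
# Diagonal orthogonal matrices: sign flips only
sign_flips = [np.diag(s) for s in itertools.product([1.0, -1.0], repeat=3)]

x = np.array([-0.3, 2.0, -1.1])
e1 = np.array([1.0, 0.0, 0.0])   # standard basis template -> infinity norm
ones = np.ones(3)                # all-ones template -> 1-norm

assert np.isclose(max_filter(e1, x, signed_perms), np.linalg.norm(x, np.inf))
assert np.isclose(max_filter(ones, x, sign_flips), np.linalg.norm(x, 1))
```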
Example 5 (Power spectrum). Consider the case where $V = L^2(\mathbb{R}/\mathbb{Z})$ and $G$ is the group of circular translation operators $T_a$ defined by $T_a g(t) := g(t - a)$ for $a \in \mathbb{R}$. (Here and throughout, functions in $L^2$ will be real valued by default.) Given a template $z_k$ of the form $z_k(t) := \cos(2\pi k t)$ for some $k \in \mathbb{N}$, it holds that
$$\langle\langle [z_k],[f] \rangle\rangle = \sup_{a \in \mathbb{R}} \operatorname{Re}\big(e^{-2\pi i k a}\, \overline{\hat{f}(k)}\big) = |\hat{f}(k)|,$$
so the max filter bank $\{z_k\}_{k \in \mathbb{N}}$ determines the power spectrum. A similar choice of templates recovers the power spectrum over finite abelian groups.
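A discrete analogue of this example can be checked numerically (a sketch; the setup is ours): for a signal that is itself a circular translate of the cosine template, the brute-force max filter over integer shifts attains the magnitude of the $k$th DFT coefficient.

```python
import numpy as np

n, k, b = 16, 3, 5
t = np.arange(n)
z = np.cos(2 * np.pi * k * t / n)   # cosine template at frequency k
x = np.roll(z, b)                   # a circular translate of the template

# Brute-force max filter over the group of circular translations
mf = max(float(np.roll(z, a) @ x) for a in range(n))

# The max filter recovers the magnitude of the k-th Fourier coefficient (= n/2 here)
assert np.isclose(mf, np.abs(np.fft.fft(x)[k]))
assert np.isclose(mf, n / 2)
```

For a general real signal, the maximum over integer shifts is bounded above by $|\hat{x}(k)|$; the supremum over real shifts attains it.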
Example 6 (Unitary groups). While we generally assume $V$ is a real inner product space, our theory also applies in the complex setting. For example, consider the case where $V = \mathbb{C}^n$ and $G \leq \mathrm{U}(n)$. Then $V$ is a $2n$-dimensional real inner product space with $\langle x, y \rangle := \operatorname{Re}(x^* y)$, where $x^*$ denotes the conjugate transpose of $x$. As such, $\mathrm{U}(n) \leq \mathrm{O}(V)$ since $g \in \mathrm{U}(n)$ implies $\langle gx, gy \rangle = \operatorname{Re}((gx)^*(gy)) = \operatorname{Re}(x^* g^* g y) = \operatorname{Re}(x^* y) = \langle x, y \rangle$.
Example 7 (Phase retrieval). Consider the case where $V = \mathbb{C}^n$ and $G = \{\omega \cdot \mathrm{id} : \omega \in \mathbb{C}, |\omega| = 1\}$ is the center of $\mathrm{U}(n)$. Then $\langle\langle [z],[x] \rangle\rangle = \sup_{|\omega| = 1} \operatorname{Re}(\omega \cdot x^* z) = |x^* z|$, and the inverse problem of recovering $[x]$ from the max filter bank $\{|z_i^* x|\}_{i=1}^n$ is known as complex phase retrieval [4,9,29,70]; over the last decade, several algorithms were developed to solve this inverse problem [22,31,20,72,21,27]. In the related setting where $V = \mathbb{R}^d$ and $G = \{\pm \mathrm{id}\} \leq \mathrm{O}(d)$, the analogous inverse problem is known as real phase retrieval [4].
Example 8 (Matched filtering). In classical radar, the primary task is to locate a target. Here, a transmitter emits a pulse $p \in L^2(\mathbb{R})$, which then bounces off the target and is received at the transmitter's location with a known direction of arrival. The return signal $q$ is a noisy version of $T_a p$ for some $a > 0$, where $T_a$ denotes the translation-by-$a$ operator defined by $T_a f(t) := f(t - a)$. Considering the transmitter-to-target distance is $a/2$ times the speed of light, the objective is to estimate $a$, which can be accomplished with matched filtering: simply find the $a$ for which $\langle T_a p, q \rangle$ is largest. This is essentially a max filter with $V = L^2(\mathbb{R})$ and $G$ being the group of translation operators, though for this estimation problem, the object of interest is the maximizer $a$, not the maximum value $\langle\langle [p],[q] \rangle\rangle$. Meanwhile, the maximum value is used for the detection problem of distinguishing noise from noisy versions of translates of $p$. (This accounts for half of the etymology of max filtering.)

Example 9 (Max pooling). In a convolutional neural network, it is common for a convolutional layer to be followed by a max pooling layer. Here, the convolutional layer convolves the input image with several localized templates, and then the max pooling layer downsamples each of the resulting convolutions by partitioning the scene into patches and recording the maximum value in each patch. In the extreme case where the max pooling layer takes the entire scene to be a single patch to maximize over, these layers implement a max filter bank in which $V$ is the image space and $G$ is the group of translation operators. (This accounts for the other half of the etymology of max filtering.)
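A minimal matched-filtering sketch (discrete, with an assumed Gaussian pulse and noise level; not taken from the paper) estimates the delay as the argmax of the circular cross-correlation:

```python
import numpy as np

rng = np.random.default_rng(1)
n, true_delay = 256, 40
pulse = np.exp(-0.5 * ((np.arange(n) - 20) / 4.0) ** 2)   # localized pulse
received = np.roll(pulse, true_delay) + 0.02 * rng.standard_normal(n)

# Matched filter: correlate the received signal against every translate of the
# pulse at once via the FFT; corr[a] = <received, T_a pulse>.
corr = np.real(np.fft.ifft(np.fft.fft(received) * np.conj(np.fft.fft(pulse))))
a_hat = int(np.argmax(corr))   # the maximizer estimates the delay

assert a_hat == true_delay
```

The maximum value `corr[a_hat]` is the (discrete) max filter used for the detection problem.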

The complexity of separating orbits
For practical reasons, we are interested in orbit-separating invariants $\Phi : V \to \mathbb{R}^n$ of low complexity, which we take to mean two different things simultaneously: (i) $n$ is small (i.e., the map has low sample complexity), and (ii) one may evaluate $\Phi$ efficiently (i.e., the map has low computational complexity).
While these notions of complexity are related, they impact the learning task in different ways. In what follows, we study both notions of complexity in the context of max filtering.

Generic templates separate orbits
In this subsection, we focus on the case in which $V = \mathbb{R}^d$ and $G \leq \mathrm{O}(d)$ is semialgebraic, which we define shortly. Every polynomial function $p : \mathbb{R}^n \to \mathbb{R}$ determines a basic semialgebraic set $\{x \in \mathbb{R}^n : p(x) \geq 0\}$.
By closing under finite unions, finite intersections, and complementation, the basic semialgebraic sets of $\mathbb{R}^n$ generate an algebra of sets known as the semialgebraic sets in $\mathbb{R}^n$. A semialgebraic subgroup $G \leq \mathrm{GL}(d)$ is a subgroup of $\mathrm{GL}(d)$ that is also a semialgebraic set in $\mathbb{R}^{d \times d}$, e.g., $\mathrm{O}(d)$. A semialgebraic function is a function $f : \mathbb{R}^s \to \mathbb{R}^t$ whose graph is a semialgebraic set in $\mathbb{R}^{s+t}$.

Lemma 10. For every semialgebraic subgroup $G \leq \mathrm{O}(d)$, the corresponding max filtering map $(x, y) \mapsto \langle\langle [x],[y] \rangle\rangle$ is semialgebraic.

Proof. The graph of the max filtering map can be expressed in first-order logic. (To be precise, one should replace our quantifiers over $G$ with the polynomial conditions that define the semialgebraic set $G$ to obtain a condition in first-order logic.) It follows from Proposition 2.2.4 in [12] that the graph is semialgebraic.
Every semialgebraic set $A$ can be decomposed as a disjoint union $A = \bigcup_i A_i$ of finitely many semialgebraic sets $A_i$, each of which is homeomorphic to an open hypercube $(0,1)^{d_i}$ (where $(0,1)^0$ is a point). The dimension of $A$ can be defined in terms of this decomposition as $\dim(A) := \max_i d_i$.

Definition 11. Given $k \in \{1, \ldots, d\}$, we say the max filtering map is $k$-strongly separating if for every $x, y \in \mathbb{R}^d$ with $[x] \neq [y]$, the set $\{z \in \mathbb{R}^d : \langle\langle [z],[x] \rangle\rangle = \langle\langle [z],[y] \rangle\rangle\}$ is semialgebraic with dimension at most $d - k$.

Theorem 12. Suppose $G \leq \mathrm{O}(d)$ is semialgebraic and the max filtering map is $k$-strongly separating. Then for generic $\{z_i\}_{i=1}^n$ with $n \geq 2d/k$, the max filter bank $x \mapsto \{\langle\langle [z_i],[x] \rangle\rangle\}_{i=1}^n$ separates $G$-orbits in $\mathbb{R}^d$.

In the $d$-strongly separating case where $G = \mathrm{O}(d)$, Theorem 12 implies that $n = 2$ generic templates suffice to separate orbits. (Of course, any single nonzero template suffices in this case.) Theorem 12 is an improvement to Theorem 1.9 in [37], which gives the condition $n \geq 2d + 1$. We obtain the improvement $n \geq 2d/k$ by leveraging a more detailed notion of strongly separating, as well as the positive homogeneity of max filtering. The proof makes use of a lift-and-project technique that first appeared in [4] and was subsequently applied in [29,73,19,62].
Proof of Theorem 12. Fix $n \geq 2d/k$, and let $Z \subseteq (\mathbb{R}^d)^n$ denote the set of $\{z_i\}_{i=1}^n$ for which the max filter bank $x \mapsto \{\langle\langle [z_i],[x] \rangle\rangle\}_{i=1}^n$ fails to separate $G$-orbits in $\mathbb{R}^d$. We will show that $Z$ is semialgebraic with dimension $\leq dn - 1$, from which the result follows. To do so, observe that $\{z_i\}_{i=1}^n \in Z$ precisely when there exists a witness, namely, a pair $(x, y)$ with $[x] \neq [y]$ such that $\langle\langle [z_i],[x] \rangle\rangle = \langle\langle [z_i],[y] \rangle\rangle$ for every $i \in \{1, \ldots, n\}$. In fact, we may assume that the witness $(x, y)$ satisfies $\|x\|^2 + \|y\|^2 = 1$ without loss of generality, since the set of witnesses for $\{z_i\}_{i=1}^n$ avoids $(0,0)$ and is closed under positive scalar multiplication by Lemma 2(c). This suggests the following lift of $Z$:
$$L := \Big\{ \big(\{z_i\}_{i=1}^n, (x,y)\big) : \|x\|^2 + \|y\|^2 = 1,\ [x] \neq [y],\ \langle\langle [z_i],[x] \rangle\rangle = \langle\langle [z_i],[y] \rangle\rangle\ \forall i \Big\}.$$
Since $G$ is semialgebraic, each condition $\langle\langle [z_i],[x] \rangle\rangle = \langle\langle [z_i],[y] \rangle\rangle$ is semialgebraic by Lemma 10. It follows that $L$ is semialgebraic. Next, we define the projection maps $\pi_1 : (\{z_i\}_{i=1}^n, (x,y)) \mapsto \{z_i\}_{i=1}^n$ and $\pi_2 : (\{z_i\}_{i=1}^n, (x,y)) \mapsto (x,y)$. Then $Z = \pi_1(L)$ is semialgebraic by Tarski-Seidenberg (Proposition 2.2.1 in [12]). To bound the dimension of $Z$, we first observe that
$$\pi_2^{-1}(x,y) \subseteq \{z : \langle\langle [z],[x] \rangle\rangle = \langle\langle [z],[y] \rangle\rangle\}^n,$$
and so $\dim(\pi_2^{-1}(x,y)) \leq n(d - k)$, since the max filtering map is $k$-strongly separating by assumption. We use the fact that $\pi_2(L)$ is contained in the unit sphere in $(\mathbb{R}^d)^2$ together with Lemma 1.10 in [37] to obtain
$$\dim(Z) \leq \dim(\pi_2(L)) + \max_{(x,y)} \dim(\pi_2^{-1}(x,y)) \leq (2d - 1) + n(d - k) \leq dn - 1,$$
where the last step is equivalent to the assumption $n \geq 2d/k$.
Corollary 13. For every finite subgroup $G \leq \mathrm{O}(d)$, generic templates $\{z_i\}_{i=1}^{2d}$ produce a max filter bank $x \mapsto \{\langle\langle [z_i],[x] \rangle\rangle\}_{i=1}^{2d}$ that separates $G$-orbits in $\mathbb{R}^d$.

Corollary 13 follows immediately from Theorem 12 and the following lemma:

Lemma 14. For every finite subgroup $G \leq \mathrm{O}(d)$, the corresponding max filtering map is 1-strongly separating.

Proof. Fix $x, y \in \mathbb{R}^d$ with $[x] \neq [y]$. Since $G$ is finite, the suprema defining the max filters are attained, and so
$$\{z : \langle\langle [z],[x] \rangle\rangle = \langle\langle [z],[y] \rangle\rangle\} \subseteq \bigcup_{g, h \in G} \{z : \langle z, gx - hy \rangle = 0\}.$$
Since the max filtering map is semialgebraic by Lemma 10, it follows that the left-hand set is also semialgebraic. Since $G$ is finite and $gx \neq hy$ for all $g, h \in G$, the right-hand set is a finite union of hyperplanes, hence semialgebraic with dimension $d - 1$, and the result follows.
We would like to know if a version of Corollary 13 holds for all semialgebraic groups, but we do not have a proof that the max filtering map is strongly separating for infinite groups in general. This motivates the following problem:

Problem 15. (a) For which semialgebraic groups is the max filtering map $k$-strongly separating? (b) How many templates are needed to separate orbits for a given group?
We identify a couple of interesting instances of Problem 15. First, we consider the case of complex phase retrieval (as in Example 7), where $V = \mathbb{C}^r$ and $G$ is the center of $\mathrm{U}(r)$. It is known that $n = 4r - 4 = 2\dim(V) - 4$ generic templates separate orbits for every $r$, and this is the optimal threshold for infinitely many $r$ [29], but there also exist 11 templates in $\mathbb{C}^4$ that separate orbits, for example [70].
As another example, consider the case where $V = \mathbb{R}^d$ and $G \cong S_d$ is the group of $d \times d$ permutation matrices. Then Corollary 13 gives that $2d$ generic templates separate orbits. However, it is straightforward to see that the templates $z_j := \sum_{i=1}^j e_i$ for $j \in \{1, \ldots, d\}$ also separate orbits, where $e_i$ denotes the $i$th standard basis element. Indeed, take $\operatorname{sort}(x)$ to have weakly decreasing entries. Then the first entry equals $\langle\langle [z_1],[x] \rangle\rangle$, while for each $j > 1$, the $j$th entry equals $\langle\langle [z_j],[x] \rangle\rangle - \langle\langle [z_{j-1}],[x] \rangle\rangle$. As such, this max filter bank determines $\operatorname{sort}(x)$, which is a separating invariant of $V/G$. Considering Theorem 12, one might suspect that the max filtering map is 2-strongly separating in this case, but this is not so. Indeed, the cone $C$ of sorted vectors in $\mathbb{R}^d$ has dimension $d$, and so there exists a subspace $H$ of codimension 1 that intersects the interior of $C$. Select any $x$ in the interior of $C$ and any unit vector $v \in H^\perp$, and then take $y = x + \epsilon v$ for $\epsilon > 0$ sufficiently small so that $y \in C$. Then $\langle\langle [z],[x] \rangle\rangle = \langle\langle [z],[y] \rangle\rangle$ for every $z$ with $\operatorname{sort}(z) \in H$, which is a semialgebraic set of dimension $d - 1$, and so the claim follows.
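The cumulative-sum templates $z_j$ can be checked numerically: first differences of the max filter values recover $\operatorname{sort}(x)$ (a brute-force sketch in small dimension; the helper name is ours):

```python
import itertools
import numpy as np

def max_filter_perm(z, x):
    """Brute force over S_d: max_{g in S_d} <z, g x>."""
    return max(float(z @ x[list(p)]) for p in itertools.permutations(range(len(x))))

d = 5
rng = np.random.default_rng(6)
x = rng.standard_normal(d)
# z_j = e_1 + ... + e_j for j = 1, ..., d
templates = [np.concatenate([np.ones(j), np.zeros(d - j)]) for j in range(1, d + 1)]

vals = [max_filter_perm(z, x) for z in templates]
# <<[z_j],[x]>> is the sum of the j largest entries, so first differences sort x
recovered = np.diff([0.0] + vals)
assert np.allclose(recovered, np.sort(x)[::-1])
```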

Low-complexity max filtering
In this subsection, we focus on the case in which $V \cong \mathbb{R}^d$. Naively, one may compute the max filtering map $(x, y) \mapsto \langle\langle [x],[y] \rangle\rangle$ over a finite group $G \leq \mathrm{O}(d)$ of order $m$ by computing $\langle x, gy \rangle$ for every $g \in G$ and then returning the maximum. This approach costs $O(md)$ operations. Of course, this is not possible when $G$ is infinite, and it is prohibitive when $G$ is finite but large. Interestingly, many of the groups that we encounter in practice admit a faster implementation. In particular, for many quotient spaces $V/G$, the quotient metric $d : V/G \times V/G \to \mathbb{R}$ is easy to compute, and by Lemma 2(f), the max filtering map is equally easy to compute in these cases:
$$\langle\langle [x],[y] \rangle\rangle = \tfrac{1}{2}\big(\|x\|^2 + \|y\|^2 - d([x],[y])^2\big).$$
In this subsection, we highlight a few examples of such quotient spaces before considering the harder setting of graphs.

Point clouds
Consider $V = \mathbb{R}^{k \times n}$ with the Frobenius inner product. We can represent a point cloud of $n$ points in $\mathbb{R}^k$ as a member of $V$ by arbitrarily labeling the points with column indices. In this setting, we identify members of $V$ that reside in a common $G$-orbit with $G \cong S_n$ permuting the columns. The resulting quotient metric is known as the 2-Wasserstein distance:
$$d([X],[Y]) = \min_{P \in \Pi(n)} \|X - YP\|_F,$$
where $\Pi(n)$ denotes the set of $n \times n$ permutation matrices and $\|\cdot\|_F$ denotes the Frobenius norm. The corresponding max filtering map is then given by
$$\langle\langle [X],[Y] \rangle\rangle = \max_{P \in \Pi(n)} \langle X, YP \rangle = \max_{P \in \operatorname{conv} \Pi(n)} \langle X, YP \rangle,$$
where $\operatorname{conv} \Pi(n)$ denotes the convex hull of $\Pi(n)$, namely, the doubly stochastic matrices. By this formulation, the max filtering map can be computed in polynomial time by linear programming. In the special case where $k = 1$, the max filtering map has an even faster implementation:
$$\langle\langle [x],[y] \rangle\rangle = \langle \operatorname{sort}(x), \operatorname{sort}(y) \rangle,$$
which can be computed in linearithmic time.
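The $k = 1$ shortcut can be verified against brute force over all permutations (a small sketch; this is just the rearrangement inequality):

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
x, y = rng.standard_normal(6), rng.standard_normal(6)

# Brute force over all 6! relabelings of the points
brute = max(float(x @ y[list(p)]) for p in itertools.permutations(range(6)))

# Rearrangement inequality: the max filter is <sort(x), sort(y)>
fast = float(np.sort(x) @ np.sort(y))

assert np.isclose(brute, fast)
```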

Circular translations
Consider the case where $V \cong \mathbb{R}^n$ is the space of vectors with entries indexed by the cyclic group $C_n := \mathbb{Z}/n\mathbb{Z}$, and $G \cong C_n$ is the group of circular translations $T_a$ defined by $T_a f(x) := f(x - a)$. Then the max filtering map is given by
$$\langle\langle [x],[y] \rangle\rangle = \max_{a \in C_n} (x \ast Ry)(a),$$
where $\ast$ denotes circular convolution and $R$ denotes the reversal operator. Thus, the max filtering map can be computed in linearithmic time with the help of the fast Fourier transform.
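A sketch comparing the brute-force maximum over circular translations with the FFT-based circular cross-correlation:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 64
x, y = rng.standard_normal(n), rng.standard_normal(n)

# Brute force over all n circular translations: max_a <x, T_a y>
brute = max(float(x @ np.roll(y, a)) for a in range(n))

# FFT-based circular cross-correlation computes all n inner products at once
corr = np.real(np.fft.ifft(np.fft.fft(x) * np.conj(np.fft.fft(y))))
fast = float(np.max(corr))

assert np.isclose(brute, fast)
```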

Shape analysis
In geometric morphometrics [53], it is common for data to take the form of a sequence of $n$ landmarks in $\mathbb{R}^k$ (where $k$ is typically 2 or 3) with a global rotation ambiguity. This corresponds to taking $V = \mathbb{R}^{k \times n}$ and $G \cong \mathrm{O}(k)$ acting on the left, and so the max filtering map is given by
$$\langle\langle [X],[Y] \rangle\rangle = \max_{Q \in \mathrm{O}(k)} \langle QX, Y \rangle = \|XY^\top\|_*,$$
where $\|\cdot\|_*$ denotes the nuclear norm. As such, the max filtering map can be computed in polynomial time with the aid of the singular value decomposition.
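A numerical sketch of this identity: a fine grid over $\mathrm{O}(2)$ (rotations and reflections) approaches the nuclear norm of $YX^\top$ from below.

```python
import numpy as np

rng = np.random.default_rng(4)
X, Y = rng.standard_normal((2, 5)), rng.standard_normal((2, 5))  # 5 landmarks in R^2

# Grid over O(2): rotations R(theta) and reflections F(theta)
best = -np.inf
for th in np.linspace(0.0, 2.0 * np.pi, 20000, endpoint=False):
    c, s = np.cos(th), np.sin(th)
    R = np.array([[c, -s], [s, c]])   # det +1
    F = np.array([[c, s], [s, -c]])   # det -1
    best = max(best, float(np.sum((R @ X) * Y)), float(np.sum((F @ X) * Y)))

# Closed form: max_{Q in O(2)} <QX, Y> equals the nuclear norm of Y X^T
nuclear = float(np.sum(np.linalg.svd(Y @ X.T, compute_uv=False)))

assert best <= nuclear + 1e-12
assert abs(best - nuclear) < 1e-3
```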

Separation hierarchy for weighted graphs
Here, we focus on the case in which $V$ is the vector space of real symmetric $n \times n$ matrices with zero diagonal and $G \cong S_n$ is the group of linear isometries of the form $A \mapsto PAP^{-1}$, where $P$ is a permutation matrix. We think of $V/G$ as the space of weighted graphs on $n$ vertices (up to isomorphism). One popular approach for separating graphs uses message-passing graph neural networks, but the separation power of such networks is limited by the so-called Weisfeiler-Leman test [75,77,54]. For example, message-passing graph neural networks fail to distinguish $C_3 \sqcup C_3$ from $C_6$. See [64,43] for surveys of this rapidly growing literature.
As an alternative, we consider a max filtering approach. Given two adjacency matrices $A_1$ and $A_2$, Lemma 2(f) implies that the corresponding graphs are isomorphic if and only if
$$\langle\langle [A_1],[A_2] \rangle\rangle = \tfrac{1}{2}\big(\|A_1\|_F^2 + \|A_2\|_F^2\big).$$
As such, max filtering is graph isomorphism-hard in this setting. Interestingly, there exist $A \in V$ for which the map $X \mapsto \langle\langle [A],[X] \rangle\rangle$ can be computed in linearithmic time, and furthermore, these easy-to-compute max filters help with separating orbits. To see this, we follow [2], which uses color coding to facilitate computation by dynamic programming.
Given $n, k \in \mathbb{N}$ with $n \geq k$, an $(n,k)$-color coding of size $N$ is a sequence of colorings $f_1, \ldots, f_N : [n] \to [k]$ such that for every $k$-element subset $S \subseteq [n]$, there exists $i \in [N]$ for which $f_i(S) = [k]$.

Lemma 17. Given $n, k \in \mathbb{N}$ with $n \geq k$, there exists an $(n,k)$-color coding of size $\lceil k e^k \log n \rceil$.
Proof. We show that $N$ random colorings form a color coding with positive probability. For each $i \in [N]$ and $k$-element subset $S \subseteq [n]$, we have $\mathbb{P}\{f_i(S) = [k]\} = k!/k^k$, and so the union bound gives
$$\mathbb{P}\{\exists S \text{ with } f_i(S) \neq [k] \ \forall i\} \leq \binom{n}{k}\big(1 - k!/k^k\big)^N.$$
It suffices to select $N$ so that the right-hand side is strictly smaller than 1. The result follows by applying the bounds $\binom{n}{k} \leq n^k$, $k! \geq (k/e)^k$, and $(1 - 1/t)^t < 1/e$ for $t > 1$.

Algorithm 1 computes the max filter with a small weighted tree using a color coding and dynamic programming. Lemma 17 implies that Algorithm 1 has runtime $e^{O(k \log k)} n^2 \log n$, which is linearithmic in the size of the data when $k$ is fixed. Notice that max filtering with the path on $k = 4$ vertices already separates the graphs $C_3 \sqcup C_3$ and $C_6$. Furthermore, using techniques from [2], one can modify Algorithm 1 to max filter with any template graph $H$ on $k$ vertices, though the runtime becomes $e^{O(k \log k)} n^{t+1} \log n$, where $t$ is the treewidth of $H$.
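The separation claim can be checked by brute force in lieu of Algorithm 1 (which we do not reproduce here): max filtering with the path on 4 vertices distinguishes $C_3 \sqcup C_3$ from $C_6$, since the 6-cycle contains the path but two disjoint triangles do not.

```python
import itertools
import numpy as np

def adj(n, edges):
    """Symmetric 0/1 adjacency matrix with zero diagonal."""
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return A

def graph_max_filter(A, X):
    """Brute force over S_n: max_P <A, P X P^T>_F."""
    n = len(A)
    best = -np.inf
    for p in itertools.permutations(range(n)):
        P = np.eye(n)[list(p)]
        best = max(best, float(np.sum(A * (P @ X @ P.T))))
    return best

P4 = adj(6, [(0, 1), (1, 2), (2, 3)])                            # path on 4 vertices, padded
C6 = adj(6, [(i, (i + 1) % 6) for i in range(6)])                # 6-cycle
C3C3 = adj(6, [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)])  # two triangles

# Each matched edge contributes 2 to the Frobenius inner product.
assert graph_max_filter(P4, C6) == 6.0    # all 3 path edges embed in C6
assert graph_max_filter(P4, C3C3) == 4.0  # at most 2 path edges fit in C3 u C3
```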
Letting $H(k, t)$ denote the set of weighted graphs on at most $k$ vertices with treewidth at most $t$, the classes $H(k, t)$ form a nested hierarchy of template sets of increasing separation power and computational cost. Corollary 13 gives that $n(n-1)$ generic templates from $H(n, n-1)$ separate all isomorphism classes of weighted graphs on $n$ vertices. It would be interesting to study the separation power of templates of logarithmic order and bounded treewidth.

Stability of max filtering

Bilipschitz max filter banks
Upper and lower Lipschitz bounds are used to quantify the stability of a mapping between metric spaces, but it is generally difficult to estimate such bounds; see [9,7,17,45,5] for examples from phase retrieval and [79,19,18,6] for other examples. In this subsection, we prove the following: for a finite group $G \leq \mathrm{O}(d)$ of order $m$ and suitable $n$ and $\delta$, draw independent random vectors $z_1, \ldots, z_n \sim \mathrm{Unif}(S^{d-1})$; with probability $\geq 1 - e^{-n/(12m^2)}$, the max filter bank $\Phi : \mathbb{R}^d/G \to \mathbb{R}^n$ with templates $\{z_i\}_{i=1}^n$ has lower Lipschitz bound $\delta$ and upper Lipschitz bound $n^{1/2}$. This result distinguishes max filtering from separating polynomial invariants, which do not necessarily enjoy upper or lower Lipschitz bounds [19]. In Theorem 18, we may take the embedding dimension to be $n = \Theta^*(m^2 d)$ with bilipschitz bounds $\Theta^*(\frac{1}{m^2 d^{1/2}})$ and $\Theta^*(m d^{1/2})$, where $\Theta^*(\cdot)$ suppresses logarithmic factors. For comparison, we consider a couple of cases that have already been studied in the literature. First, the case where $G = \{\pm \mathrm{id}\}$ reduces to the setting of real phase retrieval (as in Example 7), where it is known that there exist $n = \Theta(d)$ templates that deliver lower and upper Lipschitz bounds $1/4$ and $4$, say; see equation (17) in [9]. Notably, these bounds do not get worse as $d$ gets large. It would be interesting if a version of Theorem 18 held for infinite groups, but we do not expect it to hold for infinite-dimensional inner product spaces. Case in point, for $V = \ell^2$ with $G = \{\pm \mathrm{id}\}$, it was shown in [17] that for every choice of templates, the map is not bilipschitz.
Another interesting phenomenon from finite-dimensional phase retrieval is that separating implies bilipschitz; see Lemma 16 and Theorem 18 in [9] and Proposition 1.4 in [17]. This suggests the following:

Problem 19. Is every separating max filter bank $\Phi : \mathbb{R}^d/G \to \mathbb{R}^n$ bilipschitz?
If the answer to Problem 19 is "yes," then Corollary 13 implies that $2d$ generic templates produce a bilipschitz max filter bank $\Phi : \mathbb{R}^d/G \to \mathbb{R}^{2d}$. Theorem 18 follows immediately from Lemmas 20 and 21 below. Our proof uses the following notion that was introduced in [1]: we say $\{z_i\}_{i=1}^n$ exhibits $(k, \delta)$-projective uniformity if for every $x \in S^{d-1}$, there exist at least $k$ indices $i$ for which $|\langle z_i, x \rangle| \geq \delta$. For the lower Lipschitz bound, fix $x, y \in \mathbb{R}^d$ with $[x] \neq [y]$, and then bound each coordinate of $\Phi([x]) - \Phi([y])$ for $i \in \{1, \ldots, n\}$; combining with (2) gives the result.
The following lemma gives that random templates exhibit projective uniformity.
Lemma 21 (cf. Lemma 6.9 in [1]). Select $p \in (0,1)$ and take $n$ and $\delta$ as in (3). Draw independent random vectors $z_1, \ldots, z_n \sim \mathrm{Unif}(S^{d-1})$. Then $\{z_i\}_{i=1}^n$ exhibits $(\lceil pn \rceil, \delta)$-projective uniformity with probability $\geq 1 - e^{-pn/12}$.

Proof. Put $k := \lceil pn \rceil$, let $E$ denote the failure event that $\{z_i\}_{i=1}^n$ does not have $(k, \delta)$-projective uniformity, and let $N_\delta$ denote a $\delta$-net of $S^{d-1}$ of minimum size. Note that if $v$ is within $\delta$ of $x$, then for every $z_i$, it holds that $|\langle z_i, x \rangle| \geq |\langle z_i, v \rangle| - \delta$. Thus, we may pass to the $\delta$-net, where the second inequality applies the union bound and the rotation invariance of the distribution $\mathrm{Unif}(S^{d-1})$. A standard volume comparison argument gives $|N_\delta| \leq (\frac{2}{\delta} + 1)^d$. The final probability concerns a sum of independent Bernoulli variables with some success probability $q = q(d, \delta)$, which can be estimated using the multiplicative Chernoff bound, provided $p > q$. Next, we verify that $q(d, \delta) \leq \frac{p}{2}$. Denoting $g \sim \mathrm{N}(0, I_d)$, we use the facts that $|\langle g, e_1 \rangle|$ has half-normal distribution and $\|g\|^2$ has chi-squared distribution with $d$ degrees of freedom. We select $t := (2d + 3\log(\frac{4}{p}))^{1/2}$ so that the second term equals $\frac{p}{4}$, and then our choice (3) for $\delta$ ensures that the first term equals $\frac{p}{4}$. Overall, the result follows, where the last step applies our assumption that $n \geq \frac{12d}{p}\log(\frac{2}{\delta} + 1)$.

Mallat-type stability to diffeomorphic distortion
In this subsection, we focus on the case in which $V = L^2(\mathbb{R}^d)$ and $G$ is the group of translation operators $T_a$ defined by $T_a f(x) := f(x - a)$ for $a \in \mathbb{R}^d$. In this case, the max filtering map can be expressed as
$$\langle\langle [h],[f] \rangle\rangle = \sup_{a \in \mathbb{R}^d} \langle T_a h, f \rangle = \sup_{a \in \mathbb{R}^d} (f \ast Rh)(a),$$
where $R$ denotes the reversal operator defined by $Rh(x) := h(-x)$ and $\ast$ denotes convolution.
(Of course, the supremum of a member of $L^2(\mathbb{R}^d)$ is only defined up to almost-everywhere equivalence, but $f \ast Rh$ is continuous, and so its supremum is well defined.) Our motivation for this setting stems from image analysis, in which case $d = 2$. For a familiar example, consider the task of classifying handwritten digits. Intuitively, each class is translation invariant, and so it makes sense to treat images as members of $V/G$. In addition, images that are slight elastic distortions of each other should be sent to nearby points in the feature domain. The fact that image classification is invariant to such distortions has been used to augment the MNIST training set and boost classification performance [66]. Instead of using data augmentation to learn distortion-invariant features, it is desirable to restrict to feature maps that already exhibit distortion invariance. (Indeed, such feature maps would require fewer parameters to train.) This compelled Mallat to introduce his scattering transform [52], which has since played an important role in the theory of invariant machine learning [15,16,71,39,57]. Mallat used the following formalism to analyze the stability of the scattering transform to distortion.
Given a diffeomorphism $g \in C^1(\mathbb{R}^d)$, we consider the corresponding distortion operator $L_g$ defined by $L_g f(x) := f(g^{-1}(x))$. It will be convenient to interact with the vector field $\tau := \mathrm{id} - g^{-1} \in C^1(\mathbb{R}^d)$, since $L_g f(x) = f(x - \tau(x))$. For example, if $\tau(x) = a$ for every $x \in \mathbb{R}^d$, then $L_g$ is translation by $a$. In what follows, $J\tau(x) \in \mathbb{R}^{d \times d}$ denotes the Jacobian matrix of $\tau$ at $x$.

Theorem 22. Take any continuously differentiable $h \in L^2(\mathbb{R}^d)$ for which (4) and its gradient analogue are bounded. There exists $C(h) > 0$ such that for every $f \in L^2(\mathbb{R}^d)$ and every diffeomorphism $g \in C^1(\mathbb{R}^d)$ with $\sup_{x \in \mathbb{R}^d} \|J\tau(x)\|_{2 \to 2} \leq \frac{1}{2}$, it holds that
$$\big| \langle\langle [h],[L_g f] \rangle\rangle - \langle\langle [h],[f] \rangle\rangle \big| \leq C(h) \cdot \|f\|_2 \cdot \sup_{x \in \mathbb{R}^d} \|J\tau(x)\|_{2 \to 2}.$$

This matches Mallat's bound [52] on the stability of the scattering transform to diffeomorphic distortion, except our bound has no Hessian term. The proof of Theorem 22 follows almost immediately from the following modification of Lemma E.1 in [52], which bounds the commutator between the filter and the distortion by the magnitude of the distortion:

Lemma 23. Take $h$ as in Theorem 22, and consider the linear operator $Z_h$ defined by $Z_h f := h \ast f$. There exists $C(h) > 0$ such that for every diffeomorphism $g \in C^1(\mathbb{R}^d)$ with $\sup_{x \in \mathbb{R}^d} \|J\tau(x)\|_{2 \to 2} \leq \frac{1}{2}$, the commutator $L_g Z_h - Z_h L_g$ has norm at most $C(h) \cdot \sup_{x \in \mathbb{R}^d} \|J\tau(x)\|_{2 \to 2}$.

Assuming Lemma 23 for the moment, we can prove Theorem 22.
Proof of Theorem 22.The change of variables a = g −1 (a ) gives and so the result follows from Lemma 23.
The rest of this section proves Lemma 23.Our proof follows some of the main ideas in the proof of Lemma E.1 in [52].
and so it suffices to bound two factors. We first bound the second factor. For $f \in L^2(\mathbb{R}^d)$, a change of variables gives a bound in terms of $\det(I_d - J\tau(x))$. For $x \in \mathbb{R}^d$, the fact that $\|J\tau(x)\|_{2\to2} \leq \frac{1}{2}$ controls this determinant, and so combining with the above estimate gives the desired bound. It remains to bound the first factor. To this end, one may verify that $K$ can be expressed as an integral operator with kernel $k(x, u) = k_1(x,u) + k_2(x,u) + k_3(x,u)$. We will bound the $L^2$ norms of every $k(x, \cdot)$ and $k(\cdot, u)$, and then appeal to Young's inequality for integral operators. First, we analyze $k_1$. Letting $p_1 : [0,1] \to \mathbb{R}^d$ denote the parameterized line segment of constant velocity from $(I_d - J\tau(u))(x - u)$ to $x - u$, we obtain a bound in terms of $\sup_{t \in [0,1]} \|p_1(t)\|$. To bound the first factor, let $C_\infty(h) > 0$ denote a simultaneous bound on the absolute value and 2-norm of (4). To use this, we bound $\inf_{t \in [0,1]} \|p_1(t)\|_2$ from below, which allows us to further bound (6). Next, we analyze $k_2$. Since $\|J\tau(x)\|_{2\to2} \leq \frac{1}{2}$ by assumption, Bernoulli's inequality applies, as does a convexity bound. Finally, we analyze $k_3$. Letting $p_2 : [0,1] \to \mathbb{R}^d$ denote the parameterized line segment of constant velocity from $r$ to $r + s$, we bound the three factors of (9) in turn. For the first factor of (9), we bound $|\det(\cdot)|$. To bound the second factor of (9), we use our bound $C_\infty(h) > 0$ on (4). To do so, we bound $\inf_{t \in [0,1]} \|p_2(t)\|_2$ from below, where $p_3 : [0,1] \to \mathbb{R}^d$ is the parameterized line segment of constant velocity from $u$ to $x$. Finally, we apply (10) to bound the third factor of (9). We combine these estimates to obtain a bound on (9). Then (7), (8), and (11) together imply a bound on $|k(x,u)|$. Importantly, this is a bounded function of $x - u$ that decays like $\|x - u\|^{-(d+1)}$ as recorded above. By integrating the square, this simultaneously bounds the $L^2$ norm of every $k(x, \cdot)$ and $k(\cdot, u)$ by a quantity of the form $C_0(h) \cdot \sup_{z \in \mathbb{R}^d} \|J\tau(z)\|_{2\to2}$. By Young's inequality for integral operators (see Theorem 0.3.1 in [67], for example), it follows that the commutator is bounded accordingly. Combining with (5) then gives the result with $C(h) := 2d \cdot C_0(h)$.

Template selection for classification

Classifying characteristic functions
In this subsection, we focus on the case in which $V = L^2(\mathbb{R}^d)$ and $G$ is the group of translation operators. Suppose we have $k$ distinct $G$-orbits of indicator functions of compact subsets of $\mathbb{R}^d$. According to the following result, there is a simple classifier based on a size-$k$ max filter bank that correctly classifies these orbits. (This cartoon setting enjoys precursors in [24,23].)

Theorem 24. Given compact sets $S_1, \ldots, S_k \subseteq \mathbb{R}^d$ of positive measure satisfying $[1_{S_i}] \neq [1_{S_j}]$ for all $i \neq j$, there exist templates $z_1, \ldots, z_k \in L^2(\mathbb{R}^d)$ such that for every $i$, the index $j$ that maximizes $\langle\langle [z_j],[1_{S_i}] \rangle\rangle$ is $j = i$.

Proof. By compactness, there exists $r > 0$ such that every $S_i$ is contained in the closed ball centered at the origin with radius $r$. For reasons that will become apparent later, we take $B$ to be the closed ball centered at the origin with radius $3r$, and we define each template $z_i$ to be supported in $B$. Then for every $i$ and $j$, it holds that the relevant inner products can be bounded. In the special case where $j = i$, this provides a lower bound. We consider all $j \neq i$ in two cases.
Since $R 1_{S_i}, 1_{S_j} \in L^2(\mathbb{R}^d)$, it holds that the convolution $R 1_{S_i} \ast 1_{S_j}$ is continuous, and since $S_i$ and $S_j$ are compact, the convolution has compact support. Thus, the extreme value theorem gives that the convolution achieves its supremum, meaning there exists $a \in \mathbb{R}^d$ at which the maximum is attained. Indeed, equality in the bound would require $S_i = S_j + a$ (modulo null sets), which violates the assumption $[1_{S_i}] \neq [1_{S_j}]$. Overall, we combine (12), (13), and (14) to get the claim in this case. Case II: $j \neq i$ and the remaining configuration, in which case we are done; otherwise, by continuity and compactness, the extreme value theorem produces $a \in \mathbb{R}^d$ at which the maximum is attained. (This is why we defined $B$ to have radius $3r$.) As before, we combine (15) and (16) to get the claim.
For each $i$, assume $S_i$ is translated so that it is contained in the smallest possible ball centered at the origin, and let $r_i$ denote the radius of this ball. The proof of Theorem 24 gives that each template $z_i$ is supported in a closed ball of radius $R := 3\max_i r_i$. The fact that these templates are localized bears some consequence for certain image articulation manifolds [35]. In particular, for each $\pi \colon \{1,\ldots,k\} \to \mathbb{N} \cup \{0\}$, let $M_\pi \subseteq L^2(\mathbb{R}^d)$ denote the manifold of images of the form … where $\|a(i,j) - a(i',j')\| > 4R$ for all $(i,j) \neq (i',j')$.
Thanks to the $4R$ spacing, each translate of each template interacts with at most one component of the image, and so for every $f \in M_\pi$, it holds that … In particular, the same max filter bank can be used to determine the support of $\pi$. As an example, if some multiset of characters is typed on a page in a common font and with sufficient separation, then the max filter bank from Theorem 24 that distinguishes the characters can be used to determine which ones appear on the page.

Classifying mixtures of stationary processes
In this subsection, we focus on the case in which $V = \mathbb{R}^n$ and $G \cong C_n$ is the group of circular translation operators. A natural $G$-invariant probability distribution is a multivariate Gaussian with mean zero and circulant covariance, and so we consider the task of classifying a mixture of such distributions. One-dimensional textures can be modeled in this way, especially if the covariance matrix has a small bandwidth so that distant pixels are statistically independent. A standard approach for this problem is to estimate the first- and second-order moments given a random draw. As we will soon see, one can alternatively classify with high accuracy by thresholding a single max filter. In what follows, we make use of Thompson's part metric on the set of positive definite matrices:
$$d(A,B) := \max\{\log \lambda_{\max}(A^{-1}B),\ \log \lambda_{\max}(B^{-1}A)\}.$$
We also let $A_k$ denote the leading $k \times k$ principal submatrix of $A$.
Theorem 25. Fix $C > \log 2$, take $n, w \in \mathbb{N}$ such that $k := n/2 \ge w$, and consider any positive definite $A, B \in \mathbb{R}^{n\times n}$ that are circulant with bandwidth $w$ and satisfy … Then there exists $z \in \mathbb{R}^n$ supported on an interval of length $k$ and a threshold $\theta \in \mathbb{R}$ such that for every mixture $M$ of $N(0,A)$ and $N(0,B)$ and every draw $x \sim M$, the comparison …

Proof. First, we observe that … Without loss of generality, we may assume $\lambda_{\max}(A$ …, and define $z \in \mathbb{R}^n$ to be supported in its first $k$ entries as the subvector … Consider …, where each $\langle T_a z, x_1 \rangle$ has Gaussian distribution with mean zero and variance $(T_a z)^\top A (T_a z)$, which in turn equals $z^\top A z$ since $A$ is circulant. Denoting $Z \sim N(0,1)$, a union bound then gives … for $t \ge 0$. This failure probability is $o_{n\to\infty;c_1}(1)$ by taking $t := \sqrt{c_1 \cdot z^\top A z \cdot \log n}$ for any $c_1 > 2$. Next, consider $x_2 \sim N(0,B)$, and take any subset … In this case, each $\langle T_a z, x_2 \rangle$ has Gaussian distribution with mean zero and variance $z^\top B z$. Furthermore, since $k \ge w$, these random variables have pairwise covariance zero, and since they are Gaussian, they are therefore independent. For independent $Z_1, \ldots, Z_k \sim N(0,1)$ and $t \ge 0$, we have $\mathbb{P}(\max_{i\in\{1,\ldots,k\}} \cdots)$ … where the last step follows from a standard lower bound on the tail of a standard Gaussian distribution. Take $t := \sqrt{c_2 \log k}$ to get … $< 2$, and then (18) implies … Thus, $\theta_1 \le \theta_2$, and so the result follows by taking $\theta := (\theta_1 + \theta_2)/2$.

Given a mixture of $k$ Gaussians with covariance matrices that pairwise satisfy (17), we can perform multiclass classification by a one-vs-one reduction. Indeed, the binary classifier in Theorem 25 can be applied to all $\binom{k}{2}$ pairs of Gaussians, in which case we correctly classify the latent mixture component with high probability (provided $k$ is fixed and $n \to \infty$).
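The thresholding classifier can be sketched numerically. The instance below is our own degenerate choice, not the one constructed in the proof: $A = 4I$ and $B = I$ are circulant with bandwidth $0$ and Thompson distance $\log 4 > \log 2$, and we hand-pick the template and threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
z = np.zeros(n); z[0] = 1.0   # template supported on an interval of length 1
theta = 4.0                   # between sqrt(2 log n) ≈ 3.3 and 2 sqrt(2 log n) ≈ 6.7

def max_filter(x):
    # <<[z],[x]>> = max over circular translations a of <T_a z, x>
    return max(np.dot(np.roll(z, a), x) for a in range(n))

correct = 0
for _ in range(200):
    label = int(rng.integers(2))         # latent mixture component
    sigma = 2.0 if label == 0 else 1.0   # draws from N(0, 4I) vs N(0, I)
    x = sigma * rng.standard_normal(n)
    correct += int((max_filter(x) > theta) == (label == 0))
print(correct / 200)                     # high accuracy
```

With $z$ a standard basis vector, the max filter is just $\max_i x_i$, so the two classes concentrate near $2\sqrt{2\log n}$ and $\sqrt{2\log n}$ respectively, and the threshold between them misclassifies only rarely.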
Interestingly, max filters can also distinguish between stationary processes with identical first- and second-order moments. For example, $x \sim \mathrm{Unif}(\{\pm 1\}^n)$ and $y \sim N(0, I_n)$ are both stationary with mean zero and identity covariance. If $z \in \mathbb{R}^n$ is a standard basis element, then with high probability, it holds that … This indicates that max filters incorporate higher-order moments.
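A quick numerical check of this phenomenon (our own sketch): with a standard basis template, the max filter of a Rademacher vector is exactly $1$, while that of a Gaussian vector concentrates near $\sqrt{2\log n} \approx 3.7$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
z = np.zeros(n); z[0] = 1.0   # standard basis template e_1

def max_filter(x):
    # max over circular translations of <T_a z, x>; with z = e_1 this is max_i x_i
    return max(np.dot(np.roll(z, a), x) for a in range(n))

x = rng.choice([-1.0, 1.0], size=n)   # Unif({±1}^n): max filter equals 1
y = rng.standard_normal(n)            # N(0, I_n): max filter ≈ sqrt(2 log n)
print(max_filter(x), max_filter(y))
```

Both processes have identical mean and covariance, yet the two max filter values are far apart — a gap no linear or quadratic statistic can see.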

Subgradients
In this subsection, we focus on the case in which $V = \mathbb{R}^d$ and $G$ is a closed subgroup of $O(d)$. The previous subsections carefully designed templates to classify certain data models. For real-world classification problems, one is expected to train a classifier on a given training set of labeled data. To do so, one selects a parameterized family of classifiers and then locally minimizes some notion of training loss over this family. This is feasible provided the classifier is a differentiable function of the parameters. We envision a classifier in which the first layer is a max filter bank, and so this subsection establishes how to differentiate a max filter with respect to the template. This is made possible by convexity; see Lemma 2(d).
Every convex function $f \colon \mathbb{R}^d \to \mathbb{R}$ has a subdifferential $\partial f \colon \mathbb{R}^d \to 2^{\mathbb{R}^d}$ defined by
$$\partial f(x) := \{u \in \mathbb{R}^d : f(z) \ge f(x) + \langle z - x, u \rangle \text{ for all } z \in \mathbb{R}^d\}.$$
For a fixed $x \in \mathbb{R}^d$ and $u \in \partial f(x)$, it is helpful to interpret the graph of $z \mapsto f(x) + \langle z - x, u \rangle$ as a supporting hyperplane of the epigraph of $f$ at $x$. For example, the absolute value function over $\mathbb{R}$ has the following subdifferential:
$$\partial|\cdot|(x) = \begin{cases} \{-1\} & \text{if } x < 0, \\ [-1,1] & \text{if } x = 0, \\ \{1\} & \text{if } x > 0. \end{cases}$$
Indeed, this gives the slopes of the hyperplanes that support the epigraph of the absolute value function. Following [74], we will determine the subdifferential of the max filtering map by first finding its directional derivatives. In general, the directional derivative of $f$ at $x$ in the direction of $v \neq 0$ is given by
$$f'(x; v) := \lim_{t \to 0^+} \frac{f(x + tv) - f(x)}{t} = \max_{u \in \partial f(x)} \langle v, u \rangle.$$
The second equality is a standard result; see for example Theorem 23.4 in [61]. It will be convenient to denote the set $G(x, y) := \arg\max_{g \in G} \langle x, gy \rangle$.
Observe that $G(x, y)$ is a closed subset of $G$ since the map $g \mapsto \langle x, gy \rangle$ is continuous.
Proof. For each $t > 0$ and $g \in G(x, y)$, we have … For each $t > 0$, select $g_t \in G(x + tv, y)$. Then … We rearrange and combine (19) and (20) to get … for all $t > 0$. Select a sequence $t_n \to 0^+$ such that $g_{t_n}$ converges to some $g \in G$ (by passing to a subsequence if necessary). Then continuity implies …

Proof. First, we claim that for every $g \in G(x, y)$, the vector $gy$ is in the subdifferential $\partial\,\langle\!\langle [\cdot], [y] \rangle\!\rangle(x)$. Indeed, for every $h \in \mathbb{R}^d$, we have …
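The claim that $gy$ lies in the subdifferential for any maximizing $g$ is easy to check numerically. The sketch below uses the cyclic shift group (our own choice of group and names) and verifies the supporting-hyperplane inequality at random points:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 16
y = rng.standard_normal(n)

def max_filter(x):
    # <<[x],[y]>> over the cyclic group of circular shifts
    return max(np.dot(x, np.roll(y, a)) for a in range(n))

def subgradient(x):
    # gy for any g in G(x, y) attaining the max lies in the subdifferential
    a_star = max(range(n), key=lambda a: np.dot(x, np.roll(y, a)))
    return np.roll(y, a_star)

x = rng.standard_normal(n)
u = subgradient(x)
# supporting-hyperplane inequality: f(h) >= f(x) + <h - x, u> for all h
gaps = [max_filter(h) - max_filter(x) - np.dot(h - x, u)
        for h in rng.standard_normal((100, n))]
print(min(gaps))   # nonnegative
```

The inequality holds exactly: $f(h) \ge \langle h, gy \rangle = f(x) + \langle h - x, gy \rangle$ whenever $g$ attains the max at $x$, which is the one-line argument the proof formalizes.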

Random templates and limit laws
While the previous subsection was concerned with the differentiability needed to optimize a feature map, it is well known that random feature maps suffice for various tasks (e.g., Johnson-Lindenstrauss maps [46] and random kitchen sinks [60]).In fact, our bilipschitz result (Theorem 18) uses random templates.As such, one may be inclined to use random templates to produce features for classification.
In this subsection, we focus on the case in which $V = \mathbb{R}^d$ and $G \cong S_d$ is the group of $d \times d$ permutation matrices. Consider a max filter bank consisting of independent standard gaussian templates $z_1, \ldots, z_n \in \mathbb{R}^d$. Then … When $d$ is large, we expect the vectors $\mathrm{sort}(z_i)$ to exhibit little variation, and so we are inclined to perform dimensionality reduction. Figure 1 illustrates that the principal components of $\{\mathrm{sort}(z_i)\}_{i=1}^n$ exhibit a high degree of regularity. To explain this regularity, select $Q \colon (0,1) \to \mathbb{R}$ so that $Q^{-1}$ is the cumulative distribution function of the standard normal distribution. Take $z \sim N(0, I_d)$ and put $s := \mathrm{sort}(z)$. Denoting $p_i := \frac{i}{d+1}$ and $q_i := 1 - p_i$, then Section 4.6 in [30] gives … The principal components of $\{\mathrm{sort}(z_i)\}_{i=1}^n$ approximate the top eigenvectors of the covariance matrix, which in turn approximate discretizations of eigenfunctions of the integral operator with kernel $K \colon (0,1)^2 \to \mathbb{R}$ defined by … The following result expresses these eigenfunctions in terms of the probabilist's Hermite polynomials, which are defined by the following recurrence:
$$\mathrm{He}_0(x) = 1, \qquad \mathrm{He}_1(x) = x, \qquad \mathrm{He}_{n+1}(x) = x\,\mathrm{He}_n(x) - n\,\mathrm{He}_{n-1}(x).$$
As such, instead of max filtering with independent gaussian templates, one can efficiently capture the same information by taking the inner product between $\mathrm{sort}(x)$ and discretized versions of the eigenfunctions $p_n \circ Q$. To reduce the dimensionality, one can simply use fewer eigenfunctions. To prove Theorem 28, we will use the following lemma; here, $\varphi \colon \mathbb{R} \to \mathbb{R}$ denotes the probability density function of the standard normal distribution. … The proof of Lemma 29 follows quickly from the recurrence (22).
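A short experiment (our own sketch) reproduces the regularity behind Figure 1: the sorted gaussian vectors concentrate around the quantile vector $Q(p_i)$ with $p_i = i/(d+1)$, and the principal components are the top eigenvectors of the sample covariance of the sorted vectors.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
d, n = 1000, 2000
S = np.sort(rng.standard_normal((n, d)), axis=1)   # rows are sort(z_i)

# sort(z) concentrates around the quantile vector Q(p_i), p_i = i/(d+1)
Q = NormalDist().inv_cdf
quantiles = np.array([Q(i / (d + 1)) for i in range(1, d + 1)])
print(np.max(np.abs(S.mean(axis=0) - quantiles)))  # small deviation

# principal components of {sort(z_i)}: top eigenvectors of the sample covariance
eigvals, eigvecs = np.linalg.eigh(np.cov(S, rowvar=False))
top6 = eigvecs[:, ::-1][:, :6]                     # compare with Figure 1
```

The deviation from the quantile vector is largest near the endpoints of the interval, which matches the edge effects visible in Figure 1.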
Proof of Theorem 28. We compute $L(p_n \circ Q)$ by splitting the integral: … $\int_0^x y\,Q'(y)\,p_n(Q(y))\,dy$ … For both $I_1$ and $I_2$, we integrate by parts with … where the middle step follows from Lemma 29(a). Then Lemma 29(b) gives … These combine (and mostly cancel) to give the result.

Numerical examples
In this section, we use max filters as feature maps for various real-world learning tasks.
Example 30 (Voting districts). The one person, one vote principle insists that in each state, different voting districts must have nearly the same number of constituents. This principle is enforced with the help of a decennial redistricting process based on U.S. Census data. Interestingly, each state assembly applies its own process for redistricting, and partisan approaches can produce unwanted gerrymandering. Historically, gerrymandering is detected by how contorted a district's shape looks; for example, the Washington Post article [44] uses a particular geometric score to identify the top 10 most gerrymandered voting districts of the 113th Congress.
As an alternative, we visualize the distribution of district shapes with the help of max filtering. The shape files for all voting districts of the 116th Congress are available in [68]. We center each district at the origin and scale it to have unit perimeter, and then we sample the boundary at $n = 50$ equally spaced points. This results in a $2 \times 50$ matrix representation of each district. However, the same district shape may be represented by many matrices corresponding to an action of $O(2)$ on the left and an action of $C_n$ that cyclically permutes columns. Hence, it is appropriate to apply max filtering with $G \cong O(2) \times C_n$. It is convenient to identify $\mathbb{R}^{2\times 50}$ with $\mathbb{C}^{50}$ so that the corresponding max filter can be computed efficiently with the help of the fast Fourier transform. We max filter with 100 random templates to embed these districts in the feature space $\mathbb{R}^{100}$, and then we visualize the result using PCA; see Figure 2.
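A sketch of this FFT computation, under our own conventions for the $O(2)$ action: identifying $\mathbb{R}^2$ with $\mathbb{C}$, rotations act as unit-modulus scalars (so maximizing over them turns a real part into an absolute value) and reflections act as complex conjugation; the paper's exact reflection convention may differ.

```python
import numpy as np

def max_filter_O2_Cn(z, x):
    """Sketch of a max filter for G ≅ O(2) x C_n acting on C^n.

    All cyclic shifts are handled at once by FFT cross-correlation;
    the absolute value maximizes over rotations, and the second branch
    handles reflections (complex conjugation, under our convention).
    """
    corr = np.fft.ifft(np.fft.fft(x) * np.conj(np.fft.fft(z)))
    corr_refl = np.fft.ifft(np.fft.fft(np.conj(x)) * np.conj(np.fft.fft(z)))
    return max(np.abs(corr).max(), np.abs(corr_refl).max())

rng = np.random.default_rng(0)
z = rng.standard_normal(50) + 1j * rng.standard_normal(50)
x = rng.standard_normal(50) + 1j * rng.standard_normal(50)

# invariance checks: rotating, shifting, or reflecting x leaves the value fixed
v0 = max_filter_O2_Cn(z, x)
v1 = max_filter_O2_Cn(z, np.exp(0.7j) * np.roll(x, 13))
v2 = max_filter_O2_Cn(z, np.conj(x))
print(v0, v1, v2)
```

The invariance checks confirm that this computes a function of the $G$-orbit of $x$, at a cost of two length-50 FFT correlations per template.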
Interestingly, the principal components of this feature domain appear to be interpretable. The first principal component (given by the horizontal axis) seems to capture the extent to which the district is convex, while the second principal component seems to capture the eccentricity of the district. Six of the ten most gerrymandered districts from [44] are drawn in red, and they appear at relatively extreme points in this feature space. The four remaining districts (NC-1, NC-4, NC-12, and PA-7) were redrawn by court order between the 113th and 116th Congresses; the new versions of these districts are drawn in blue. Unsurprisingly, the redrawn districts are not contorted, and they appear at not-so-extreme points in the feature space.
Example 31 (ECG time series). An electrocardiogram (ECG) uses electrodes placed on the skin to quantify the heart's electrical activity over time. With ECG data, a physician can determine whether a patient has had a heart attack with about 70% accuracy [51]. Recently, Makimoto et al. [51] trained a 6-layer convolutional neural network to perform this task with about 80% accuracy. The dataset they used is available at [14]; it consists of (12+3)-lead ECG data sampled at 1 kHz from 148 patients who recently had a heart attack (i.e., myocardial infarction or MI) and from 141 patients who did not. As an alternative to convolutional neural networks, we use max filters to classify patients based on one second of ECG data (i.e., roughly one heartbeat). In particular, we read the first $t = 1000$ samples of all 15 leads to obtain a $15 \times t$ matrix $X$. We then lift this matrix to a $15 \times w \times (t - w + 1)$ tensor with $w = 30$, where each $15 \times w$ slice corresponds to a contiguous $15 \times w$ submatrix of $X$. Then we normalize each $1 \times w \times 1$ subvector to have mean zero to get the tensor $Y$; this discards any trend in the data. We account for time invariance by taking $G$ to be the order-$(t - w + 1)$ group of circular permutations of the $15 \times w$ slices of $Y$. Using this group, we max filter with $n = 5$ different templates and then classify with a support-vector machine. We constrain each template to be supported on a single slice, and we train these templates together with the support-vector machine classifier by gradient descent to minimize hinge loss.
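The lifting and max filter can be sketched as follows; the function and variable names are ours, and we use random data in place of the ECG recordings:

```python
import numpy as np

def ecg_max_filter(X, z, w=30):
    """Sketch of the lifting in Example 31.

    X : (15, t) array of ECG leads; z : (15, w) template on a single slice.
    """
    leads, t = X.shape
    # lift to a 15 x w x (t - w + 1) tensor of contiguous windows
    Y = np.stack([X[:, a:a + w] for a in range(t - w + 1)], axis=2)
    # center each 1 x w x 1 subvector to discard trends
    Y = Y - Y.mean(axis=1, keepdims=True)
    # the template is supported on one slice, so maximizing over circular
    # permutations of slices reduces to maximizing over slice positions
    return max(float(np.sum(z * Y[:, :, a])) for a in range(t - w + 1))

rng = np.random.default_rng(0)
X = rng.standard_normal((15, 1000))
z = rng.standard_normal((15, 30))
print(ecg_max_filter(X, z))
```

Since the centering step removes constants, adding a constant offset to every lead leaves the output unchanged, matching the trend-discarding behavior described above.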
Following [51], we form the test set by randomly drawing 25 examples from the MI class and 25 examples from the non-MI class, and then we form the training set by randomly drawing 108 non-test-set examples from each class. We perform 10 trials of this experiment, and each trial achieves between 74% and 84% accuracy on the test set. In particular, this is competitive with the 6-layer convolutional neural network in [51] despite accessing only a fraction of the data. In Figure 3, we illustrate what the classifier considers to be the most extreme examples in the test set. The time segments that align best with the trained templates typically cover the heartbeat portion of the ECG signal.
Example 32 (Textures). In this example, we use max filtering to classify various textures, specifically those given in the Kylberg Texture Dataset v. 1.0, available in [49]. This dataset consists of 28 classes of textures, such as blanket, grass, and oatmeal. Each class consists of 160 distinct grayscale images. (We crop these 576 × 576 images down to 256 × 256 for convenience.) A few classifiers in the literature have been trained on this dataset [50,3], but in order to achieve a high level of performance, they leverage domain-specific feature maps, augment the training set with other image sets, or simply initialize with a pre-trained network. As an alternative, we apply max filtering with random features, and we succeed with much smaller training sets. For an honest comparison, we train different classifiers on training sets of different sizes, and the results are displayed in Table 1. (Here, the test set consists of 32 points from each class.) We describe each classifier in what follows.

Figure 3: We train max filter templates and a support-vector machine classifier on electrocardiogram data to distinguish patients who have had a heart attack from those who have not. Above, we plot the most extreme examples in the test set (those with heart attacks on the left, and those without on the right). The blue windows illustrate the time segments of width $w = 30$ that align best with the templates. See Example 31 for details.
First, we consider a convolutional neural network (CNN) with a standard architecture. We pass the 256 × 256 image through a convolutional layer with a 3 × 3 kernel before max pooling over 2 × 2 patches. Then we pass the result through another convolutional layer with a 3 × 3 kernel, and again max pool over 2 × 2 patches. We end with a dense layer that maps to 28 dimensions, one for each class. We train by minimizing cross-entropy loss against the one-hot vector representation of each label. This model has 1.7 million parameters, which enables us to achieve perfect accuracy on the training set.
As an alternative, we apply linear discriminant analysis (LDA). Training this classifier involves a massive matrix operation, and this requires too much memory in our setting when the training set has 128 points per class. This motivates the use of principal component analysis to reduce the dimensionality of each 256 × 256 image to $k = 25$ principal components. The resulting classifier (PCA-LDA) works for larger training sets, and even exhibits improved accuracy when the training set is not too small. For our method, we apply the same PCA-LDA method to a max filtering feature domain. Surprisingly, we find that modding out by the entire group of pixel permutations (i.e., just sorting the pixel values) performs reasonably well as a feature map. Our method implements a multiscale version of this observation. For each $\ell \in \{0, 1, \ldots, 8\}$, one may partition the $2^8 \times 2^8$ image into $4^{8-\ell}$ square patches of width $2^\ell$. We perform max filtering with the group $(S_{4^\ell})^{4^{8-\ell}}$ of all patch-preserving pixel permutations for each $\ell \in \{2, \ldots, 8\}$. For simplicity, we are inclined to draw templates at random, but for computational efficiency, we simulate random templates with the help of Theorem 28. In particular, for each $\ell \in \{2, \ldots, 8\}$ and $n \in \{0, \ldots, 5\}$, we sort the pixels in each $2^\ell \times 2^\ell$ patch, take their inner products with a discretized version of $p_n \circ Q$, and sum over the patches. This maps each $2^8 \times 2^8$ image to a 42-dimensional feature vector. As before, we apply PCA with $k = 25$ and then train an LDA classifier on the result. In the end, this max filtering approach significantly outperforms the other classifiers we considered, as seen in Table 1.
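A sketch of this 42-dimensional feature map, with our own implementation choices: the probabilist's Hermite recurrence stands in for the $p_n$ of Theorem 28, and the stdlib inverse normal CDF plays the role of $Q$.

```python
import numpy as np
from statistics import NormalDist

def hermite_he(n, x):
    # probabilist's Hermite polynomial He_n via the standard recurrence
    h0, h1 = np.ones_like(x), x
    for k in range(1, n):
        h0, h1 = h1, x * h1 - k * h0
    return h0 if n == 0 else h1

def multiscale_features(img):
    """Sketch of the 42-dim feature map in Example 32 (names are ours)."""
    assert img.shape == (256, 256)
    Q = NormalDist().inv_cdf
    feats = []
    for lvl in range(2, 9):                  # patch widths 4, 8, ..., 256
        w = 2 ** lvl
        m = w * w
        grid = np.array([Q(i / (m + 1)) for i in range(1, m + 1)])
        # sort the pixels within each w x w patch
        patches = img.reshape(256 // w, w, 256 // w, w).transpose(0, 2, 1, 3)
        sorted_px = np.sort(patches.reshape(-1, m), axis=1)
        for n in range(6):                   # discretized p_n ∘ Q for n = 0..5
            feats.append(np.sum(sorted_px @ hermite_he(n, grid)))
    return np.array(feats)                   # 7 scales x 6 polynomials = 42

rng = np.random.default_rng(0)
f = multiscale_features(rng.random((256, 256)))
print(f.shape)
```

By construction, permuting the pixels within any 4 × 4 patch leaves every feature unchanged, since it preserves the sorted pixel values of every containing patch at every scale.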

Discussion
Max filtering offers a rich source of invariants that can be used for a variety of machine learning tasks.In this section, we discuss several opportunities for follow-on work.
Separating. In terms of sample complexity, Corollary 13 gives that a generic max filter bank of size $2d$ separates all $G$-orbits in $\mathbb{R}^d$ provided $G \le O(d)$ is finite. If $G \le O(d)$ is not topologically closed, then no continuous invariant (such as a max filter bank) can separate all $G$-orbits. If $G \le O(d)$ is topologically closed, then by Theorem 3.4.5 in [56], $G$ is algebraic, and so Theorem 12 applies. We suspect that max filtering separates orbits in such cases, but progress on this front will likely factor through Problem 15(a). For computational complexity, how well does the max filtering separation hierarchy (1) separate isomorphism classes of weighted graphs? Judging by [2], we suspect that max filtering with a template of order $k$ and treewidth $t$ can be computed with runtime $e^{O(k)} n^{t+1} \log n$, which is polynomial in $n$ when $k$ is logarithmic and $t$ is bounded. Which classes of graphs are separated by such max filters? Also, can max filtering be used to solve real-world problems involving graph data, or is the exponent too large to be practical?
Bilipschitz. The proof of Lemma 20 contains our approach to finding Lipschitz bounds on max filter banks. Our upper Lipschitz bound follows immediately from the fact that each individual max filter is Lipschitz, and it does not require the group to be finite. This bound is optimal for $G = O(d)$, but it is known to be loose for small groups like $G = \{\pm \mathrm{id}\}$. We suspect that our lower Lipschitz bound leaves a lot more room for improvement. In particular, we use the pigeonhole principle to pass from a sum to a maximum so that we can leverage projective uniformity. A different approach might lead to a tighter bound that does not require $G$ to be finite. More abstractly, we suspect that separating implies bilipschitz, though the bounds might be arbitrarily bad; see Problem 19.

Randomness. Our separating and bilipschitz results (Corollary 13 and Theorem 18) are not given in terms of explicit templates. Meanwhile, our Mallat-type stability result (Theorem 22) requires a localized template, and we can interpret the templates used in our weighted graph separation hierarchy (1) as being localized, too. We expect that there are other types of structured templates that would reduce the computational complexity of max filtering (much like the structured measurements studied in compressed sensing [63,59,47] and phase retrieval [10,38,13,40]). It would be interesting to prove separating or bilipschitz results in such settings. More generally, can one construct explicit templates for which max filtering is separating or bilipschitz for a given group? Going the other direction, the reader may have noticed that the plots in Figure 1 deviate from each other at the edges of the interval. We expect that such deviations decay as the dimension grows, but this requires further analysis. How can one analyze the behavior of random max filters for other groups?
Max filtering networks. Example 32 opens the door to a variety of innovations with max filtering. Here, instead of fixing a single group to mod out by, we applied max filtering with a family of different groups. We selected permutations over patches due to the computational simplicity of using Theorem 28, and we arranged the patches in a hierarchical structure so as to capture behavior at different scales. Is there any theory to explain the performance of this architecture? Are there other useful architectures in this vicinity, perhaps by combining with neural networks?
… $d_i$. (It does not depend on the decomposition.)

Definition 11. Given a semialgebraic subgroup $G \le O(d)$, we say the corresponding max filtering map $\langle\!\langle [\cdot], [\cdot] \rangle\!\rangle \colon \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is $k$-strongly separating if for every $x, y \in \mathbb{R}^d$ with $[x] \neq [y]$, it holds that
$$\dim\{z \in \mathbb{R}^d : \langle\!\langle [z], [x] \rangle\!\rangle = \langle\!\langle [z], [y] \rangle\!\rangle\} \le d - k.$$
As an example, consider the case where $G = O(d)$. Then $\langle\!\langle [z], [x] \rangle\!\rangle = \langle\!\langle [z], [y] \rangle\!\rangle$ holds precisely when $\|z\|\|x\| = \|z\|\|y\|$, i.e., $z = 0$ or $[x] = [y]$. Thus, the max filtering map is $d$-strongly separating in this case.

Theorem 12. Consider any semialgebraic subgroup $G \le O(d)$ with $k$-strongly separating max filtering map …

Algorithm 1: Max filtering with a weighted tree template by color coding

Data: Weighted tree with vertex set $[k]$ (in post-order traversal order) and with adjacency matrix $A \in \mathbb{R}^{k \times k}$, and an $(n, k)$-color coding $\{f_i\}_{i=1}^N$
Input: Weighted graph with adjacency matrix $B \in \mathbb{R}^{n \times n}$
Result: Max filter $\langle\!\langle [\tilde{A}], [B] \rangle\!\rangle$, where $\tilde{A}$ …

Case I: $j \neq i$ and $|S_j| \le |S_i|$. Considering (12), it suffices to bound $\langle\!\langle [\mathbf{1}_{S_i}], [\mathbf{1}_{S_j}] \rangle\!\rangle$. Letting $R$ denote the reversal operator defined by $Rf(x) := f(-x)$, …

… $\langle\!\langle [z], [x] \rangle\!\rangle \gtrless \theta$ correctly classifies the latent mixture component of $x$ with probability $1 - o_{n \to \infty; C}(1)$.

Figure 1: (left) Draw $n = 10{,}000$ independent standard gaussian random vectors in $\mathbb{R}^d$ with $d = 1{,}000$, sort the coordinates of each vector, and then plot the top 6 eigenvectors of the resulting sample covariance matrix. (right) Discretize and plot the top 6 eigenfunctions described in Theorem 28.

Figure 2: Visualization of voting districts of the 116th Congress obtained by max filtering and principal component analysis. See Example 30 for details.
where $s_k \colon \mathbb{R}^n \to \mathbb{R}$ returns the $k$th smallest entry of the input. In what follows, we denote $\|\{z_i\}_{i=1}^n\|_F := (\sum_{i=1}^n \|z_i\|^2)^{1/2}$.

Lemma 20. Fix a finite subgroup $G \le O(d)$ and suppose $\{z_i\}_{i=1}^n \in (\mathbb{R}^d)^n$ exhibits $(\frac{n}{|G|^2}, \delta)$-projective uniformity. Then the max filter bank $\Phi \colon \mathbb{R}^d/G \to \mathbb{R}^n$ with templates $\{z_i\}_{i=1}^n$ has lower Lipschitz bound $\delta$ and upper Lipschitz bound $\|\{z_i\}_{i=1}^n\|_F$.

Proof. The upper Lipschitz bound follows from Lemma 2(g): … consider the map $p \colon i \mapsto (g_i, h_i)$, and select $(g, h) \in G^2$ with the largest preimage. By pigeonhole, we have $|p^{-1}(g, h)| \ge \frac{n}{|G|^2} =: k$, and so $\sum_{i=1}^n$ …

Table 1: Accuracy on test set versus size of training set (baseline = 0.035; see Example 32)