Low Dimensional Invariant Embeddings for Universal Geometric Learning

This paper studies separating invariants: mappings on $D$-dimensional domains which are invariant to an appropriate group action, and which separate orbits. The motivation for this study comes from the usefulness of separating invariants in proving universality of equivariant neural network architectures. We observe that in several cases the cardinality of separating invariants proposed in the machine learning literature is much larger than the dimension $D$. As a result, the theoretical universal constructions based on these separating invariants are unrealistically large. Our goal in this paper is to resolve this issue. We show that when a continuous family of semi-algebraic separating invariants is available, separation can be obtained by randomly selecting $2D+1$ of these invariants. We apply this methodology to obtain an efficient scheme for computing separating invariants for several classical group actions which have been studied in the invariant learning literature. Examples include matrix multiplication actions on point clouds by permutations, rotations, and various other linear groups. Often the requirement of invariant separation is relaxed and only generic separation is required. In this case, we show that only $D+1$ invariants are required. More importantly, generic invariants are often significantly easier to compute, as we illustrate by discussing generic and full separation for weighted graphs. Finally, we outline an approach for proving that separating invariants can also be constructed when the random parameters have finite precision.

The goal of many machine learning tasks is to learn an unknown function which has some known symmetries. For example, consider the task of classifying 3D objects which are discretized as point clouds. In this scenario the unknown function maps a point cloud to a discrete set of labels. Depending on the type of 3D objects we are interested in, this function could be invariant to translations, rotations, reflections, scaling and/or affine transformations. It is also typically invariant to permutations: reordering the points does not impact the underlying object.
Another example comes from physics simulations, where we aim to learn a function which takes a point cloud at time t = 0, representing the positions of a collection of physical entities such as particles or planets, and maps them to their positions at time t = 1. This map is equivariant with respect to translations, orthogonal transformations (and Lorentz transformations in the relativistic setting) and permutations: applying these transformations to the input will lead to a corresponding transformation of the output.
In machine learning approaches, an approximation to the unknown function is sought within a parametric hypothesis class of functions. Equivariant machine learning considers hypothesis classes which by construction have some or all of the symmetries of the unknown function. For example, there are multiple neural network architectures for point clouds which are equivariant (or invariant) to permutations [59,63,47] and/or geometric transformations such as rotations [34,55,17], orthogonal transformations [9,51,61], or Lorentz transformations [11,24,28].
Perhaps the most famous result in the theoretical study of general neural networks is the fact that fully connected ReLU networks can approximate any continuous function [45]. Analogously, it is desirable to devise invariant (or equivariant) neural networks which are universal in the sense that they can approximate all continuous invariant (or equivariant) functions. This question has been studied in many works such as [25,62,46,12,22,33,50,39,52].
Universality results for invariant networks typically consider functions which can be written as $f = f_{\text{general}} \circ f_{\text{inv}}$, where V is a Euclidean space, G is a group acting on it, f_inv is a G-invariant continuous function, f_general is continuous, and so f is G-invariant and continuous. If f_general comes from a function space rich enough to approximate all continuous functions (typically a fully connected neural network), and f_inv is a continuous invariant mapping which separates orbits, which means that f_inv(x) = f_inv(y) if and only if x = gy for some g ∈ G, then invariant universality is guaranteed (see e.g., Proposition 1.3). As discussed in [57], orbit separating invariants can also be used to construct universal equivariant networks. Thus the question of invariant and equivariant universality is to a large extent reduced to the related question of invariant separation, which is the focus of this paper.
We note that invariant mappings on V induce well-defined mappings from V/G to some R^m, and separation of invariant mappings is equivalent to injectivity of this induced map. In practice, our mappings will all be continuous, and often, their inverses will also be continuous (though we do not explore this issue in the current paper). For this reason we will informally refer to separating invariant mappings as invariant embeddings.
In the equivariant learning literature, orbit separating invariant functions are typically supplied by classical invariant theory results, which in many cases can characterize all polynomial invariants of a given group action through a finite set of invariant polynomial generators. However, the number of generators is often unrealistically large. For example, the classic PointNet architecture [47] is a permutation invariant neural network which operates on point clouds in R^{d×n}. Proofs of the universality of this network are typically based on its ability to approximate power sum polynomials, which are generators of the ring of permutation invariant polynomials (see e.g., [52,63,38]). However, while in the experimental setup reported in [47], n = 1024, d = 3 and the network creates 1024 invariant features, the number of power sum polynomials is $\binom{n+d}{n}$, which amounts to over 180 million invariant features in this case.
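To make this cardinality concrete, the following minimal pure-Python sketch (the helper `power_sums` is ours, purely for illustration) enumerates the power sum polynomials for a tiny point cloud and checks their permutation invariance; already for d = 2 and n = 4 there are $\binom{n+d}{d} = 15$ of them, and the count grows combinatorially.

```python
import itertools
import math
import random

def power_sums(X, max_deg):
    """All power sum polynomials p_a(X) = sum_j prod_i X[i][j]**a[i],
    one for each exponent vector a with |a| <= max_deg."""
    d, n = len(X), len(X[0])
    feats = []
    for a in itertools.product(range(max_deg + 1), repeat=d):
        if sum(a) > max_deg:
            continue
        feats.append(sum(math.prod(X[i][j] ** a[i] for i in range(d))
                         for j in range(n)))
    return feats

rng = random.Random(0)
d, n = 2, 4
X = [[rng.random() for _ in range(n)] for _ in range(d)]
perm = rng.sample(range(n), n)
Xp = [[row[j] for j in perm] for row in X]  # same cloud, points reordered

f_orig, f_perm = power_sums(X, n), power_sums(Xp, n)
num_feats = len(f_orig)  # binomial(n + d, d) = binomial(6, 2) = 15
```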
There are various mathematical results from different disciplines which suggest that the number of separating invariants need not be larger than roughly twice the dimension of the domain (in particular, in the example discussed above this would mean that only 2 dim(R^{d×n}) = 2n·d = 6144 permutation invariant features would be needed, rather than 180 million). For example, in invariant theory it is known that (for groups acting on affine varieties over algebraically closed fields by automorphisms), while the number of generators of the invariant ring is difficult to control, there are separating sets whose size is 2D + 1, where D is the Krull dimension of the invariant ring [18,20]. Additionally, continuous orbit separating mappings on V can be identified with continuous injective mappings on the quotient space V/G. Thus if V/G is a smooth manifold, Whitney's embedding theorem shows that it can easily be (smoothly) embedded into R^{2D+1}, where D is the dimension of V/G. (With more work, this can be reduced to 2D.) A similar theorem holds under the weaker assumption that V/G is a compact metric space with topological dimension D ([44], page 309). Additionally, it is a common assumption (e.g., [3,53,36]) that the data of interest in a machine learning task typically resides in a 'data manifold' M ⊆ V whose dimension D_M (the 'intrinsic dimension') is significantly smaller than the dimension of V (the 'ambient dimension'). This assumption is at the root of many dimensionality reduction techniques such as auto-encoders, random linear projections, PCA, etc. In this scenario, we expect to achieve orbit separation with only ≈ 2D_M invariants.
Based on the discussion above, we formulate the first goal of this paper.

First Goal: Provide an algorithm for computing ≈ 2D_M separating invariants for group actions on a 'data manifold' of intrinsic dimension D_M.
Figure 1 illustrates our first goal and our discussion so far (full details of the experiment are given in Section 5). We take two lines in high dimensional space, R^{3×n} (colored orange and blue), and apply many random permutations from S_n to the points on both lines, thus obtaining many orange and blue lines. (In (a) we visualize this data projected down to R^3.) In (b) we see the image of these lines under a permutation invariant separating mapping to R^3. Due to invariance, all orange (respectively blue) lines are mapped to a single orange (or blue) curve in R^3. Due to separation, these curves do not intersect. It is possible to achieve separation with only three coordinates since the intrinsic dimension of the lines is D_M = 1. The first row in the tables in Figure 1(c) shows that applying a standard neural network architecture to the invariant curves in (b) can accurately classify points according to the line from which they originated. The other rows show that when the intrinsic dimension D_M of the data is increased, more invariant features are needed, where 2D_M + 1 features typically give accurate results.
Dimensionality reduction by linear projections In Corollary 1.9 we provide a very general tool to achieve our first goal: under some minor assumptions, we show that if a finite set of N separators is known, then the number of separators can be reduced to 2D_M + 1 by taking 2D_M + 1 random linear combinations of the original separators.
While we are not aware of a result as general as Corollary 1.9 in terms of the domains and groups which it can handle, we note that the idea of dimension reduction of separators using random linear projections is not new. It is used in many of the proofs mentioned above which guarantee ≈ 2D_M separators in the algebraic setting (see e.g., [20]), and was used for specific group actions in [6] and [13]. Random projections are also used in the Whitney embedding theorem.
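The linear-projection trick itself is a one-liner; the sketch below (the function name `random_projection` is ours) compresses a long invariant feature vector into k random linear combinations, which remain invariant since linear combinations of invariants are invariant.

```python
import random

def random_projection(features, k, seed=0):
    """Compress a long invariant feature vector into k random linear
    combinations; a linear combination of invariants is invariant."""
    rng = random.Random(seed)
    N = len(features)
    W = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(k)]
    return [sum(w * f for w, f in zip(row, features)) for row in W]

# toy use: N = 20 known invariant features reduced to k = 2*D_M + 1 = 3
feats = [float(j * j) for j in range(20)]
small = random_projection(feats, 3)
```

Note that this reduces the output dimension, but not the cost of computing the original N features, which motivates the second goal below.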
A significant computational drawback of the linear projection technique is that it is still necessary to compute the original large set of separators before the linear projection. Thus the total complexity of computing the separators is not improved by taking linear projections. Accordingly, we formulate our second goal, which is in fact a refinement of the first.

Second goal (refinement of first goal): Provide an efficient, polynomial time algorithm for computing ≈ 2D_M separating invariants for group actions on a 'data manifold' of intrinsic dimension D_M. In our setting we do this by starting with a continuous family of maps which is separating as a whole, and showing that we can separate with only ≈ 2D_M randomly chosen elements from this family.

Main results
Our main result in this paper is a methodology for addressing the second goal and efficiently computing 2D_M + 1 separating invariants for several classical group actions which have been studied in the invariant machine learning community. Our methodology is inspired by the results in [5], which uses tools from real algebraic geometry to show orbit separation in the context of the phase retrieval problem. Our results, formally stated in Theorem 1.7, can be seen as a generalization of these results to general groups. To illustrate the theorem and its usefulness we will give an example which will later be discussed in more detail in Subsection 2.5.
Example 0.1. Let M denote the collection of all d × n matrices which have rank d (assume that d ≤ n). This is a semi-algebraic set of dimension n·d. We consider the action of the group SL(d) of d × d matrices with unit determinant, acting on M by multiplication from the left. A natural way to construct invariants for this group action is to pick a subset I ⊆ {1, . . ., n} of d indices and consider the function X ↦ det(X_I), where X_I is the d × d matrix obtained by choosing the d columns of X indexed by I. In fact, it is known that the functions det(X_I) are generators of the invariant ring, and as a result are separators. The trouble is that the number of subsets of size d, and so the number of generators, is a prohibitive $\binom{n}{d}$. Using Corollary 1.9, we can reduce the number of separators by choosing random vectors w^{(1)}, . . ., w^{(2nd+1)} in $\mathbb{R}^{\binom{n}{d}}$ and considering functions of the form

$$X \mapsto \sum_{|I|=d} w^{(j)}_I \det(X_I), \qquad j = 1, \ldots, 2nd+1. \quad (2)$$

However, computing each one of these invariants still has complexity of $\sim \binom{n}{d}$. Our methodology to improve upon this issue uses a family of invariants parameterized by the continuous 'weight matrix' W: we note that for every W ∈ R^{n×d} the function X ↦ det(XW) is SL(d)-invariant. Additionally, we note that every generator det(X_I) is of the form det(X_I) = det(XW_I) for an appropriate choice of an n × d matrix W_I. In particular this means that the values obtained by det(XW) for all possible W determine X uniquely, up to SL(d) equivalence. Theorem 1.7 shows that in this scenario, if we simply choose random W^{(1)}, . . ., W^{(2nd+1)}, then the functions X ↦ det(XW^{(j)}), j = 1, . . ., 2nd + 1, will be invariant and separating. In contrast to the separating invariants in (2), the complexity of computing each one of these invariants is polynomial in n and d.
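The invariance claim in the example can be checked numerically. The following pure-Python sketch (the helpers `det` and `matmul` are ours) verifies that det(XW) is unchanged when X is multiplied from the left by a random element of SL(d):

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def det(M):
    """Determinant via Gaussian elimination with partial pivoting."""
    M = [row[:] for row in M]
    n, sign = len(M), 1.0
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        if piv != c:
            M[c], M[piv] = M[piv], M[c]
            sign = -sign
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            M[r] = [M[r][j] - f * M[c][j] for j in range(n)]
    result = sign
    for c in range(n):
        result *= M[c][c]
    return result

rng = random.Random(1)
d, n = 3, 6
X = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(d)]
W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]  # n x d weights

# build a random element of SL(d): rescale a random matrix to det = 1
A = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]
dA = det(A)
if dA < 0:                      # flip one row so the determinant is positive
    A[0] = [-a for a in A[0]]
    dA = -dA
A = [[a / dA ** (1.0 / d) for a in row] for row in A]

inv_X = det(matmul(X, W))       # the invariant X -> det(XW)
inv_AX = det(matmul(matmul(A, X), W))
```

Since det(AXW) = det(A) det(XW) and det(A) = 1, the two values agree.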
Informally, Theorem 1.7 can be stated as follows: Suppose we can find a family of invariants p(·, w) parameterized by the continuous 'weight vector' w, such that every orbit pair is separated by some w. Then for almost every random selection of vectors w_i, i = 1, . . ., 2D_M + 1, the invariants x ↦ p(x, w_i) will separate orbits. In a sense, the usefulness of Theorem 1.7 is that it reduces the problem of separating all points with a finite number of invariants to the problem of separating pairs of points with a continuous family of invariants. One could draw a vague analogy here to the usefulness of the Stone-Weierstrass theorem in reducing approximation proofs to separation of pairs of points.
Roughly speaking, the assumptions needed for this theorem are (i) that M is a semi-algebraic set, that is, a set defined by polynomial equality and inequality constraints, and (ii) that the function p(x, w) is a semi-algebraic function. This is a rather large class of functions which includes polynomials, rational functions and continuous piece-wise linear functions.
The usefulness of Theorem 1.7 for obtaining efficient separating invariants was illustrated in Example 0.1. In Section 2 we show similar applications for several other group actions: we find 2D_M + 1 orbit separating invariants for the action of permutations on point clouds which we discussed above. These invariants are continuous piece-wise linear functions obtained as a composition of the 'sort' function with linear functions from the left and right. The complexity of computing each invariant is (up to a logarithmic factor) linear in the ambient dimension. Even when we are interested in separation in all of M = R^{d×n}, so that D_M = n·d, the number of invariants we need for separation is significantly lower than the $\binom{n+d}{n}$ power sum polynomials discussed above. This advantage becomes more pronounced when the 'data manifold' M is low dimensional.
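A minimal sketch of one such sort-based invariant coordinate (the function name `sort_invariant` is ours; the actual construction in Section 2 stacks 2D_M + 1 such coordinates): project each point onto a direction v, sort the resulting n scalars, and take an inner product with a weight vector w. Permuting the points only permutes the projections, which the sort undoes.

```python
import random

def sort_invariant(X, v, w):
    """p(X; v, w) = <w, sort(X^T v)>: a permutation-invariant,
    continuous piece-wise linear function of the point cloud X."""
    d, n = len(X), len(X[0])
    proj = sorted(sum(X[i][j] * v[i] for i in range(d)) for j in range(n))
    return sum(wj * pj for wj, pj in zip(w, proj))

rng = random.Random(0)
d, n = 3, 5
X = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(d)]
perm = rng.sample(range(n), n)
Xp = [[row[j] for j in perm] for row in X]   # permuted copy of the cloud

v = [rng.gauss(0, 1) for _ in range(d)]
w = [rng.gauss(0, 1) for _ in range(n)]
a, b = sort_invariant(X, v, w), sort_invariant(Xp, v, w)
```

Each evaluation costs one matrix-vector product and a sort, i.e., O(nd + n log n).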
Similarly, we construct 2D_M + 1 separating invariants for the actions of orthogonal transformations O(d) and special orthogonal transformations SO(d) on d by n point clouds, which can be compared with the standard separating invariants used for these group actions, which have cardinality of $n^2$ and $\sim n^d$ respectively. We also construct 2D_M + 1 separating invariants for Lorentz transformations (and other isometry groups), the general linear group, and for scaling and translation. For several of the latter results we need to assume that the 'data manifold' M contains only full rank point clouds. A summary of these results, and the complexity of computing each invariant, is given in Table 1.
Generic orbit separation As suggested in several works (e.g., [46,57,49]), the notion of orbit separating invariants can be replaced with the weaker notion of generically orbit separating invariants: that is, invariants defined on M which are separating outside a subset N of strictly smaller dimension. This weaker notion allows for a smaller number of invariants: as we show in Section 3, D_M + 1 generically separating invariants suffice. A more significant computational advantage in settling for generic invariants is that each invariant can usually be computed more efficiently. We exemplify this by considering graph-valued permutation invariant functions: there is no known algorithm to separate graphs (up to permutation equivalence) in polynomial time [27]. Correspondingly, while we can easily use our methodology to find a small number of separating invariants, the computational price of computing these invariants is prohibitive. On the other hand, it is known that for 'generic graphs' [21,2] separating graphs is not difficult, and correspondingly we are able to find a small number of generically separating invariants which can be computed efficiently.

Related work
Phase retrieval and generalizations As mentioned above, our results were inspired by orbit separation results in the phase retrieval literature. Phase retrieval is the problem of reconstructing a signal x in C^n (or R^n), up to a global phase factor, from magnitudes of linear measurements without phase information |⟨x, w_i⟩|, i = 1, . . ., m, where the w_i are complex (or real) n-dimensional vectors. In our terminology, the goal is to find when these measurements are separating invariants with respect to the action of the group of unimodular complex numbers S^1 (or real unimodular numbers {−1, 1}).
In [5] it is shown that m = 2n − 1 random linear measurements in the real setting, or m = 4n − 2 random linear measurements in the complex setting, are sufficient to define a unique reconstruction of the signal up to global phase. In [16], the number of measurements needed for attaining separation in the complex setting is slightly reduced to m = 4n − 4.
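A toy version of the real phase retrieval measurements, using the m = 2n − 1 count from [5]; the sketch only checks the sign invariance of the measurements, not the separation claim itself.

```python
import random

def pr_measurements(x, ws):
    """|<x, w_i>| measurements: invariant to the global sign flip x -> -x."""
    return [abs(sum(xi * wi for xi, wi in zip(x, w))) for w in ws]

rng = random.Random(2)
n = 4
m = 2 * n - 1   # measurement count for the real case, following [5]
x = [rng.gauss(0, 1) for _ in range(n)]
ws = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(m)]

m_x = pr_measurements(x, ws)
m_neg = pr_measurements([-xi for xi in x], ws)
```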
In [23], conjugate phase retrieval is discussed: here the measurement vectors w_i are real, while the signals lie in C^n. These measurements are invariant with respect to global phase multiplication and also complex conjugation, and it is shown that 4n − 6 generic measurements are separating with respect to this group action.
The separation results obtained for real and conjugate phase retrieval are equivalent to saying that on the space of d × n real matrices X, ∼ 2dn generic measurements of the form ∥Xw_i∥ are separating with respect to the action of O(d), for the cases d = 1 (real phase retrieval) and d = 2 (conjugate phase retrieval). In Subsection 2.2 we use our methodology to show that for d ≥ 2, separation with respect to the action of O(d) can be obtained by choosing random measurements of the form ∥Xw_i∥^2, i = 1, . . ., 2nd + 1. We note that this result is in essence not new, as it can be derived easily from results on the recovery of rank one matrices from rank one linear measurements (see [49], Theorem 4.9). There are two natural ways to generalize the separation results of complex phase retrieval: the first is to consider measurements ∥Xw_i∥ with w_i ∈ C^d and X ∈ C^{d×n}. These measurements are invariant with respect to the action of d × d unitary matrices on C^{d×n} by multiplication from the left. In [32] it is shown that 4d(n − d) such generic measurements are separating with respect to the unitary action. For d = 1 this coincides with the 4n − 4 invariants needed for complex phase retrieval.
An alternative generalization of complex phase retrieval is searching for separating invariants for the action of SO(d) on R^{d×n}. Complex phase retrieval gives us separating invariants for the special case d = 2, using the identifications S^1 ≅ SO(2) and C ≅ R^2. In Subsection 2.3 we note that when d > 2, separation is not possible with measurements which involve only norms and linear operations, but can be achieved by adding an additional 'determinant term' to the norm based measurements.
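The following sketch illustrates why a determinant term helps (it illustrates the phenomenon, not the exact construction of Subsection 2.3): the norm measurements ∥Xw∥² are invariant under all of O(d), so they cannot distinguish a cloud from its reflection, while det(XW) is unchanged by rotations and flips sign under reflections. The helpers `matmul`, `det2`, and `norm_sq_feats` are ours.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

rng = random.Random(3)
d, n = 2, 5
X = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(d)]
theta = rng.uniform(0.0, 2.0 * math.pi)
R = [[math.cos(theta), -math.sin(theta)],   # rotation: element of SO(2)
     [math.sin(theta),  math.cos(theta)]]
F = [[1.0, 0.0], [0.0, -1.0]]               # reflection: in O(2) \ SO(2)

def norm_sq_feats(Y, ws):
    """||Y w||^2 features: invariant under the whole orthogonal group."""
    return [sum(c * c for c in (sum(Y[i][j] * w[j] for j in range(n))
                                for i in range(d))) for w in ws]

ws = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(3)]
W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]  # n x d

f_X = norm_sq_feats(X, ws)
f_RX = norm_sq_feats(matmul(R, X), ws)
f_FX = norm_sq_feats(matmul(F, X), ws)

# the 'determinant term': det(YW) is fixed by rotations, negated by F
g_X = det2(matmul(X, W))
g_RX = det2(matmul(matmul(R, X), W))
g_FX = det2(matmul(matmul(F, X), W))
```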
While our results on invariant separation for SO(d) are new (to the best of our knowledge), we feel that the main novelty in this work is in the generalization of the basic real algebraic geometry arguments used in the phase retrieval proofs to a very general class of real group actions (Theorem 1.7) and invariant functions (semi-algebraic mappings), and in the ability to apply this same methodology to multiple group actions (see Table 1).
Finally, we note that the results quoted above show that it is often possible to get generic separation even when the number of invariants is slightly smaller than the 2D_M + 1 promised by our Theorem 1.7. We do not pursue the optimal cardinality for separation in this paper for two reasons. First, this typically requires a case-by-case analysis, and in this paper we try to focus on highlighting a general principle that can be applied to a wide range of settings. Secondly, the number of separating invariants is ultimately not likely to be smaller than the dimension of M/G, and in most applications we are interested in dim(M/G) ≈ dim(M), so that ultimately we do not expect an improvement in separation cardinality by more than a factor of 2.
Permutation Invariant Machine Learning Obtaining O(n·d) separating invariants for the action of the permutation group S_n on R^{d×n} is trivial when d = 1, as discussed in e.g., [58]. When d > 1, a separating set of invariants is suggested in [6], given by a composition of row-wise sorting with linear transformations from the left and the right. They show that this construction is separating when the intermediate dimension is very high (larger than n!), and that whenever separation is achieved, the map is also bi-Lipschitz. We will show that in fact the intermediate dimension need only be 2nd + 1 (or lower when the 'data manifold' has lower dimension), and that the second, larger matrix in their construction can have a certain sparse structure which was previously used in [65] (see Remark 2.3). We note that when d = 2 the sufficiency of ∼ n, and even ∼ log(n), measurements was shown in [40,48].

Recent results Shortly after the first preprint of this manuscript appeared, the authors of [14,42,41] proposed invariants called max filters which are defined for all Hilbert spaces with isometric group actions. For finite dimensional Hilbert spaces these invariants were shown to be separating using our Theorem 1.7. Max filters can thus be suggested as alternative separating invariants to those we suggest in Section 2 for the actions of S_n, O(d) and SO(d), which are all groups of linear isometries. An attractive attribute of this approach is that the same invariant fits all cases, while an advantage of the invariants we present here is that they are more efficient to compute (computing a max filter in these examples requires solving an optimization problem with relatively high complexity).
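A brute-force sketch of a max filter for the S_n action (practical implementations in the works cited above solve a matching problem rather than enumerating permutations; the function name `max_filter` is ours): the invariant is the maximal inner product between X and a template Z over all column permutations of Z.

```python
import itertools
import random

def max_filter(X, Z):
    """max over permutations pi of <X, Z with columns permuted by pi>
    (brute force; feasible only for tiny n)."""
    d, n = len(X), len(X[0])
    best = float("-inf")
    for perm in itertools.permutations(range(n)):
        val = sum(X[i][j] * Z[i][perm[j]]
                  for i in range(d) for j in range(n))
        best = max(best, val)
    return best

rng = random.Random(4)
d, n = 2, 4
X = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(d)]
Z = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(d)]
perm = rng.sample(range(n), n)
Xp = [[row[j] for j in perm] for row in X]   # permuted copy of X

v1, v2 = max_filter(X, Z), max_filter(Xp, Z)
```

Permuting the columns of X only reindexes the candidates in the maximum, so v1 = v2.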
Another recent application of our work is a derivation of (relatively) efficient separating invariants for the joint action of SO(d) × S_n or O(d) × S_n on R^{d×n}. This is described in [30].
Paper organization The structure of the paper is as follows: In Section 1 we provide some mathematical background, and use it to state and prove our main theorem. We then discuss in general terms the ways in which this theorem can be used to devise separating invariants of low complexity for various group actions, and relationships to concepts from invariant theory such as generating invariants and polarization.
In Section 2 we describe several applications of the theorem, showing in several examples of interest how a low-dimensional set of separating invariants, which can be computed efficiently, can be obtained using our methodology. These examples include point clouds acted on by permutation matrices via multiplication from the right, or by orthogonal transformations, rotations, volume-preserving linear transformations, and general linear transformations via multiplication from the left. We also discuss the more trivial scaling and translation actions.
In Section 3 we discuss generic separation. We show that generic separation can be obtained using only D_M + 1 invariants, and show that generically separating invariants can be computed efficiently for weighted graphs, while full separation is unlikely to be efficient due to the fact that the graph isomorphism problem has no known polynomial time algorithm.
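As a cheap illustration of invariants that separate graphs only generically (this is an illustration, not the construction used in Section 3), the traces tr(A^k) of powers of a weighted adjacency matrix are invariant under vertex relabeling and determine the spectrum, which separates 'most' graphs but famously not all (cospectral pairs exist). The helpers `matmul` and `trace_powers` are ours.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def trace_powers(A, kmax):
    """tr(A), tr(A^2), ..., tr(A^kmax): invariant under relabeling
    A -> P^T A P since the trace is invariant under conjugation."""
    out, M = [], A
    for _ in range(kmax):
        out.append(sum(M[i][i] for i in range(len(M))))
        M = matmul(M, A)
    return out

rng = random.Random(5)
n = 5
A = [[0.0] * n for _ in range(n)]          # random weighted graph
for i in range(n):
    for j in range(i, n):
        A[i][j] = A[j][i] = rng.random()

perm = rng.sample(range(n), n)             # relabel the vertices
B = [[A[perm[i]][perm[j]] for j in range(n)] for i in range(n)]

t_A, t_B = trace_powers(A, n), trace_powers(B, n)
```

Each trace costs one matrix product, so the full feature vector is polynomial time, in contrast to full graph separation.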
In Section 4 we give an outline of an argument showing that separation can also be obtained when the parameters of the separating invariants we consider have finite precision. This argument is applicable only to some of the polynomial invariants we consider here. Finally, in Section 5 we provide some initial experiments, showing that a simple permutation invariant classification problem on point clouds in R^{d×n} with high ambient dimension n·d and low intrinsic dimension D_M can be efficiently solved using 2D_M + 1 of our separating invariants, as our theory predicts.

Definitions and main theorem

Notation
We denote matrices X in R^{d×n} by capital letters and refer to them as point clouds. The columns of a point cloud are denoted by lowercase letters, X = (x_1, x_2, . . ., x_n). We use 1_n to denote the constant vector (1, 1, . . ., 1) ∈ R^n, and e_i ∈ R^n to denote the n-dimensional vector with 1 in the i-th coordinate and zero in the remaining coordinates.

Mathematical background
We begin this section by explaining how continuous separating invariant functions are used to characterize all continuous invariant functions. We then lay out some definitions we need for our discussion and prove our main theorem regarding the construction of low-dimensional continuous separating invariant functions (Theorem 1.7).

Universality and orbit separation
We begin by defining invariant functions and orbit separation.

Definition 1.1. Let G be a group acting on a set M, let Y be a set, and let f : M → Y. We say that f is invariant if f(x) = f(gx) for all g ∈ G, x ∈ M. We say that a subset M′ of M is stable under the action of G if gm ∈ M′ for all g ∈ G and m ∈ M′.

Definition 1.2. Let G be a group acting on a set M, let Y be a set, and let f : M → Y be an invariant function. We say that f separates orbits if f(x) = f(y) implies that x = gy for some g ∈ G. We say that a finite collection of invariant functions f_i : M → Y, i = 1, . . ., N, separates orbits if the concatenation (f_1(x), . . ., f_N(x)) separates orbits.
The following proposition, proved in the appendix, shows that every continuous invariant function can be written as a composition of an orbit-separating continuous invariant function with a continuous (non-invariant) function.
Proposition 1.3. Let M be a topological space, and G a group which acts on M. Let K ⊂ M be a compact set, and let f_inv : M → R^N be a continuous G-invariant map that separates orbits. Then for every continuous invariant function f : M → R there exists some continuous f_general : R^N → R such that $f(x) = f_{\text{general}}(f_{\text{inv}}(x))$ for all $x \in K$.

Somewhat more complicated characterizations of equivariant functions via separating invariants are described in [57]. Now that we have established the importance of continuous separating invariants for approximation of continuous invariant (or equivariant) functions, we will exclusively focus on the topic of finding a small collection of such continuous separating invariant functions. Our technique for doing so relies on several concepts from real algebraic geometry which we will now introduce.
Real algebraic geometry Unless stated otherwise, the background on real algebraic geometry presented here is from [8].
Definition 1.4 (Semi-algebraic sets). A real semi-algebraic set in R^k is a finite union of sets of the form

$$\{x \in \mathbb{R}^k \mid p_i(x) = 0 \text{ and } q_j(x) > 0 \text{ for } i = 1, \ldots, N \text{ and } j = 1, \ldots, m\}$$

where the p_i and q_j are multivariate polynomials with real coefficients.
Semi-algebraic sets are closed under finite unions, finite intersections, and complements. We next define semi-algebraic functions.

Definition 1.5 (Semi-algebraic functions). Let S ⊆ R^ℓ and T ⊆ R^k be semi-algebraic sets. A function f : S → T is semi-algebraic if its graph is a semi-algebraic subset of R^{ℓ+k}.

Polynomials f : R^ℓ → R^k are obviously semi-algebraic functions. Similarly, given two polynomials p_1, p_2 : R^ℓ → R, the rational function q(x) = p_1(x)/p_2(x) is well-defined and semi-algebraic on the semi-algebraic set where p_2 does not vanish. In addition, assume we are given a collection of semi-algebraic sets S_1, . . ., S_n ⊆ R^ℓ whose union is all of R^ℓ, and a function f whose restriction to each S_i is a polynomial f_i. We call such functions piece-wise polynomial functions. Piece-wise polynomials are semi-algebraic functions, since their graph is the finite union of the semi-algebraic graphs of the restrictions f_i |_{S_i}. In particular, this class of functions includes ReLU neural networks and the sorting function we will use in Subsection 2.1, which are continuous piece-wise linear functions. Piece-wise linear functions are a special case of piece-wise polynomial functions where each semi-algebraic set S_i is a closed convex polyhedron, and each f_i is an affine function.

Stratification and dimension A semi-algebraic set S can be written as a finite union of pairwise disjoint sets S_1, . . ., S_n such that each S_i is a C^∞ manifold of dimension r_i, and the closure of each S_i in S contains only S_i itself and sets S_j with r_j < r_i. This decomposition is called a stratification (see [8], page 177). The dimension of S is the maximal dimension max_{1≤i≤n} r_i of all the manifolds in the decomposition (this definition of dimension can be shown to be independent of the stratification chosen).
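The claim that ReLU layers are piece-wise linear can be checked directly: on the region of inputs sharing an activation pattern, the layer agrees with the affine map obtained by zeroing the inactive rows. A small sketch (the helper `relu_layer` is ours):

```python
import random

def relu_layer(W, b, x):
    """One ReLU layer: max(0, Wx + b), applied coordinate-wise."""
    return [max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

rng = random.Random(6)
W = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(4)]
b = [rng.gauss(0, 1) for _ in range(4)]
x = [rng.gauss(0, 1) for _ in range(3)]

y = relu_layer(W, b, x)

# the activation pattern of x selects one linear piece; on that piece the
# layer equals the affine map with the inactive rows zeroed out
pre = [sum(wi * xi for wi, xi in zip(row, x)) + bi
       for row, bi in zip(W, b)]
affine = [s if s > 0 else 0.0 for s in pre]
```

Each activation pattern carves out a convex polyhedron (a semi-algebraic set defined by linear inequalities), and on it the layer is affine, matching Definition 1.5.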
Figure 2 shows a stratification of a semi-algebraic set S. The set is shown on the left of the figure, and the stratification is visualized on the right. It includes a single two-dimensional open disc, eight open curves (dimension 1), and four points (dimension 0). Hence the dimension of S is two.

Families of invariant separators
We now introduce some definitions needed for the discussion of group actions and separation within a real algebraic geometry framework.
Assume that G is a group acting on a set M. The orbit of x ∈ M under the action of G is the set [x] = {gx : g ∈ G}. When y is in the orbit of x we use the notation x ∼_G y, and when it is not in the orbit of x we use the notation x ̸∼_G y.
Definition 1.6. Let G be a group acting on a semi-algebraic set M, and let D_w be an integer greater than or equal to one. We say that a semi-algebraic function p : M × R^{D_w} → R is a family of G-invariant semi-algebraic functions if p(·, w) is G-invariant for every fixed w ∈ R^{D_w}.

We say a family of G-invariant semi-algebraic functions separates orbits in M if for all x, y ∈ M such that x ̸∼_G y there exists w ∈ R^{D_w} such that p(x, w) ̸= p(y, w).
We say a family of G-invariant semi-algebraic functions strongly separates orbits in M if for all x, y ∈ M such that x ̸∼_G y, the set of 'bad' weights W_(x,y) = {w ∈ R^{D_w} : p(x, w) = p(y, w)} has dimension at most D_w − 1. We note that if p(x, w) is polynomial in w for every fixed x, then separation implies strong separation, since the set of zeros of a polynomial which is not identically zero is always dimensionally deficient.
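A quick numerical illustration of this last remark, using the real phase retrieval family p(x, w) = ⟨x, w⟩², which is a polynomial in w: for a fixed non-equivalent pair x, y the bad weights form a union of two hyperplanes, so a random w separates the pair almost surely.

```python
import random

def p(x, w):
    """p(x, w) = <x, w>^2: invariant to the sign action x -> -x,
    and a polynomial in w for every fixed x."""
    return sum(xi * wi for xi, wi in zip(x, w)) ** 2

rng = random.Random(7)
x = [1.0, 2.0, 3.0]
y = [1.0, 2.0, -3.0]   # y is neither x nor -x, so some w separates them

# the bad set {w : <x,w>^2 = <y,w>^2} = {w : <x-y,w><x+y,w> = 0} is a
# union of two hyperplanes, hence has measure zero in R^3
hits = 0
for _ in range(100):
    w = [rng.gauss(0, 1) for _ in range(3)]
    if abs(p(x, w) - p(y, w)) > 1e-9:
        hits += 1
```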

Main Theorem
We now have all we need to state our main theorem:

Theorem 1.7. Let G be a group acting on a semi-algebraic set M of dimension D_M, and let p : M × R^{D_w} → R be a family of G-invariant semi-algebraic functions which strongly separates orbits in M. Then for Lebesgue almost every w_1, . . ., w_{2D_M+1} ∈ R^{D_w}, the invariants x ↦ p(x, w_i), i = 1, . . ., 2D_M + 1, separate orbits in M.

The remainder of this subsection is devoted to proving this theorem. At a first reading we recommend skipping to Subsection 1.4 at this point.
We begin by recalling some additional real algebraic geometry facts we will need for the proof, also taken from [8]. We first recall some basic properties of real algebraic dimension:

1. If S ⊆ T are semi-algebraic sets then dim(S) ≤ dim(T).

2. The dimension of a finite union of semi-algebraic sets is the maximal dimension of the sets in the union.

3. If S is a semi-algebraic set and f is a semi-algebraic function then dim(f(S)) ≤ dim(S). If f is a diffeomorphism then we have equality dim(f(S)) = dim(S).
4. If A ⊆ R ℓ is a semi-algebraic set of dimension strictly smaller than ℓ then it has Lebesgue measure zero.
Another useful fact we will use is that the projection of a semi-algebraic set is also a semi-algebraic set.
We next state and prove the following lemma.

Lemma 1.8. Let S ⊆ R^{D_1} be a semi-algebraic set and f : R^{D_1} → R^{D_2} a polynomial. Assume that for all t ∈ f(S) we have dim(f^{-1}(t)) ≤ ∆. Then dim(S) ≤ dim(f(S)) + ∆.

Proof. Denote ∆_S = dim(S). Let S_i, i = 1, . . ., N be a stratification of S. Without loss of generality, assume dim(S_1) = ∆_S. Because of this equality, we will be able to argue about ∆_S by arguing only about dim(S_1). Fix some s_0 ∈ S_1 at which the differential of f|_{S_1} attains its maximal rank r. The set of s ∈ S_1 whose differential has rank r is open, and so there is a neighborhood of s_0 on which f has constant rank. By the rank theorem [37], f is locally a projection: this means that there exist a diffeomorphism ψ which maps an open set U with s_0 ∈ U ⊆ S_1 onto (0, 1)^{∆_S}, and a diffeomorphism ϕ : V → R^{D_2}, where V is open in R^{D_2} and contains f(U), such that the function f̃ = ϕ ∘ f ∘ ψ^{-1} is a projection:

$$\tilde{f}(s_1, s_2, \ldots, s_r, \ldots, s_{\Delta_S}) = (s_1, s_2, \ldots, s_r, 0, 0, \ldots, 0), \qquad \forall (s_1, \ldots, s_{\Delta_S}) \in (0, 1)^{\Delta_S}.$$
For the projection f̃ we have, for every t in the image of f̃, the equality dim(f̃^{-1}(t)) = Δ_S − r, and hence Δ_S = dim(image of f̃) + (Δ_S − r). We can now get our result by exploiting the relationship between f̃ and f. Since diffeomorphisms preserve dimension, the fibers of f restricted to U have the same dimension as the fibers of f̃, so Δ_S − r ≤ Δ; and since f̃ and the restriction of f to U have the same image up to a diffeomorphism, r = dim(image of f̃) ≤ dim(f(S)). Plugging the last two inequalities into the identity above concludes the proof.
We can now prove Theorem 1.7.Our bundle-based proof presented below was inspired by ideas in [5].
Proof of Theorem 1.7. For m ∈ N, consider the semi-algebraic 'bad set'

B_m = {(x, y, w_1, . . ., w_m) ∈ M × M × (R^{D_w})^m | x ̸∼_G y and p(x, w_j) = p(y, w_j), ∀j = 1, . . ., m},

and let π_W denote the projection of B_m onto the coordinates W = (w_1, . . ., w_m). The set π_W(B_m) is precisely the set of w_1, . . ., w_m which are not separating. Our goal is to show that, when m is big enough, the dimension of π_W(B_m) is less than mD_w, and so it has Lebesgue measure zero.

Let us start by bounding the dimension of B_m. For every (x, y) in the projection of B_m onto M × M, the fiber over (x, y) is W_{(x,y)}^m, where

W_{(x,y)} = {w ∈ R^{D_w} | p(x, w) = p(y, w)}.

By assumption p is strongly separating and thus dim(W_{(x,y)}) ≤ D_w − 1, so every fiber has dimension at most m(D_w − 1). By Lemma 1.8,

dim(B_m) ≤ 2D_M + m(D_w − 1) = mD_w + (2D_M − m) ≤ mD_w − 1,

where the last inequality uses m ≥ 2D_M + 1. Since applying π_W to B_m can only decrease its dimension, we obtain dim(π_W(B_m)) ≤ mD_w − 1 as required.

Using the main theorem
The goal of this subsection is to describe in general terms how Theorem 1.7 can be used to obtain low-dimensional orbit-separating invariants which can be computed efficiently. In the next section we will apply Theorem 1.7 to find a small, efficiently computed collection of separating invariants for several classical group actions, many of which have been studied in the context of invariant machine learning. These results will be presented as a case-by-case elementary analysis, which requires only a combination of Theorem 1.7 with elementary linear algebra arguments. The purpose of this subsection is to provide a general explanation of the results in the next section, based on known results from invariant theory.
A methodological application of Theorem 1.7 can be achieved by searching for polynomial invariants, and using known results from classical invariant theory, which studies these invariants. In particular, for the classical group actions we discuss in the next section, we typically have an available First Fundamental Theorem (FFT) for the group action: that is, a finite set of invariant polynomials f_1, . . ., f_N, called generators, such that for every invariant polynomial p there exists a polynomial q : R^N → R such that p = q(f_1, . . ., f_N). The generators of the invariant polynomial ring are algebraic separators¹, that is, any two distinct orbits which can be separated by some invariant polynomial will be separated by one of the generators. Let us for now assume that on our 'data manifold' M the generators do indeed separate orbits (this is often, but not always, the case; we will return to this issue in a few paragraphs). Typically, the cardinality N of the generators is much larger than what we would like. The easiest (but not recommended) method for obtaining a smaller collection of polynomials is to start with a generating set (or some other possibly large known set of semi-algebraic separating invariants) and apply a linear projection.
Corollary 1.9. Let G be a group acting on a semi-algebraic set M of dimension D_M. Assume that f_i : M → R, i = 1, . . ., N are semi-algebraic mappings which separate orbits. Then for almost all w^{(1)}, . . ., w^{(2D_M+1)} ∈ R^N, the functions

x ↦ Σ_{i=1}^N w_i^{(j)} f_i(x), j = 1, . . ., 2D_M + 1

separate orbits.

Proof. Since p(x, w) = Σ_{i=1}^N w_i f_i(x) is polynomial in w for fixed x, if we can show that p is a family which separates orbits then it also strongly separates orbits, and so Theorem 1.7 gives us separation. Given x and y in M whose orbits do not intersect, we know that there is some i such that f_i(x) ≠ f_i(y), and so p(x, w = e_i) ≠ p(y, w = e_i).
The dimensionality reduction technique described in Corollary 1.9 is essentially a random linear projection from R^N to R^{2D_M+1}. This method was used for generating a small number of separating invariants in [6] and [13], and is at the heart of the proofs mentioned earlier for the existence of a small set of separating invariants (see [31,20]). From a computational perspective this approach is sub-optimal, as it requires a full computation of all N separating invariants as an intermediate step.
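As a toy illustration of the random-projection mechanism of Corollary 1.9 (the setting and the choice of power sums as the redundant invariant family are ours, not the paper's):

```python
# Sketch of Corollary 1.9: reduce a redundant set of N separating invariants
# to 2*D_M + 1 random linear combinations. Toy setting: S_3 acting on R^3 by
# permutation, with the first N = 10 power sums as the redundant family.
import numpy as np

def power_sums(x, N=10):
    """f_i(x) = sum_j x_j^i for i = 1..N (each is permutation invariant)."""
    return np.array([np.sum(x ** i) for i in range(1, N + 1)])

rng = np.random.default_rng(0)
D_M = 3                                      # dimension of the data set R^3
W = rng.standard_normal((2 * D_M + 1, 10))   # random projection R^10 -> R^7

def projected_invariants(x):
    # 2*D_M + 1 random linear combinations of the original invariants
    return W @ power_sums(x)

x = np.array([1.0, 2.0, 3.0])
y = x[[2, 0, 1]]                  # same S_3 orbit as x
z = np.array([1.0, 2.0, 4.0])     # a different orbit

print(np.allclose(projected_invariants(x), projected_invariants(y)))  # True
print(np.allclose(projected_invariants(x), projected_invariants(z)))  # False
```

Points in the same orbit always agree after projection; distinct orbits stay separated for almost every choice of the projection matrix.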
In many of the examples we discuss in the next section, a significantly more efficient approach is provided by the fact that the generating invariants f_1, . . ., f_N are obtained from a single invariant via polarization. In our context polarization can be described as follows: assume that G is a subgroup of GL(R^d) acting on R^{d×n} and R^{d×n′} by multiplication from the left. If f : R^{d×n′} → R is G-invariant, then we can combine f and any linear W ∈ R^{n×n′} to create invariants on R^{d×n} of the form

p(X, W) = f(XW), X ∈ R^{d×n}, W ∈ R^{n×n′}.

If our original generating invariants f_1, . . ., f_N were all of the form f_i(X) = f(XW_i), i = 1, . . ., N, then p is a separating family of semi-algebraic mappings, and so we obtain 2D_M + 1 separating invariants p(X, W_i), i = 1, . . ., 2D_M + 1 without needing to compute all of the original generators. For more on the relationship between polarization and separation see [19].
Algebraic separation vs. orbit separation. We now return to a question we touched upon previously: when do invariant polynomials, and thus the generators, separate orbits? In general, a group action can have two distinct orbits which cannot be separated by any invariant polynomial. The main obstruction is that a continuous function which is constant on G orbits is also constant on the orbits' closures. Thus two orbits which do not intersect cannot be separated if their closures do intersect. The classical example is the action of G = {x ∈ R | x > 0} on R via multiplication. This action has three orbits: the positive numbers, the negative numbers, and zero. The closures of these orbits all contain zero, and hence the only invariant functions which are continuous on all of R are the constant functions. We will find similar issues occurring in the next section for the action on R^{d×n} by scaling or by multiplication from the left by GL(R^d): in both cases there are no non-constant invariant continuous functions, and thus we will rely on separating rational invariants for these examples (which are not continuous on all of R^{d×n}, since they have singularities).
The scaling group and GL(R^d) are open subsets of Euclidean spaces. For compact groups, the orbits of the group action are compact and thus equal to their closures, and so the closures of disjoint orbits remain disjoint. For closed (non-compact) groups acting on R^{d×n}, d ≤ n, the orbit of every full rank matrix X under the group action is homeomorphic to G and thus closed. Thus for such closed non-compact groups, orbit separation and algebraic separation are identical on the space of d by n full rank matrices, which we denote by R^{d×n}_{full}. As we will see, when X is not full rank its orbit's closure will often intersect other orbits. In the examples in the next section, we will achieve separation of orbits on all of R^{d×n} for actions of compact groups, and separation of orbits on R^{d×n}_{full} for actions of closed non-compact groups. That is, when we can guarantee that orbit closures do not intersect, we are able to achieve orbit separation by polynomials. Indeed, for complex linear reductive groups, orbits whose closures do not intersect can always be separated by polynomials [18]. These results can be adapted to achieve the separation results we show here: the real groups we discuss are subgroups of complex linear reductive groups, and they share the same set of generators. As such, separation of orbits for the complex groups implies separation for the real subgroups.
We stress again that in practice, the proofs we use for separation of our continuous families of functions rely only on elementary linear algebra, and not on the first fundamental theorem and the other invariant theory results noted above. We discuss these results in the next section.

Separating invariants for point clouds
In this section we will use Theorem 1.7 to obtain a collection of 2D_M + 1 separating invariants (or D_M + 1 generically separating invariants) on the data manifold M ⊆ R^{d×n}, for several classical group actions which are of interest in the context of invariant machine learning. For non-compact group actions we will need to assume that M contains only full rank matrices. The group actions we consider are multiplication by permutation matrices from the right, and multiplication from the left by: orthogonal transformations, generalized orthogonal transformations, special orthogonal transformations, volume preserving transformations, or general linear transformations. We will also show that these group actions can be combined with translation and scaling at no additional cost. The complexity of computing the invariants is rather moderate, as can be seen in Table 1, which summarizes the results of this section.

Permutation invariance
We begin by considering the action of the group of permutations of n points, denoted by S_n, on R^{d×n} by swapping the order of the points. This group action has been studied extensively in the recent invariant learning literature (e.g., [47,52,58,63]). In particular, the approach we suggest here is strongly related to recent results obtained in [6]. This relationship will be discussed in Remark 2.3.
Let us first discuss the simple case d = 1. Interestingly, in this case the ring of polynomial invariants on R^{1×n} is generated by only n invariants, known as the elementary symmetric polynomials. An alternative choice of generators ([35], exercise 8), which can be computed more efficiently, are the power sum polynomials ϕ_k(x) = Σ_{i=1}^n x_i^k, k = 1, . . ., n.
Let Φ : R^n → R^n denote the map whose coordinates are the power sum polynomials, that is, Φ(x) = (ϕ_1(x), . . ., ϕ_n(x)). It is known that the power sum polynomials separate orbits (for an elementary proof see [63]). An alternative way of achieving n-dimensional separation is by sorting: let sort : R^n → R^n be the map which sorts a vector in ascending order. This map is invariant to permutations and separates orbits. It is a continuous piece-wise linear map (and so a semi-algebraic map), but it is not a polynomial. Note that sort(x) can be computed in O(n log(n)) operations, while computing Φ(x) requires O(n²) operations. Additionally, sorting has been successfully used for permutation-invariant machine learning [10,64], while power sum polynomials are discussed as a theoretical tool [63,52] but are not used in practice. Finally, in [6] it is shown that sort is an isometry (with respect to the Euclidean metric on the output space and a natural metric on the input quotient space R^n/S_n), while Φ is not even Bi-Lipschitz.
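A minimal sketch contrasting the two 1-dimensional separators discussed above, sort and the power sum map Φ:

```python
# The two d = 1 separating invariants side by side: sorting (O(n log n))
# and the power sum map Phi (O(n^2)). Minimal numpy sketch.
import numpy as np

def Phi(x):
    """Power sum map: Phi(x) = (sum x_j, sum x_j^2, ..., sum x_j^n)."""
    n = len(x)
    return np.array([np.sum(x ** k) for k in range(1, n + 1)])

x = np.array([3.0, 1.0, 2.0])
y = np.array([2.0, 3.0, 1.0])    # a permutation of x
z = np.array([3.0, 1.0, 2.5])    # not a permutation of x

# Both maps are S_n-invariant ...
print(np.allclose(np.sort(x), np.sort(y)), np.allclose(Phi(x), Phi(y)))  # True True
# ... and both separate distinct orbits
print(np.allclose(np.sort(x), np.sort(z)), np.allclose(Phi(x), Phi(z)))  # False False
```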
For d > 1, separation by polynomials is achievable by the multi-dimensional power sum polynomials, defined for multi-indices α = (α_1, . . ., α_d) ∈ N^d with |α| ≤ n as

p_α(X) = Σ_{i=1}^n x_i^α, where x_i^α = Π_{j=1}^d x_i(j)^{α_j} and x_i denotes the i-th column of X.

The multi-dimensional power sum polynomials are also generators of the invariant ring. They are used in many papers which prove universality for permutation-invariant constructions [22,52,63]. However, the number of power sum polynomials is the binomial coefficient (n+d choose d): when d > 1 and n ≫ d this is significantly larger than the dimension of R^{d×n}.
Generalizing the success of sort in separating orbits to the case d > 1 is less straightforward: one can consider lexicographical sorting, which separates orbits but is not continuous. An alternative generalization is to sort each row independently; this gives a continuous mapping, but it does not separate orbits. We now use Theorem 1.7 to propose a low-dimensional set of invariants for the case d > 1, by polarizing a d = 1 separating invariant mapping Ψ (which could, for example, be sort or the 1-dimensional power sum map Φ):

Proposition 2.1. Let M be a semi-algebraic subset of R^{d×n} of dimension D_M, which is stable under the action of S_n by multiplication from the right. Let Ψ : R^n → R^n be a permutation invariant semi-algebraic mapping which separates orbits, and denote

f(X, w^{(1)}, w^{(2)}) = ⟨w^{(2)}, Ψ(X^T w^{(1)})⟩, X ∈ R^{d×n}, w^{(1)} ∈ R^d, w^{(2)} ∈ R^n.

If m ≥ 2D_M + 1, then for Lebesgue almost every (w^{(1)}_1, w^{(2)}_1), . . ., (w^{(1)}_m, w^{(2)}_m) ∈ R^d × R^n, the functions f(·, w^{(1)}_i, w^{(2)}_i), i = 1, . . ., m are separating with respect to the action of S_n.
Proof. The permutation invariance of f for every fixed choice of parameters follows from the invariance of Ψ. By Theorem 1.7 it is sufficient to show that the family of semi-algebraic invariant mappings f strongly separates orbits. Fix some X, Y ∈ R^{d×n} with disjoint S_n orbits. We need to show that the dimension of the semi-algebraic set

B = {(w^{(1)}, w^{(2)}) ∈ R^d × R^n | f(X, w^{(1)}, w^{(2)}) = f(Y, w^{(1)}, w^{(2)})}

is strictly smaller than n + d. Since X cannot be re-ordered to be equal to Y, it follows that the set

B_1 = {w^{(1)} ∈ R^d | X^T w^{(1)} is equal to Y^T w^{(1)} up to reordering}

has dimension at most d − 1. Thus it is sufficient to show that the set

B̃ = {(w^{(1)}, w^{(2)}) ∈ R^d × R^n | f(X, w^{(1)}, w^{(2)}) = f(Y, w^{(1)}, w^{(2)}) and w^{(1)} ̸∈ B_1}

has dimension ≤ n + d − 1. For fixed w^{(1)} ̸∈ B_1, the orbit separation of Ψ implies that Ψ(X^T w^{(1)}) ≠ Ψ(Y^T w^{(1)}), and so the set of w^{(2)} for which ⟨w^{(2)}, Ψ(X^T w^{(1)})⟩ = ⟨w^{(2)}, Ψ(Y^T w^{(1)})⟩ has dimension n − 1. Denoting by π the projection of B̃ onto the first coordinate, this means that for every w^{(1)} ∈ π(B̃) we have dim(π^{-1}(w^{(1)})) = n − 1, and from Lemma 1.8 this implies

dim(B̃) ≤ dim(π(B̃)) + n − 1 ≤ d + n − 1.

Thus f is strongly separating, which concludes the proof.
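The invariants of Proposition 2.1 with Ψ = sort can be sketched as follows (the dimensions and random seed are illustrative choices of ours):

```python
# Invariants f(X, w1, w2) = <w2, sort(X^T w1)>: separating for the S_n action
# on the columns of X, for almost every choice of the m = 2*d*n + 1 parameter
# pairs (here M = R^{d x n}, so D_M = d*n).
import numpy as np

def f(X, w1, w2):
    return w2 @ np.sort(X.T @ w1)

def invariants(X, params):
    return np.array([f(X, w1, w2) for (w1, w2) in params])

d, n = 2, 4
rng = np.random.default_rng(1)
m = 2 * d * n + 1
params = [(rng.standard_normal(d), rng.standard_normal(n)) for _ in range(m)]

X = rng.standard_normal((d, n))
Y = X[:, rng.permutation(n)]                 # same orbit: columns reordered
Z = X + 0.1 * rng.standard_normal((d, n))    # a different orbit

print(np.allclose(invariants(X, params), invariants(Y, params)))  # True
print(np.allclose(invariants(X, params), invariants(Z, params)))  # False
```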
We conclude this subsection with some remarks on the significance of this result in the context of the existing literature. Firstly, we note that characterizations of permutation-invariant mappings on R^{d×n} which use separating mappings of the form

X ↦ Σ_{j=1}^n F(x_j), F : R^d → R^m,

are common in the literature investigating the expressive power of neural networks for sets and graphs (see for example Lemma 5 in [43]). However, these are typically based on the multivariate power sum polynomials, so that the output dimension of F is the unrealistically high (n+d choose d), as discussed above. In contrast, we can obtain separation on all of R^{d×n} with 2nd + 1 invariants, or an even smaller number of invariants when restricting to a lower dimensional S_n stable set M, by choosing Ψ to be the univariate power sum polynomial mapping Φ defined in (6):

Corollary 2.2. Let M be a semi-algebraic subset of R^{d×n} of dimension D_M, which is stable under the action of S_n by multiplication from the right. Then there exists a polynomial mapping F : R^d → R^{2D_M+1} such that the mapping X ↦ Σ_{j=1}^n F(x_j) is invariant and separating.
Proof of Corollary 2.2. Denote Φ̃(t) = (t, t², . . ., t^n), so that Φ(v) = Σ_{j=1}^n Φ̃(v_j) for v ∈ R^n. Taking Ψ = Φ in Proposition 2.1, we obtain that for m = 2D_M + 1 and Lebesgue almost every choice of parameters, the mapping X ↦ (f(X, w^{(1)}_i, w^{(2)}_i))_{i=1}^m is invariant and separating. Note that the i-th coordinate of this map is given by

f(X, w^{(1)}_i, w^{(2)}_i) = ⟨w^{(2)}_i, Φ(X^T w^{(1)}_i)⟩ = Σ_{j=1}^n F_i(x_j),

where we define F_i : R^d → R by F_i(x) = ⟨w^{(2)}_i, Φ̃(x^T w^{(1)}_i)⟩.

Thus the mapping X ↦ Σ_{j=1}^n F(x_j), with F = (F_i)_{i=1}^m, is invariant and separating.
Remark 2.3. When choosing Ψ = sort in the formulation of Proposition 2.1, we obtain invariants which are closely related to those discussed in [6]. To describe the results in that paper and the relationship to our results, let us first rewrite our results with Ψ = sort in matrix notation. Denote by colsort : R^{n×m} → R^{n×m} the continuous piece-wise linear function which independently sorts each of the m columns of an n × m matrix in ascending order. Let A ∈ R^{d×m} be a matrix whose m columns are w^{(1)}_1, . . ., w^{(1)}_m, and let B ∈ R^{n×m} be a matrix whose m columns are w^{(2)}_1, . . ., w^{(2)}_m. Proposition 2.1 can be restated in matrix form as saying that on M = R^{d×n}, for m = 2nd + 1 and Lebesgue almost every A, B, the mapping L_B ∘ β_A is invariant and separating, where β_A : R^{d×n} → R^{n×m} and L_B : R^{n×m} → R^m are defined by

β_A(X) = colsort(X^T A), L_B(Z) = (⟨b_1, z_1⟩, . . ., ⟨b_m, z_m⟩),

where b_i and z_i denote the i-th columns of B and Z. Note that it follows that β_A is invariant and separating as well.
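Our reading of the matrix form above as code (β_A column-sorts X^T A; L_B pairs the i-th column of B with the i-th column of its input):

```python
# Matrix form of Remark 2.3: beta_A(X) = colsort(X^T A) and
# L_B(Z)_i = <b_i, z_i>, the diagonal of B^T Z.
import numpy as np

def beta_A(X, A):
    return np.sort(X.T @ A, axis=0)      # sort each column independently

def L_B(Z, B):
    return np.einsum('ij,ij->j', B, Z)   # column-wise inner products

d, n = 2, 3
m = 2 * n * d + 1
rng = np.random.default_rng(2)
A = rng.standard_normal((d, m))
B = rng.standard_normal((n, m))

X = rng.standard_normal((d, n))
Y = X[:, [2, 0, 1]]                       # a column permutation of X

print(np.allclose(L_B(beta_A(X, A), B), L_B(beta_A(Y, A), B)))  # True
```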
In [6], Balan et al. consider invariant maps which are compositions of β_A as defined above with general linear maps L : R^{n×m} → R^{2nd}. Under the assumption that m > (d − 1)n!, and that the parameters defining A and L are generic, they show that these maps are separating and, moreover, Bi-Lipschitz (with respect to the Euclidean metric on the output space and a natural metric on the input quotient space R^{d×n}/S_n). Thus the main differences between Balan's results and the results here are:

1. Balan's proof requires more than n! measurements to guarantee separation of β_A, while we only require 2nd + 1 measurements.
2. We consider compositions of β A with sparse linear mappings L B (these same mappings are suggested in [65]).In contrast, Balan considers general linear mappings L which are defined by n times more parameters than L B .
3. Balan's results show that β_A and L ∘ β_A are Bi-Lipschitz. We do not consider this important aspect in this paper. We note that Balan shows that β_A is Bi-Lipschitz whenever β_A is separating. Thus their results coupled with our own show that β_A is Bi-Lipschitz even when m = 2nd + 1. The Bi-Lipschitzness of our sparse L_B was not directly addressed in [6], and we leave this to future work.

Orthogonal invariance
We now consider the action of the group of orthogonal matrices O(d) on R^{d×n} via multiplication from the left.
We consider a polynomial family of invariants of the form

p(X, w) = ∥Xw∥², X ∈ R^{d×n}, w ∈ R^n. (10)

For fixed X, w the cost of computing this invariant is O(n · d). This choice of invariants is a natural generalization of the type of invariants encountered in phase retrieval (see the discussion in Subsection 0.2 and Remark 2.6). It can also be seen as a realization of the invariant theory based methodology discussed in Subsection 1.4. The ring of invariant polynomials is generated by the inner product polynomials ⟨x_i, x_j⟩, 1 ≤ i ≤ j ≤ n. It is thus also generated by the polynomials

∥x_i∥², 1 ≤ i ≤ n, and ∥x_i − x_j∥², 1 ≤ i < j ≤ n, (11)

since these polynomials have the same linear span as the inner product polynomials. These invariants are obtained from the squared norm invariant on R^d by polarization, and so are all of the form (10) for an appropriate choice of w ∈ R^n: that is, p(X, w = e_i) = ∥x_i∥² and p(X, w = e_i − e_j) = ∥x_i − x_j∥².
Our result is now an easy consequence of the discussion so far and Theorem 1.7.

Proposition 2.4. Let M be a semi-algebraic subset of R^{d×n} of dimension D_M, which is stable under the action of O(d). If m ≥ 2D_M + 1, then for Lebesgue almost every w_1, . . ., w_m ∈ R^n, the invariant polynomials X ↦ ∥Xw_i∥², i = 1, . . ., m are separating with respect to the action of O(d).

Proof. By Theorem 1.7 it is sufficient to show that the family of invariant functions p is strongly separating, and as they are polynomials we only need to show separation. It is sufficient to show that the finite collection of norm polynomials ∥Xw∥², with w ranging over the vectors e_i and e_i − e_j, is separating, which as mentioned above is equivalent to showing that the inner product polynomials ⟨x_i, x_j⟩ are separating. This is just the known fact that the Gram matrix X^T X determines X uniquely up to orthogonal transformation; see, e.g., Lemma 2.7 and its proof in the appendix.
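A minimal numpy sketch of the O(d) invariants p(X, w) = ∥Xw∥² (dimensions and seed are illustrative):

```python
# O(d) invariants p(X, w) = ||X w||^2, evaluated at m = 2*d*n + 1 random w's
# (M = R^{d x n}, so D_M = d*n). Left-multiplying X by an orthogonal Q does
# not change ||X w||, so the invariants agree on a whole O(d) orbit.
import numpy as np

def p(X, w):
    return float(np.sum((X @ w) ** 2))

d, n = 3, 5
rng = np.random.default_rng(3)
ws = rng.standard_normal((2 * d * n + 1, n))

X = rng.standard_normal((d, n))
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # a random orthogonal matrix
Y = Q @ X                                          # same O(d) orbit

inv = lambda Z: np.array([p(Z, w) for w in ws])
print(np.allclose(inv(X), inv(Y)))   # True
```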
We note that Proposition 2.4 (with a slightly smaller number of separating invariants) can also be deduced immediately from Theorem 4.9 in [49] which discusses the equivalent problem of separating rank one matrices using rank one linear measurements.

Special orthogonal invariance
We now turn to the action of the special orthogonal group SO(d) = {R ∈ O(d) | det(R) = 1} on R^{d×n} by multiplication from the left. The invariant ring for this group action is generated by the polynomials in (11) together with the invariant polynomials

det(x_{i_1}, . . ., x_{i_d}), 1 ≤ i_1 < · · · < i_d ≤ n.

Accordingly, the generators can all be realized by specific choices of (w, W) from the family of polynomial invariants

p(X, w, W) = ∥Xw∥² + det(XW), w ∈ R^n, W ∈ R^{n×d}. (13)

The complexity of calculating each invariant (for fixed w, W) is dominated by the matrix product XW, which with the standard method for matrix multiplication requires O(n · d²) operations. We can easily prove that this family of invariants separates orbits:

Proposition 2.5. Let n ≥ d, and let M be a semi-algebraic subset of R^{d×n} of dimension D_M, which is stable under the action of SO(d).
If m ≥ 2D_M + 1, then for Lebesgue almost every (w_1, W_1), . . ., (w_m, W_m) ∈ R^n × R^{n×d}, the invariant polynomials X ↦ p(X, w_i, W_i), i = 1, . . ., m are separating with respect to the action of SO(d).
Proof. By Theorem 1.7 it is sufficient to show that the family of invariant functions p is strongly separating, and as they are polynomials we only need to show separation. Let X, Y ∈ R^{d×n} which do not have the same orbit. If X and Y are not related by any orthogonal transformation, then we already showed that they can be separated by the 'norm polynomials'. We now need to consider the case where X and Y are not related by a rotation, but are related by X = RY, where R is orthogonal with det(R) = −1. In this case X and Y have the same rank. Moreover, they must be full rank, since otherwise we could multiply R by an orthogonal transformation R_0 with det(R_0) = −1 which fixes the column span of Y, and obtain X = RR_0 Y with det(RR_0) = 1, in contradiction to the fact that X and Y do not have the same SO(d) orbit. Since Y is full rank, we can choose W with det(YW) ≠ 0; then det(XW) = det(R) det(YW) = −det(YW) ≠ det(YW), while ∥Xw∥ = ∥Yw∥ for all w, and so the family separates X and Y.

Remark 2.6. For d = 2 there are more efficient invariants than the ones we suggest here: as mentioned previously, known results [16,5] on complex phase retrieval state that for generic m = 4n − 4 complex vectors w^{(1)}, . . ., w^{(m)} in C^n, the maps z ↦ |⟨z, w^{(j)}⟩|², j = 1, . . ., m separate orbits of the action of S^1 on C^n. Note that the map z ↦ ⟨z, w^{(j)}⟩ is linear. Identifying C^n ≅ R^{2×n} and S^1 ≅ SO(2), we see that there are m linear SO(2) equivariant maps W^{(1)}, . . ., W^{(m)} from R^{2×n} to R^2, modeling multiplication in C, such that the map

X ↦ (∥W^{(j)}(X)∥²)_{j=1,...,m} (15)

is SO(2) invariant and separates orbits. Each one of these linear maps W^{(j)} is parameterized by 2n real numbers, while our invariants in (13) are parameterized by 3n parameters (when d = 2). When d ≠ 2, it would be natural to look for separating invariants of the form (15), where the W^{(j)} are SO(d) equivariant linear maps from R^{d×n} to R^d, and avoid the additional determinant term we use in (13). However, in Proposition A.1 in the appendix we show that when d ≠ 2, the only linear SO(d) equivariant maps are of the form X ↦ Xw with w ∈ R^n. These maps are also O(d) equivariant, and as a result ∥Xw∥ is O(d) invariant. It follows that these maps cannot separate point clouds which are related by reflections but do not have the same SO(d) orbit.
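A sketch of SO(d) invariants combining a norm term with a determinant term; the exact way the paper's family combines the two generator types is not fully shown above, so treat the formula for p below as an illustrative reading rather than the paper's definition:

```python
# SO(d) invariants (illustrative combination): p(X, w, W) = ||X w||^2 + det(X W).
# Both terms are SO(d) invariant; det(XW) flips sign under reflections
# (det = -1), which is what lets the family distinguish mirror-image orbits.
import numpy as np

def p(X, w, W):
    return float(np.sum((X @ w) ** 2) + np.linalg.det(X @ W))

d, n = 3, 4
rng = np.random.default_rng(4)
X = rng.standard_normal((d, n))
R, _ = np.linalg.qr(rng.standard_normal((d, d)))
if np.linalg.det(R) < 0:
    R[:, 0] *= -1                    # force det(R) = +1, i.e. a rotation
w, W = rng.standard_normal(n), rng.standard_normal((n, d))

print(np.isclose(p(R @ X, w, W), p(X, w, W)))   # True: rotation invariance
F = np.diag([-1.0, 1.0, 1.0])                   # a reflection, det(F) = -1
print(np.isclose(np.linalg.det(F @ X @ W), -np.linalg.det(X @ W)))  # True
```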

Isometry groups for non-degenerate bi-linear forms
The next examples we consider are isometry groups for non-degenerate bi-linear forms.As usual we assume n ≥ d, and we are given a symmetric invertible Q ∈ R d×d which induces a symmetric bi-linear form ⟨x, Qy⟩, x, y ∈ R d .
We define a Q-isometry as a matrix U ∈ R^{d×d} such that U^T QU = Q, so that the symmetric bi-linear form defined by Q is preserved by U: ⟨Ux, QUy⟩ = ⟨x, Qy⟩ for all x, y ∈ R^d. We denote the group of Q-isometries by O_Q(d). When Q = diag(1, −1, −1, −1), O_Q(d) is the Lorentz group, which (together with translations) is an important symmetry group in special relativity, and has been discussed in the context of invariant machine learning for physics simulations [57,11].
We consider the task of finding separating invariants for the action of O_Q(d) on R^{d×n} by multiplication from the left. A natural place to start is the Q-Gram matrix X^T QX, whose coordinates are the Q-inner products ⟨x_i, Qx_j⟩, 1 ≤ i ≤ j ≤ n. Indeed, at least for the Lorentz group it is known that the Q-inner product polynomials are generators [57]. When Q is positive definite, the Q-inner products do indeed separate orbits. When Q is not positive definite, this is no longer always true: consider the following example with d = 2, n = 2. We can choose X and Y whose Q-Gram matrices are both zero, while X and Y cannot be related by a Q-isometry, since Q-isometries are invertible. However, the Q-inner products do separate orbits when restricted to the set of full rank matrices R^{d×n}_{full}. The following lemma, proved in the appendix, formulates this claim; the proof is essentially taken from [26]. Once we know that the Q-inner products separate orbits on R^{d×n}_{full}, we proceed as we did for O(d). We see that the 'Q-norm polynomials' span the Q-inner product polynomials and hence are also separating on R^{d×n}_{full}. We can then prove an analogue of Proposition 2.4 using the same arguments used there:

Proposition 2.8. Let n ≥ d, let Q ∈ R^{d×d} be symmetric and invertible, and let M be a semi-algebraic subset of R^{d×n}_{full} of dimension D_M, which is stable under the action of O_Q(d). If m ≥ 2D_M + 1, then for Lebesgue almost every w_1, . . ., w_m in R^n, the invariant polynomials X ↦ ⟨Xw_i, QXw_i⟩, i = 1, . . ., m are separating with respect to the action of O_Q(d).
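A small numerical check of the O_Q(d) invariants ⟨Xw, QXw⟩, here for the Minkowski-type form Q = diag(1, −1) and a hyperbolic boost (a toy choice of ours):

```python
# O_Q(d) invariants p(X, w) = <X w, Q X w>. A Lorentz boost U satisfies
# U^T Q U = Q, so p is unchanged when X is replaced by U X.
import numpy as np

Q = np.diag([1.0, -1.0])
t = 0.7
U = np.array([[np.cosh(t), np.sinh(t)],
              [np.sinh(t), np.cosh(t)]])   # a Q-isometry (hyperbolic boost)
assert np.allclose(U.T @ Q @ U, Q)

def p(X, w):
    v = X @ w
    return float(v @ Q @ v)

rng = np.random.default_rng(5)
X = rng.standard_normal((2, 5))
w = rng.standard_normal(5)
print(np.isclose(p(U @ X, w), p(X, w)))    # True
```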

Special linear invariance
We now give a full treatment of the group action described in Example 0.1. We consider the action of the special linear group SL(d) = {A ∈ R^{d×d} | det(A) = 1} on R^{d×n} by multiplication from the left. The generators of the ring of invariants are given by the determinant polynomials [35]

[i_1, . . ., i_d](X) = det(x_{i_1}, . . ., x_{i_d}), 1 ≤ i_1 < · · · < i_d ≤ n, (16)

which we have already encountered in Subsection 2.3. The generators cannot separate matrices in R^{d×n} which are not full rank, since for such matrices we will always get zero determinants. In Proposition A.2 in the appendix we give an elementary proof that the determinant polynomials from (16) separate orbits on R^{d×n}_{full}. The separation of the determinant polynomials in (16), together with Theorem 1.7, implies

Proposition 2.9. Let n ≥ d, and let M be a semi-algebraic subset of R^{d×n}_{full} of dimension D_M, which is stable under the action of SL(d). If m ≥ 2D_M + 1, then for Lebesgue almost every W_1, . . ., W_m in R^{n×d}, the invariant polynomials X ↦ det(XW_i), i = 1, . . ., m are separating with respect to the action of SL(d).
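The SL(d) determinant invariants X ↦ det(XW) in a few lines (random dimensions and seed are illustrative):

```python
# SL(d) invariants X -> det(X W) with random W in R^{n x d}; invariant since
# det(A X W) = det(A) det(X W) = det(X W) whenever det(A) = 1.
import numpy as np

d, n = 2, 4
rng = np.random.default_rng(6)
X = rng.standard_normal((d, n))
A = rng.standard_normal((d, d))
A /= np.abs(np.linalg.det(A)) ** (1.0 / d)   # rescale so |det(A)| = 1
if np.linalg.det(A) < 0:
    A[0, :] *= -1                            # flip a row to force det(A) = +1
Ws = [rng.standard_normal((n, d)) for _ in range(2 * d * n + 1)]

inv = lambda Z: np.array([np.linalg.det(Z @ W) for W in Ws])
print(np.allclose(inv(A @ X), inv(X)))   # True
```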

Translation
We consider the action of R^d on R^{d×n} by translation: (t, X) ↦ X + t1_n^T. We can easily compute n·d separating invariants for this action: examples include the mapping X ↦ X − x_1 1_n^T suggested in [57], or the centralization mapping cent(X) = X − (1/n) X 1_n 1_n^T. The centralization mapping is equivariant w.r.t. the action of multiplication by a matrix A ∈ GL(R^d) from the left and a permutation matrix P from the right, that is,

cent(AXP) = A cent(X) P.

It follows that if f : R^{d×n} → R^m is invariant with respect to some group G which is a subgroup of GL(R^d) × S_n, then f(cent(X)) will be invariant with respect to the group ⟨G, R^d⟩ generated by G and the translation group. Additionally, if f separates orbits w.r.t. the action of G, then f ∘ cent separates orbits with respect to the action of ⟨G, R^d⟩. To see this, note that if X, Y ∈ R^{d×n} and f(cent(X)) = f(cent(Y)), then since f separates orbits there exists some (A, P) ∈ G ≤ GL(R^d) × S_n such that cent(X) = A cent(Y) P^T, and so X is obtained from Y by translation by the mean of Y, followed by a G action, and translation by the mean of X.
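The centralization trick as code, with a quick numerical check of both the translation removal and the equivariance identity cent(AXP) = A cent(X) P:

```python
# cent(X) = X - (1/n) X 1 1^T subtracts the centroid from each column; it is
# equivariant to left GL(d) multiplication and right permutation, so composing
# any G-invariant f with cent adds translation invariance for free.
import numpy as np

def cent(X):
    return X - X.mean(axis=1, keepdims=True)

rng = np.random.default_rng(7)
d, n = 3, 5
X = rng.standard_normal((d, n))
t = rng.standard_normal((d, 1))            # a translation vector
A = rng.standard_normal((d, d))            # a left linear action
P = np.eye(n)[rng.permutation(n)]          # a right permutation action

print(np.allclose(cent(X + t), cent(X)))              # True: translation removed
print(np.allclose(cent(A @ X @ P), A @ cent(X) @ P))  # True: equivariance
```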

Scaling
We consider the action of R >0 = {x > 0} on R d×n by scaling (scalar-matrix multiplication).In this case there are no non-constant invariant polynomials, or in fact any non-constant invariants which are continuous on all of R d×n .This is because the orbit of each X ∈ R d×n contains the zero matrix 0 ∈ R d×n in its closure.However, we can easily come up with non-polynomial separating invariants with singularities at zero, such as X → ∥X∥ −1 X, where ∥ • ∥ denotes some norm on R d×n .If we choose the Frobenius norm ∥ • ∥ F , this mapping is equivariant with respect to multiplication by an orthogonal matrix from the left and a permutation matrix from the right.As a result, if f : R d×n → R m is invariant with respect to some group G which is a subgroup of O(d) × S n , then X → f (∥X∥ −1 F X) will be invariant with respect to the group generated by G and the scaling group.Additionally, if f separates orbits with respect to the G action, then X → f (∥X∥ −1 F X) separates orbits with respect to the group generated by G and the scaling group.
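The scaling normalization in code (Frobenius norm, as in the text):

```python
# Scale invariance: X -> X / ||X||_F maps a whole scaling orbit to one point,
# and is equivariant to left O(d) and right permutation actions.
import numpy as np

def normalize(X):
    return X / np.linalg.norm(X)   # np.linalg.norm is Frobenius for matrices

rng = np.random.default_rng(8)
X = rng.standard_normal((3, 5))
print(np.allclose(normalize(2.5 * X), normalize(X)))   # True
```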

General Linear invariance
We consider the problem of finding separating invariants for the action of the general linear group GL(R^d) on R^{d×n} by multiplication from the left. There are no non-constant polynomial invariants for this action, since this is already the case for the scaling group, which is a subgroup of GL(R^d). We consider a family of rational invariants q(X, W), W ∈ R^{n×d}. The function q is well defined on R^{d×n}_{full} × R^{n×d}, and for fixed W the function X ↦ q(X, W) is GL(R^d)-invariant. We prove:

Proposition 2.10. Let n ≥ d, and let M be a semi-algebraic subset of R^{d×n}_{full} of dimension D_M, which is stable under the action of GL(R^d). If m ≥ 2D_M + 1, then for Lebesgue almost every W_1, . . ., W_m in R^{n×d}, the invariant functions X ↦ q(X, W_i), i = 1, . . ., m are separating with respect to the action of GL(R^d).
Proof. By Theorem 1.7 it is sufficient to show that the family of rational functions q is strongly separating. In fact, since q(X, W) is polynomial in W for every fixed X, it is sufficient to show orbit separation. Let X, Y ∈ R^{d×n}_{full} be two full rank point clouds whose orbits do not intersect. Since X is full rank, it has d columns which are linearly independent; for simplicity of notation we assume these are the first d columns. If the first d columns of Y are not linearly independent, then W can be chosen so that q(·, W) vanishes at Y but not at X, and so it separates the two points. Thus we can assume that the first d columns of Y are linearly independent. It follows that the matrix A defined uniquely by the equations Ax_i = y_i, i = 1, . . ., d is non-singular.
By assumption, AX ≠ Y, so there exists some index j, d < j ≤ n, such that Ax_j ≠ y_j. Since y_1, . . ., y_d span R^d, there exist α_1, . . ., α_d and β_1, . . ., β_d such that

Ax_j = Σ_{i=1}^d α_i y_i and y_j = Σ_{i=1}^d β_i y_i,

and since Ax_j ≠ y_j there exists some k, 1 ≤ k ≤ d, such that α_k ≠ β_k. Let W_1 ∈ R^{n×n} be a matrix such that for all Z = [z_1, . . ., z_n] ∈ R^{d×n} we have

ZW_1 = [z_1, . . ., z_{k−1}, z_j − Σ_{i=1}^d β_i z_i, z_{k+1}, . . ., z_n].

Then the first d columns of Y W_1 have rank d − 1, while the first d columns of AXW_1, and therefore also of XW_1, have rank d, and a suitable choice of W then separates X from Y. Thus we have shown that q separates orbits.
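The text does not display the concrete formula for q, so the rational family below is an illustrative stand-in of ours with the stated properties (well defined for full rank X, polynomial in W for fixed X, and GL(R^d)-invariant), not necessarily the paper's q:

```python
# A rational GL(d)-invariant family: q(X, W) = det(X W)^2 / det(X X^T).
# Invariance check: q(A X, W) = det(A)^2 det(XW)^2 / (det(A)^2 det(X X^T))
#                             = q(X, W) for any invertible A.
import numpy as np

def q(X, W):
    return float(np.linalg.det(X @ W) ** 2 / np.linalg.det(X @ X.T))

rng = np.random.default_rng(9)
d, n = 2, 4
X = rng.standard_normal((d, n))      # full rank with probability 1
A = rng.standard_normal((d, d))      # invertible with probability 1
W = rng.standard_normal((n, d))

print(np.isclose(q(A @ X, W), q(X, W)))   # True
```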

Intractable separation for permutation actions on graphs
Consider the action of the permutation group S_n on the vector space R^{n×n} by conjugation: given a permutation matrix P ∈ S_n and a matrix X ∈ R^{n×n}, this action is defined as (P, X) ↦ P XP^T. If A is an adjacency matrix of a graph, applying a relabeling σ to the vertices creates a new graph, isomorphic to the previous one, whose adjacency matrix A′ is given by A′ = P AP^T, where P is the matrix representation of the permutation σ. We are thus interested in studying this action on the set of weighted adjacency matrices M_weighted defined as

M_weighted = {X ∈ R^{n×n} | X = X^T and x_{ii} = 0, i = 1, . . ., n}.

We note that M_weighted is stable under the action of S_n, and has dimension D_weighted = (n² − n)/2. More generally, we will want to consider this action of S_n on S_n-stable semi-algebraic subsets M of M_weighted of arbitrary dimension D_M. For example, the collection of all binary (unweighted) graphs can be parameterized by the finite S_n-stable set M_binary = {X ∈ M_weighted | x_{ij} ∈ {0, 1}}. Another natural example is (weighted or unweighted) graphs of bounded degree.
Let us now consider the task of constructing separating invariants for the action of S_n on a semi-algebraic stable subset M ⊆ M_weighted of dimension D_M. As our discussion suggests, we will be able to find such separating invariants of dimension 2D_M + 1. However, the computational effort involved in computing the invariants in our constructions grows super-polynomially in n. This is not surprising, as a polynomial time algorithm for computing separating invariants for the action of S_n on M_binary would lead to a polynomial time algorithm for the notoriously hard Graph Isomorphism problem (see [27]).
One simple separating family for the action of S_n on M_weighted is given by polynomials of the form

p(X, W) = Π_{P ∈ S_n} ∥X − P W P^T∥²_F, W ∈ R^{n×n}.

Clearly for fixed W the polynomial X ↦ p(X, W) is permutation invariant, and separation follows from the fact that if X, Y ∈ M_weighted and X ̸∼ Y, then taking W = X we obtain p(X, W) = 0 ≠ p(Y, W).
Thus by Theorem 1.7, we can obtain m = 2D_M + 1 separating invariants for the action of S_n on M: for Lebesgue almost every (W_1, . . ., W_m) ∈ (R^{n×n})^m, the functions X ↦ p(X, W_i), i = 1, . . ., m are invariant and separating. Note however that the degree of these polynomials is 2 · n!, and so computing these invariants is not tractable.
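A brute-force sketch of a separating family of this kind for tiny n; the product-over-permutations form below is our reconstruction, consistent with the stated degree 2 · n! and with the W = X separation argument above:

```python
# p(X, W) = product over permutation matrices P of ||X - P W P^T||_F^2.
# Invariant in X (relabeling X just reindexes the product) and p(X, X) = 0.
import numpy as np
from itertools import permutations

def p(X, W):
    n = X.shape[0]
    out = 1.0
    for perm in permutations(range(n)):
        P = np.eye(n)[list(perm)]
        out *= float(np.sum((X - P @ W @ P.T) ** 2))
    return out

rng = np.random.default_rng(10)
n = 3
M = rng.standard_normal((n, n))
X = (M + M.T) / 2
np.fill_diagonal(X, 0.0)                 # a weighted adjacency matrix
P0 = np.eye(n)[rng.permutation(n)]
Y = P0 @ X @ P0.T                        # an isomorphic graph
W = rng.standard_normal((n, n))

print(np.isclose(p(Y, W), p(X, W)))      # True: invariance under relabeling
print(p(X, X) == 0.0)                    # True: the identity factor vanishes
Z = X.copy()
Z[0, 1] = Z[1, 0] = X[0, 1] + 1.0        # a non-isomorphic graph
print(p(Z, X) > 0)                       # True
```

The n! factors in the product make this useful only as a conceptual check, matching the intractability noted in the text.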

Generic separation
Generic separation is a relaxed notion of separability which is often easier to achieve than full separation:

Definition 3.1 (Generic separation). Let G be a group acting on a semi-algebraic set M, let Y be a set, and let f : M → Y be an invariant function. We say that f is generically separating on M, with singular set N, if N ⊆ M is a semi-algebraic set which is stable under the action of G, satisfies dim(N) < dim(M), and for every x ∈ M \ N, if there exists some y ∈ M such that f(x) = f(y), then x ∼_G y.
Note that being generically separating on M is slightly stronger than being separating on M \ N, since the latter would only consider y ∈ M \ N. Some possible practical disadvantages of generically separating invariants in comparison to fully separating invariants were discussed in Subsection 0.1. Our purpose in this section is to show that achieving generic separation is easier than achieving full separation in two respects:

1. While 2D_M + 1 separating invariants can be obtained by randomly choosing parameters of families of strongly separating invariants, for generic separation D_M + 1 invariants suffice. We discuss this next in Subsection 3.1.
2. More importantly, for some group actions it is easy to come up with generic separators which can be computed efficiently, while obtaining true separators with low complexity seems out of reach. This is discussed in Subsection 3.2.
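To make the distinction concrete, here is a minimal sketch of a cheap permutation-invariant map for weighted graphs: the sorted spectrum of the weight matrix. It is invariant under conjugation by permutation matrices, but it is well known not to be fully separating (non-isomorphic cospectral graphs exist). This specific map is our illustration only, not necessarily the construction of Subsection 3.2:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n)); A = A + A.T   # weighted graph on n nodes
P = np.eye(n)[rng.permutation(n)]              # random permutation matrix
B = P @ A @ P.T                                # an isomorphic graph

# Eigenvalues are invariant under conjugation, so the sorted spectra agree.
spec = lambda M: np.linalg.eigvalsh(M)         # returned in ascending order
assert np.allclose(spec(A), spec(B))
```

Computing the spectrum costs O(n³), in contrast with the degree-2·n! separating polynomials discussed earlier.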

Generic separation from generic families of separators
In this section we prove an analogue of Theorem 1.7 for generic invariants. The notion of generic separating invariants was defined in Definition 3.1. We now define this notion for families of invariants:

Definition 3.2 (Strong generic separation for invariant families). Let G be a group acting on a semi-algebraic set M. We say that a family of semi-algebraic functions p : M × R^{D_w} → R strongly separates orbits generically on M, with respect to a singular set N, if N ⊆ M is a semi-algebraic set which is stable under the action of G, satisfies dim(N) < dim(M), and for every x ∈ M \ N and y ∈ M with x ̸∼_G y, the set

{w ∈ R^{D_w} | p(x, w) = p(y, w)}

has dimension at most D_w − 1.

We can now state an analogue of Theorem 1.7 for generic separating invariants. As mentioned above, the cardinality for generic separation is D_M + 1, and not the 2D_M + 1 we have in Theorem 1.7 for full separation.
Our goal next is to bound the dimension of the fiber π_W^{-1}(W) over W. Let U_1, . . ., U_K be a stratification of B_m, so that each U_k is a manifold and ∪_{k=1}^K U_k = B_m. For every fixed k, if the dimension of π_W(U_k) is less than mD_w, then almost every W will not be in the projection, and so the intersection of the fiber over these W with U_k will be empty. Now let us assume that the dimension of π_W(U_k) is mD_w. By Sard's theorem [37], almost every W in π_W(U_k) is a regular value of the restriction of π_W to U_k. By the pre-image theorem [56], every regular value W is either not in the image, or the dimension of its fiber π_W^{-1}(W) ∩ U_k is dim(U_k) − mD_w ≤ dim(B_m) − mD_w < D_M. It follows that for almost every W = (w_1, . . ., w_m), the fiber over W has dimension strictly smaller than D_M. Thus this is also true for the projection of the fiber onto the x coordinate, which is the set

N_W = {x ∈ M | there exists y ∈ M with x ̸∼_G y but p(x, w_j) = p(y, w_j), ∀j = 1, . . ., m}.

Since dim(N_W) < D_M, it follows that for such W = (w_1, . . ., w_m), the invariants p(·, w_i), i = 1, . . ., m are generically separating on M with singular set N ∪ N_W.
Proof (sketch) of Lemma 4.2. We will first move to the complex setting, where we can use Bézout's theorem; then we will restrict our attention to the real locus. To move to the complex setting, we now think of p(x, w) as a polynomial over complex variables. We define x ∼_p y for x, y ∈ C^D if p(x, w) = p(y, w) for all w ∈ C^{D_w}. Note that under this definition, there may be x, y ∈ R^D such that x ∼_G y while x ̸∼_p y (due to some non-real-valued, separating w). What is true is that x ̸∼_G y implies x ̸∼_p y.
Define the polynomial over C^{2D+D_w}: q(x, y, w) = p(x, w) − p(y, w). This has degree r. For i = 1, . . ., m, we can define the polynomials q_i(x, y, w_1, . . ., w_m) := q(x, y, w_i). Together, these q_i define a variety V in C^{2D+mD_w}, which contains the bundle B of tuples (x, y, w_1, . . ., w_m) with x ̸∼_p y and q_i(x, y, w_1, . . ., w_m) = 0 for all i. By assumption, the set of pairs with x ∼_p y must satisfy q(x, y, w) = 0 for all w. Taking the intersection over a sufficiently large finite collection of such w must stabilize and define a strict subvariety U of p-equivalent pairs. B is obtained from V by removing U. This can remove some of the components of V, as well as some nowhere-dense subsets of other components. Thus the Zariski closure B̄ must consist of some subset of the components of V. Note that the Zariski closure does not increase the dimension of B, which is bounded from above by 2D + mD_w − m (using the argument from Theorem 1.7 in the complex setting). Let us stratify B̄ into pure-dimensional algebraic sets V_i. From Bézout's theorem (see [29, Chapter 18] and see [54], especially Remark 2), and the fact that V is defined using the intersection of m varieties, each of degree r, each of these V_i (made up of components of V) is of degree at most r^m. There are at most 2D + mD_w − m such i.
Next we project each of these V_i onto C^{mD_w}. By our assumption that m ≥ 2D + 1, we know that this projection is of dimension less than mD_w. The image of a fixed V_i can be stratified into constructible sets V_ij of pure dimension. There are at most mD_w − 1 of these. From Bézout, the closure of each such V_ij is a variety of degree at most r^m (and of this same pure dimension). Each V_ij is contained in an algebraic hypersurface of at most the same degree. Taking the union over these (mD_w − 1)(2D + mD_w − m) hypersurfaces shows that the image of this projection must satisfy a single non-trivial polynomial equation F, whose degree is at most (mD_w − 1)(2D + mD_w − m)r^m.
Let us define a set of w_1, . . ., w_m, each in C^{D_w}, to be p-bad if there are x, y, each in C^D, with x ̸∼_p y but such that, for all i, we have p(x, w_i) = p(y, w_i). By construction, the p-bad set lies in the zero set of F. Let us define a set of w_1, . . ., w_m, each in R^{D_w}, to be G-bad if there are x, y, each in R^D, with x ̸∼_G y but such that, for all i, we have p(x, w_i) = p(y, w_i). The G-bad set lies in the real locus of the p-bad set. Thus it lies in the zero sets of F_r and F_i, the real polynomials defined by taking respectively the real and imaginary components of the coefficients of F. At least one of these two polynomials is non-zero. Such a non-zero polynomial gives us our f in the statement of the lemma.
It is not as clear how to cover the full real semi-algebraic setting.

A Toy Experiment
To visualize the possible implications of our results for invariant machine learning, we consider the following toy example: we create random high-dimensional point clouds in R^{3×1024} which reside in an S_n-invariant (n = 1024) 'manifold' M of low intrinsic dimension D_M. In fact, M is a union of two invariant 'manifolds' M = M_0 ∪ M_1 of dimension D_M, and we consider the problem of learning the resulting binary classification task.
The binary classification task is visualized in Figure 1(a). In this figure D_M = 1, each of M_0, M_1 is a line in R^{3×1024} together with all its possible permutations, and points in M_0, M_1 are projected onto R^3 for visualization. While this data may appear hopelessly entangled, using the permutation invariant mapping we describe in (9) with m = 2D + 1 = 3 to embed M_0, M_1 into R^3, we obtain very good separation of the initial curves, as shown in Subplot (b). Note that the non-intersection of the images of M_0, M_1 is guaranteed by Proposition 2.1.
In Figure 1(c) we show the results obtained for the binary classification task by first computing the invariant embedding in (9) with randomly chosen weights, and then applying an MLP (Multi-Layer Perceptron) to the resulting embedding. The results on train and test data are shown for various choices of intrinsic dimension D = D_M and embedding dimension m. In particular, for the D = 1, m = 3 case visualized in Figure 1(a)-(b) we get 98% accuracy on the test dataset.
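The embedding step can be sketched as follows. We use a sort-based permutation-invariant map in the spirit of (9); the exact parametrization below (projecting with a_i, sorting, then contracting with b_i) is our assumption, not a verbatim reproduction of (9):

```python
import numpy as np

def embed(X, A, B):
    # X: (d, n) point cloud; A: (m, d) and B: (m, n) random parameters.
    # Each coordinate projects the points onto a_i, sorts the n values
    # (killing the permutation), and contracts with b_i.
    return np.array([b @ np.sort(X.T @ a) for a, b in zip(A, B)])

rng = np.random.default_rng(2)
d, n = 3, 10
D = 1                                   # intrinsic dimension
m = 2 * D + 1                           # embedding dimension, here 3
A = rng.standard_normal((m, d))
B = rng.standard_normal((m, n))
X = rng.standard_normal((d, n))
Xp = X[:, rng.permutation(n)]           # the same cloud, points reordered
assert np.allclose(embed(X, A, B), embed(Xp, A, B))
```

An MLP classifier is then trained on the m-dimensional embedding rather than on the raw 3 × 1024 coordinates.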
The diagonal entries in the tables show the accuracy obtained for varying intrinsic dimensions D and embedding dimension m = 2D + 1. Recall that by Proposition 2.1 our embedding is separating for these (D, m) values, and thus theoretically perfect separation can be obtained by applying an MLP to the embedding. The diagonal entries in the tables show that indeed high accuracy is obtained for these (D, m) pairs. At the same time, we also see that taking higher-dimensional embeddings m > 2D + 1 leads to improved accuracy. This is consistent with the common observation that deep learning algorithms are more successful in the over-parameterized regime, as well as with results on phase retrieval [15] and random linear projections [7], where the embedding dimension needed for stable recovery is typically larger than the minimal dimension needed for injective embedding. In any case, we note that in all cases we obtain high accuracy with embedding dimension much smaller than the extrinsic dimension 3 × 1024 = 3072.
Additional details on the experimental setup can be found in Appendix B. Code for reproducing our experiment can be found in [1].

Conclusion and future work
The main result of this paper is the construction of a small number of efficiently computable separating invariants for various group actions on point clouds. Many interesting questions remain. One example is studying the optimal cardinality necessary for separation. As mentioned above, in phase retrieval it is known that the number of invariants needed for separation is slightly less than twice the dimension, and we believe this is the case for the other invariant separation problems we discuss here as well. Another important question, discussed e.g. in [6, 14], is understanding how stable given separating invariants are: separating invariants are essentially an injective mapping from the quotient space into R^m. Stability in this context means that the natural metric on the quotient space should not be severely distorted by the injective mapping.
Perhaps the most important challenge is translating the theoretical insights presented here into invariant learning algorithms with strong empirical performance, provable separation and universality properties, and reasonable computational complexity.
A useful direction for reducing computational complexity is 'settling' for generic separation, which, as we saw, can often be achieved with a small computational burden. In general, the downside of this is that there is a low-dimensional singular set on which there is no separation. This disadvantage will only be significant, for a given learning task, if a significant percentage of the data resides on or near the singular set. Therefore it could be useful to understand what the singular sets of various generic separators are, and what the likelihood of encountering them in specific data is.
We generated 1,000 training samples and 1,000 test samples by randomly choosing between M_i, i = 0, 1, then randomly choosing a probability vector t ∈ R^{D+1} and using it to define a point cloud as a convex combination of the D + 1 point clouds used to generate M_i, and finally applying a random permutation.
For classification we used the same MLP architecture used in PointNet [47], with input dimension m varying as shown in the table, and three hidden layers of dimensions 1024, 512, 256. Batch normalization was applied to each layer, we used dropout with p = 0.3, and we used Adam for optimization with learning rate 0.001. Each entry in the tables in Figure 1(c) shows the average accuracy over ten random initializations of the embedding and network parameters.
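For concreteness, the shape of this classification head can be sketched as follows (a minimal numpy forward pass only; batch normalization, dropout, and the Adam training loop are omitted, and the two-logit binary output layer is our assumption):

```python
import numpy as np

def mlp_forward(z, params):
    # Hidden layers 1024 -> 512 -> 256 with ReLU, then a linear output.
    for W, b in params[:-1]:
        z = np.maximum(z @ W + b, 0.0)
    W, b = params[-1]
    return z @ W + b

m = 3                                     # input: the embedding dimension
dims = [m, 1024, 512, 256, 2]
rng = np.random.default_rng(3)
params = [(rng.standard_normal((a, b)) * np.sqrt(2.0 / a), np.zeros(b))
          for a, b in zip(dims[:-1], dims[1:])]
logits = mlp_forward(rng.standard_normal((4, m)), params)   # batch of 4
assert logits.shape == (4, 2)
```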
For visualization purposes, in Figure 1(a) we only show 30 permutations applied to the original lines, and not all possible permutations. In Figure 1(a)-(b) we also took n = 10 rather than the n = 1024 used in Figure 1(c), since for larger values of n the lines in (b) have more oscillations and are more difficult to visualize.

Figure 1: Subplot (a) shows two lines in R^{3×n} and their image under (some) random S_n permutations, visualized by projecting into R^3. Subplot (b) shows the image of these lines (dimension D = 1) under the mapping we describe in (9) with m = 2D + 1 = 3. As guaranteed by Proposition 2.1, these images do not intersect. The tables in Subplot (c) show the training and test accuracy when training an MLP to classify the curves based on the embedding in (b), for various values of D and m.

Figure 2: Stratification of a semi-algebraic set. See explanation in main text.

The set {(x, y) ∈ M × M | x ̸∼_G y} is semi-algebraic, as it is the projection onto the (x, y) coordinates of the semi-algebraic set {(x, y, w) | p(x, w) ̸= p(y, w)}. It follows that for every m ∈ N the set B_m = {(x, y, w_1, . . ., w_m) | x ̸∼_G y and p(x, w_j) = p(y, w_j), j = 1, . . ., m} is semi-algebraic as well.

Proposition 2.4. Let n ≥ d, and let M be a semi-algebraic subset of R^{d×n} of dimension D_M which is stable under the action of O(d). If m ≥ 2D_M + 1, then for Lebesgue almost every w_1, . . ., w_m in R^n the invariant polynomials X ↦ ∥Xw_i∥², i = 1, . . ., m, are separating with respect to the action of O(d).
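A quick numerical check of the invariants in this proposition (our demonstration of invariance only; separation is the content of the proposition itself):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, m = 3, 8, 5
X = rng.standard_normal((d, n))                    # point cloud
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal matrix
W = rng.standard_normal((n, m))                    # columns are w_1, ..., w_m

# ||(Q X) w_i||^2 = ||X w_i||^2 for every orthogonal Q.
inv = lambda X: np.sum((X @ W) ** 2, axis=0)
assert np.allclose(inv(X), inv(Q @ X))
```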
For an invertible symmetric matrix Q, one can consider the group of matrices A satisfying A^T Q A = Q, which we denote by O_Q(d). The orthogonal group O(d) discussed earlier corresponds to Q = I_d. Indefinite orthogonal groups O(s, d − s) correspond to diagonal Q matrices with s positive unit entries and d − s negative unit entries. In particular, the Lorentz group O(3, 1) is of this form.

Theorem 3.3. Let G be a group acting on a semi-algebraic set M of dimension D_M. Let p : M × R^{D_w} → R be a family of G-invariant semi-algebraic functions. If p strongly separates orbits generically in M, then for Lebesgue almost every w_1, . . ., w_{D_M+1} ∈ R^{D_w}, the D_M + 1 G-invariant semi-algebraic functions p(·, w_i), i = 1, . . ., D_M + 1, generically separate orbits in M.

Proof of Theorem 3.3. Similarly to the proof of Theorem 1.7, we can consider the 'bad set'

B_m = {(x, y, w_1, . . ., w_m) ∈ (M \ N) × M × R^{mD_w} | x ̸∼_G y and p(x, w_i) = p(y, w_i), i = 1, . . ., m}

and repeat the dimension argument used there, together with our requirement that m ≥ D_M + 1, to obtain dim(B_m) ≤ 2D_M + m(D_w − 1) ≤ D_M + mD_w − 1.
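As a toy instance of the D_M + 1 count (our illustration, not a construction from the paper): for S_n acting on R^n, take p(x, w) = ∏_j (w − x_j), the characteristic polynomial of the multiset {x_j} evaluated at w. Since a monic degree-n polynomial is determined by its values at n + 1 distinct points, m = D_M + 1 = n + 1 random evaluations in fact separate orbits here:

```python
import numpy as np

def invariants(x, ws):
    # p(x, w) = prod_j (w - x_j), evaluated at each sample point w in ws.
    # Invariant under any reordering of the entries of x.
    return np.array([np.prod(w - x) for w in ws])

rng = np.random.default_rng(5)
n = 4
ws = rng.standard_normal(n + 1)        # D_M + 1 = n + 1 random evaluations
x = rng.standard_normal(n)
y = rng.standard_normal(n)             # a different multiset

assert np.allclose(invariants(x, ws), invariants(x[rng.permutation(n)], ws))
assert not np.allclose(invariants(x, ws), invariants(y, ws))
```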

Table 1: Summary of the results in Section 2. For each group action we describe a parametric family of separators, the complexity of computing each separator, the domain on which separation is guaranteed, and the number of generators for this group action. In comparison, the number of random separators needed in all examples is 2nd + 1 (or 2D_M + 1 when considering separation on a semi-algebraic subset M).

One disadvantage of generic separation is that it is sufficient to prove universality only for compact sets in M \ N. Another disadvantage is that it is not inherited by subsets: generic orbit separation on M is not necessarily preserved on a subset M′ ⊆ M, since it is even possible that M′ is completely contained in N. The advantage of generic separation is that it is generally easier to achieve. Indeed, while we need 2D_M + 1 random measurements for orbit separation (Theorem 1.7), only D_M + 1 measurements are sufficient for generic separation (Theorem 3.3). This result resembles classical results which show that for an irreducible algebraic variety of dimension D embedded in high dimension, almost every projective projection down to dimension D + 1 is generically one-to-one. (For example, see Example 7.15 and Exercise 11.23 in [29].)