1 Introduction

In recent years, DNNs have played an increasingly important role in 3D analysis. DNNs are capable of processing many types of 3D data, including multi-view images (Su et al. 2015; Qi et al. 2016; Yu et al. 2018), voxels (Maturana and Scherer 2015; Zhou and Tuzel 2018), point clouds (Qi et al. 2017a; Wang et al. 2019b; Fei et al. 2022), and particles (Schütt et al. 2017; Thomas et al. 2018; Satorras et al. 2021b). They have outperformed traditional methods and shown strong generalizability across a range of tasks, such as classification (Su et al. 2015; Qi et al. 2017a; Wang et al. 2019b), segmentation (Landrieu and Simonovsky 2018; Meng et al. 2019; Furuya et al. 2020), detection (Zhou and Tuzel 2018; Shi et al. 2019; Wang et al. 2023b), property prediction (Schütt et al. 2017; Satorras et al. 2021b), and generation (Hoogeboom et al. 2022; Guan et al. 2023).

Nonetheless, significant gaps exist between experiments and applications, restricting the actual deployment of DNNs. For example, most experiments are conducted under ideal settings with little noise, known data distributions, and canonical poses, conditions that cannot be fully met in practical applications. Among them, canonical poses are widely adopted in 3D research, where 3D data is first aligned manually and then processed by DNNs. However, such a setting leads to two main problems. First, these models may suffer severe performance drops when evaluated on non-aligned 3D data, as shown in previous works (Esteves et al. 2018a; Sun et al. 2019b; Zhao et al. 2022a). Zhao et al. (2020b) explore the fragility of 3D DNNs and achieve an over 95% success rate in black-box adversarial attacks by slightly rotating the evaluation 3D data. Second, these DNNs cannot be applied to tasks requiring output consistency. For example, the atomization energies of molecules are independent of their absolute positions and orientations (Blum and Reymond 2009; Rupp et al. 2012). If DNNs are trained with aligned molecules, they inevitably learn a nonexistent relationship between absolute coordinates and molecular properties and may overfit the training data. Such models are unreliable, as they cannot give the same prediction for arbitrarily rotated inputs. Many approaches have been proposed to address these problems; we summarize them as rotation invariant and equivariant methods in this survey.

Rotation invariance has been investigated in traditional 3D descriptors. Before the emergence of DNNs, most methods could only capture low-level geometric features based on transformation invariance. FPFH (Rusu et al. 2009) combines coordinates and estimated surface normals to define Darboux frames. Then it uses several angular variations to represent the surface properties. SHOT (Tombari et al. 2010) designs unique and unambiguous local reference frames (LRFs) to construct robust and expressive 3D descriptors. Drost et al. (2010) create a global description with point pair features (PPFs) composed of relative distances and angles. These descriptors can effectively handle tasks like pose estimation and registration. Recently, Horwitz and Hoshen (2023) revisit the importance of traditional descriptors for 3D anomaly detection. DNNs can learn high-level semantic features and accomplish complicated tasks, but they usually ignore rotation invariance and equivariance, making them unreliable for real-world applications. Existing works deal with this problem from different perspectives. T-Net (Qi et al. 2017a) directly regresses transformation matrices from raw point clouds to transform poses and features. ClusterNet (Chen et al. 2019b) constructs k nearest neighbors (kNN) graphs and computes several invariant distances and angles, which are fed into hierarchical networks for complicated downstream tasks. Tensor field networks (TFNs) (Thomas et al. 2018; Thomas 2019) are equivariant neural networks based on the irreducible representations of SO(3). They have a solid mathematical foundation and perform well over various tasks, including shape classification and RNA structure scoring.

Fig. 1

Overview of our survey. After the mathematical background is stated, rotation invariant and equivariant methods are introduced, respectively. Then we give a comprehensive overview of applications and datasets and point out future directions based on open problems. Best viewed in color

Many distinctive approaches have been developed for rotation invariance and equivariance. However, a comprehensive review of these methods is absent, making it challenging to keep pace with recent progress and to select appropriate methods for specific tasks. Therefore, we are motivated to write this survey and fill the gap. Our contributions can be summarized in three aspects. First, this survey systematically overviews existing works related to rotation invariance and equivariance, which are further divided into several subcategories based on their structures and mathematical foundations. Second, we unify the notations of different methods, providing an intuitive perspective for analysis and comparison. Third, we point out some open problems and propose future research directions based on them.

This paper is organized as shown in Fig. 1. In Sect. 2, we introduce the mathematical background of rotation invariance and equivariance, including the definition, commonly-used rotation groups, and evaluation metrics. Rotation invariant and equivariant methods are comprehensively overviewed and discussed in Sect. 3 and Sect. 4, respectively. Applications and datasets are reviewed in Sect. 5. In Sect. 6, we point out several future research directions based on unsolved problems. Notations are listed in Table 1 for better readability.

Table 1 Notations adopted in this survey

2 Background

This section introduces the background knowledge required to understand rotation invariance and equivariance. Basic concepts of group theory are beneficial for better comprehension. Readers may refer to textbooks such as Group Theory in Physics: An Introduction (Cornwell 1997) and Algebra (Artin 2013) for more details.

Invariance and equivariance have been formulated in much related work (Cohen and Welling 2016; Thomas et al. 2018; Cohen et al. 2018a, 2019a; Thomas 2019). However, those definitions cannot cover some methods in this survey. Thus, we deliberately adopt a broader definition to include them. Both strong and weak invariance and equivariance are stated in Definition 1. Compared with previous definitions, we introduce weak invariance and equivariance through the G-variant error so as to cover methods that do not satisfy Eq. 1. Note that determining C as an exact value is unnecessary, since any function is C-weakly equivariant if C is large enough (\(+\infty\)); C is therefore generally omitted in this survey. If a method is weakly equivariant, its G-variant error is relatively small or is reduced after appropriate training.

Definition 1

Suppose that G acts on \(\mathcal {X},\mathcal {Y}\), and \(f:\mathcal {X}\rightarrow \mathcal {Y},\ d:\mathcal {Y}\times \mathcal {Y}\rightarrow \mathbb {R}_{\ge 0}\) is a metric on \(\mathcal {Y}\).

f is strongly equivariant with respect to G, if

$$\begin{aligned} f\left( g\cdot x\right) =g\cdot f\left( x\right) ,\forall x\in \mathcal {X},g\in G. \end{aligned}$$
(1)

Meanwhile, f is C-weakly equivariant with respect to G, if

$$\begin{aligned} \int _{\mathcal {X}}\int _{G}d\left( f\left( g\cdot x\right) ,g\cdot f\left( x\right) \right) \text {d}\mu \left( g\right) \text {d}x<C. \end{aligned}$$
(2)

Specifically, if the group action of G on \(\mathcal {Y}\) is trivial, i.e., \(\forall g\in G,\forall y\in \mathcal {Y},g\cdot y=y\), then f is C-weakly/strongly invariant with respect to G. For discrete \(\mathcal {X}\) or G, the integration on the left side of Eq. 2 is substituted with summation. The integral is named the G-variant error, denoted by \(\mathcal {E}\left( f\right)\).

SO(3), O(3), SE(3), E(3), and their proper subgroups are the commonly-used groups that describe 3D rotation, reflection, and translation. Their differences are listed in Table 2. Unless otherwise specified, we focus on rotation in the 3D Euclidean space, and G is a subgroup of SO(3).

Table 2 The differences among SO(3), O(3), SE(3), and E(3)

Rotation invariant and equivariant methods require specific evaluation metrics to reflect the performances on certain tasks and the invariance/equivariance. Let us take a supervised learning task with N training samples \(\left\{ \left( x_i,y_i\right) \right\} _{i=1}^N\) as an example. \(f:\mathcal {X}\rightarrow \mathcal {Y}\) is the deep model and \(L:\mathcal {Y}\times \mathcal {Y}\rightarrow \mathbb {R}\) is the evaluation function. If there is no requirement on equivariance, the metric is computed as \(\mathcal {L}=\sum _{i}L\left( f\left( x_i\right) ,y_i\right)\). However, if equivariance is considered, the model f should consider \(L\left( f\left( g\cdot x_i\right) ,g\cdot y_i\right)\) for all \(g\in G\) instead of only \(L\left( f\left( x_i\right) ,y_i\right)\). Accordingly, the metric \(\mathcal {L}_G\) is given as

$$\begin{aligned} \mathcal {L}_G=\sum _{i}\int _{G}L\left( f\left( g\cdot x_i\right) ,g\cdot y_i\right) \text {d}\mu \left( g\right) . \end{aligned}$$
(3)

If f is strongly equivariant and L is strongly invariant, then \(\mathcal {L}_G\) degenerates into \(\mathcal {L}\). As the integration is computationally inefficient, most previous works approximate the metric with randomly-rotated samples.
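For illustration, the following minimal Python sketch approximates \(\mathcal {L}_G\) by Monte Carlo sampling of rotations, assuming an invariant task (so that \(g\cdot y=y\)); the names `f`, `L`, and `samples` are hypothetical stand-ins for the model, the evaluation function, and the dataset.

```python
import numpy as np
from scipy.spatial.transform import Rotation


def estimate_metric(f, L, samples, n_rotations=32):
    """Monte Carlo estimate of Eq. 3 for an invariant task (g . y = y)."""
    total = 0.0
    for x, y in samples:                                # x: (N, 3) points, y: label
        for R in Rotation.random(n_rotations).as_matrix():
            total += L(f(x @ R.T), y)                   # rotate the input, then evaluate
    return total / n_rotations
```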

3 Rotation invariant methods

Invariance is a particular and straightforward case of equivariance. Rotation invariant methods aim to produce the same or similar results for inputs with different poses. We present the essential ideas of these methods and discuss their advantages and drawbacks. Several milestone methods are shown in Fig. 2.

Fig. 2

Milestones of rotation invariant methods. Best viewed in color

3.1 Data augmentation methods

Data augmentation methods only change the loss function and leave the model structure untouched. They use sampling to approximate the integral in Eq. 3. Thus, the loss \(\mathcal {L}_G\) is constructed as

$$\begin{aligned} \mathcal {L}_G=\sum _{i,\hat{g}}L\left( f\left( \hat{g}\cdot x_i\right) ,y_i\right) , \end{aligned}$$
(4)

where \(\hat{g}\) is sampled from G.
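A minimal sketch of this loss for batched point clouds is shown below; `model` and `loss_fn` are hypothetical stand-ins for f and L, and the rotations are drawn uniformly from SO(3).

```python
import torch
from scipy.spatial.transform import Rotation


def augmented_loss(model, loss_fn, points, labels, n_aug=4):
    """Eq. 4 with sampled rotations; points: (B, N, 3), labels: (B, ...)."""
    total = 0.0
    for _ in range(n_aug):
        # One random rotation matrix per sample in the batch, shape (B, 3, 3).
        R = torch.as_tensor(Rotation.random(len(points)).as_matrix(),
                            dtype=points.dtype)
        total = total + loss_fn(model(points @ R.transpose(1, 2)), labels)
    return total / n_aug
```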

Many methods use data augmentation, and only some representative ones are listed here. Kajita et al. (2017) exploit rotated replicas to increase the classification accuracy. Zhuang et al. (2019); Zhu et al. (2020) leverage a Rubik’s cube recovery task with permutation and random rotation to learn invariant features from medical images. Choy et al. (2019); Bai et al. (2020) observe that fully convolutional neural networks (CNNs) can gain rotation invariance through data augmentation. Zhou et al. (2022a) utilize random rotations to learn invariant representations for point cloud generation. Bergmann and Sattlegger (2023) apply rotation augmentation on anomaly-free training samples for 3D anomaly detection.

Although data augmentation methods can enhance rotation robustness (Kajita et al. 2017; Choy et al. 2019; Bai et al. 2020), they have severe limitations. Data augmentation generally introduces a heavy training burden; for example, Kajita et al. (2017) use 30 times as much rotated data to achieve a significant improvement on rotation invariant descriptors. Besides, data augmentation methods cannot guarantee invariance under arbitrary rotations, because Eq. 4 cannot minimize the loss on unseen rotations. In practice, data augmentation is often integrated into other rotation invariant methods as an auxiliary component.

3.2 Multi-view methods

Unlike data augmentation methods, multi-view methods attain rotation invariance by modifying the model instead of the loss function. In multi-view methods, the model \(f:\mathcal {X}\rightarrow \mathcal {Y}\) is built as

$$\begin{aligned} f\left( x\right) =\sum _{\hat{g}_j\in \widehat{G}}w_jf_\text {b}\left( \hat{g}_j\cdot x\right) , \end{aligned}$$
(5)

where \(\widehat{G}\) is a finite subset of G, \(f_\text {b}:\mathcal {X}\rightarrow \mathcal {Y}\) is the base model, and \(w_j>0,\sum _jw_j=1\). The metric d is generally convex, so f has a G-variant error no higher than that of \(f_\text {b}\), as Eq. 6 shows, meaning f is more invariant. A simple yet effective approach is to choose \(\widehat{G}\) as a finite subgroup of G with \(w_j=\frac{1}{\left| \widehat{G}\right| }\); then f is strongly invariant with respect to \(\widehat{G}\).

$$\begin{aligned} \mathcal {E}\left( f\right) =&\int _{\mathcal {X}}\int _{G}d\left( \sum _{j}w_jf_\text {b}\left( \hat{g}_jg\cdot x\right) ,\sum _{j}w_jf_\text {b}\left( \hat{g}_j\cdot x\right) \right) \text {d}\mu \left( g\right) \text {d}x\nonumber \\ \le&\int _{\mathcal {X}}\int _{G}\sum _{j}w_jd\left( f_\text {b}\left( \hat{g}_jg\cdot x\right) ,f_\text {b}\left( \hat{g}_j\cdot x\right) \right) \text {d}\mu \left( g\right) \text {d}x\nonumber \\ =&\sum _{j}w_j\int _{\mathcal {X}}\int _{G}d\left( f_\text {b}\left( \hat{g}_jg\hat{g}_j^{-1}\cdot x\right) ,f_\text {b}\left( x\right) \right) \text {d}\mu \left( g\right) \text {d}x\nonumber \\ =&\int _{\mathcal {X}}\int _{G}d\left( f_\text {b}\left( g\cdot x\right) ,f_\text {b}\left( x\right) \right) \text {d}\mu \left( g\right) \text {d}x=\mathcal {E}\left( f_\text {b}\right) . \end{aligned}$$
(6)
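As a concrete illustration of Eq. 5, the following sketch averages a hypothetical base model over the cyclic group \(C_4\) around the z-axis with uniform weights; the composite model is then strongly invariant with respect to \(C_4\) (though not to arbitrary rotations).

```python
import numpy as np
from scipy.spatial.transform import Rotation

# C_4: rotations by 0, 90, 180, 270 degrees around the z-axis.
G_hat = [Rotation.from_euler("z", a, degrees=True).as_matrix()
         for a in (0, 90, 180, 270)]


def averaged_model(base_model, x):
    """f(x) = (1/|G_hat|) * sum_j f_b(g_j . x), Eq. 5 with uniform weights."""
    return np.mean([base_model(x @ R.T) for R in G_hat], axis=0)
```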

As CNNs have become a powerful tool for images (Krizhevsky et al. 2012; Simonyan and Zisserman 2014; Szegedy et al. 2015; He et al. 2016), researchers exploit multi-view images to extract features from 3D shapes. Most multi-view methods take images as input, while some later methods also process point clouds and voxels. Although these methods are primarily designed as 3D feature extractors, they can improve rotation invariance and are chosen as baselines by related work (Esteves et al. 2018a; Rao et al. 2019; Zhang et al. 2019a) (Fig. 3). MVCNN (Su et al. 2015) is a pioneering method showing that a fixed set of rendered views is highly informative for 3D shape recognition. VoxNet (Maturana and Scherer 2015) pools multi-view voxel features and achieves 2D rotation invariance around the z-axis. Qi et al. (2016) introduce multi-resolution filtering for multi-view CNNs and improve the classification accuracy. Cao et al. (2017) propose spherical projections to collect depth variations and contour information of different views for better performances. Zhang et al. (2018) apply a PointNet-like (Qi et al. 2017a) method on multi-view 2.5D point clouds to fuse information from all views. View-GCN++ (Wei et al. 2022) exploits rotation robust view-sampling to deal with rotation sensitivity. Besides, some methods replace the weighted average in Eq. 5 with pooling/fusion modules to enhance effectiveness and efficiency (Wang et al. 2017; Roveri et al. 2018; Yu et al. 2018; Wei et al. 2020; Li et al. 2020; Chen and Chen 2022). These modifications do not necessarily improve the invariance, so we omit them here.

Fig. 3

A pipeline of multi-view methods. The 3D input is first rendered/sampled into multi-view data, processed by non-invariant DNNs, and finally pooled for downstream tasks

Most multi-view methods take images as the input, so they can handle 3D rotation invariance with powerful 2D models (Su et al. 2015; Qi et al. 2016; Cao et al. 2017). Nonetheless, they incur a heavy computational burden, making training and inference inefficient. As Eq. 5 shows, the computational burden of f is at least \(\left| \widehat{G}\right|\) times that of \(f_\text {b}\); for instance, \(\left| \widehat{G}\right|\) is 12 or 80 in MVCNN (Su et al. 2015). In addition, most existing multi-view methods are only weakly rotation invariant: their base models \(f_\text {b}\), such as 2D CNNs (Su et al. 2015; Qi et al. 2016; Wei et al. 2022) and non-invariant 3D networks (Zhang et al. 2018), are not strongly rotation invariant, so the composite models f do not possess strong invariance.

3.3 Ringlike and cylindrical methods

Under some circumstances, it is straightforward to identify a principal axis, so the 3D rotation invariance degenerates into a 2D one. These methods organize data in rings or cylinders for further processing.

The principal axis is either selected from the x, y, z axes or computed using specific algorithms. DeepPano (Shi et al. 2015) takes the z-axis as the principal axis and creates a panoramic view through a cylinder projection. A max-pooling layer is appended for invariance. Moon et al. (2018) extend 2D CNNs working on panoramic views to 3D CNNs working on cylindrical occupancy grids and obtain better performances. Cylindrical Transformer Networks (Esteves et al. 2018b) transform raw coordinates to cylindrical ones using the predicted z-axis. As the 3D convolutions acting on cylindrical coordinates are translation invariant, the final representations are rotation invariant. Many methods take this pipeline with slight modifications (Sun et al. 2019a; Ao et al. 2021; Fan et al. 2021; Xu et al. 2021c; Li et al. 2022b; Zhao et al. 2022b; Ao et al. 2023a).

Ringlike and cylindrical methods are compelling in applications like place recognition (Sun et al. 2019a; Li et al. 2022b) and registration (Ao et al. 2021; Zhao et al. 2022b). Nevertheless, their application scope is limited. They can only handle problems where the principal axes can be identified, or the inputs can fit into rings and cylinders.

3.4 Transformation methods

Transformation methods address rotation invariance through a transformation function \(t:\mathcal {X}\rightarrow \text {Aut}\left( \mathcal {X}\right)\), where \(\text {Aut}\left( \mathcal {X}\right)\) is the automorphism group of \(\mathcal {X}\). In transformation methods, the model \(f:\mathcal {X}\rightarrow \mathcal {Y}\) is given as

$$\begin{aligned} f\left( x\right) =f_\text {b}\left( t\left( x\right) \cdot x\right) , \end{aligned}$$
(7)

where \(f_\text {b}:\mathcal {X}\rightarrow \mathcal {Y}\) is the base model. If t satisfies the invariance condition, i.e.,

$$\begin{aligned} \forall x\in \mathcal {X},\forall g\in G,t\left( x\right) =t\left( g\cdot x\right) g, \end{aligned}$$
(8)

then f is strongly rotation invariant. However, t does not satisfy this condition in most methods, so f is only weakly rotation invariant. These methods are usually designed for coordinate inputs like point clouds.

Spatial Transformer Networks (STNs) (Jaderberg et al. 2015) are widely used for spatial invariance in image processing. In the 3D domain, PointNet (Qi et al. 2017a) proposes joint alignment networks, i.e., T-Net, for rotation robustness, as shown in Fig. 4. T-Net is a mini-PointNet regressing the transformation matrix directly. To make the matrix \(\varvec{R}\in \mathbb {R}^{3\times 3}\) orthogonal, a regularization term \(L_\text {reg}=\left\| \varvec{I}-\varvec{RR}^T\right\|\) is appended. There is no clear disparity between STNs and T-Nets in the 3D domain, so they are not distinguished in this survey. T-Net is widely adopted with the spread of PointNet-like methods (Qi et al. 2017a, b; Wang et al. 2019b). SHPR-Net (Chen et al. 2018b) employs two T-Nets to connect poses in the original and canonical spaces. PVNet (You et al. 2018) applies the EdgeConv (Wang et al. 2019b) as the basic blocks of T-Net to better capture local information. Zhang et al. (2018) put raw point clouds and multi-view features into T-Net to robustify the model. In addition, many other methods also include T-Net in their models for the effectiveness and stability in different downstream tasks (Joseph-Rivlin et al. 2019; Chen et al. 2019a; Liu et al. 2019c; Zhang et al. 2020a; Yu et al. 2020b; Wang et al. 2021; Poiesi and Boscaini 2021; Hegde and Gangisetty 2021; Liu et al. 2022c; Zhu et al. 2022a).

Fig. 4
figure 4

T-Net (Qi et al. 2017a) directly regresses rotation matrices from coordinates. B and N refer to the batch size and the number of points, respectively. The numbers after MLP are internal layer sizes
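The following PyTorch sketch illustrates a T-Net-style module and the regularization term \(L_\text {reg}\); the layer sizes follow the common PointNet configuration but are otherwise illustrative, and this is a sketch rather than the authors' reference implementation.

```python
import torch
import torch.nn as nn


class TNet(nn.Module):
    """T-Net-style transformation network for (B, N, 3) point clouds."""

    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                 nn.Linear(64, 128), nn.ReLU(),
                                 nn.Linear(128, 1024), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(),
                                  nn.Linear(512, 256), nn.ReLU(),
                                  nn.Linear(256, 9))

    def forward(self, x):                        # x: (B, N, 3)
        feat = self.mlp(x).max(dim=1).values     # shared MLP + max pooling
        R = self.head(feat).view(-1, 3, 3)
        R = R + torch.eye(3, device=x.device)    # bias the output towards identity
        return x @ R.transpose(1, 2), R


def reg_loss(R):
    """L_reg = ||I - R R^T||, encouraging (not guaranteeing) orthogonality."""
    I = torch.eye(3, device=R.device).expand_as(R)
    return torch.linalg.norm(I - R @ R.transpose(1, 2), dim=(1, 2)).mean()
```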

Besides rotation matrices, some methods utilize other rotation representations. IT-Net (Yuan et al. 2018) simultaneously canonicalizes rotation and translation through the quaternion representation. PCPNet (Guerrero et al. 2018) and SCT (Liu et al. 2022a) regress unit quaternions for pose canonicalization and point cloud recognition, respectively. Poiesi and Boscaini (2023) learn a quaternion transformation network to refine the estimated LRF. RotPredictor (Fang et al. 2020) applies PointConv (Wu et al. 2019) to regress Euler angles, and RTN (Deng et al. 2021b) predicts discrete Euler angles. C3DPO (Novotny et al. 2019) divides the shape into view-specific pose parameters and a view-invariant shape basis. PaRot (Zhang et al. 2023a) also disentangles invariant features from equivariant poses via an equivariance loss. Wang et al. (2022c) formulate the rotation invariant learning problem as the minimization of an energy function, solved with an iterative strategy. Some methods are embedded in self-supervised learning frameworks: Zhou et al. (2022b) and Mei et al. (2023) enforce the consistency of canonical poses with a rotation equivariance loss; Sun et al. (2021) utilize Capsule Networks (Hinton et al. 2011) with a canonicalization loss for object-centric reasoning; Kim et al. (2022) introduce a self-supervised learning framework to predict canonical axes of point clouds using the icosahedral group. Currently, only a few methods are strongly rotation invariant. LGANet (Gu et al. 2021b) and ELGANet (Gu et al. 2022) exploit graph convolutional networks (GCNs) to process rotation invariant distances and angles, where the outputs are orthogonalized into rotation matrices. Katzir et al. (2022) employ equivariant networks to learn canonical poses of point clouds. RIP-NeRF (Wang et al. 2023c) transforms raw coordinates into invariant ones for fine-grained editing. EIPs (Fei and Deng 2024) disentangle rotation invariance from point cloud processing with efficient invariant poses.

Benefiting from their straightforward idea, transformation methods are extensively used in many applications (Liu et al. 2019c; Guerrero et al. 2018; Zhu et al. 2022a). Notwithstanding, the invariance condition is often ignored, especially by works using T-Nets (Qi et al. 2017a; Joseph-Rivlin et al. 2019; Poiesi and Boscaini 2021); in that case, the transformation functions do not contribute to rotation invariance. Besides, some methods cannot output proper rotation representations. For example, T-Net (Qi et al. 2017a) cannot guarantee proper rotation matrices, even with the regularization term; 3D shapes are then inevitably distorted, and some structural information may be lost. Moreover, heavy data augmentation is sometimes required for good performance. Le (2021) shows that T-Net needs a large amount of data augmentation to learn a stable transformation policy.

3.5 Invariant value methods

Invariant value methods achieve rotation invariance through constructing invariant values from coordinate inputs. Here, invariant values include distances, inner products, and angles:

$$\begin{aligned} \left\| \varvec{u}_i\right\| \ \left( \text {distance}\right) ,\ \varvec{u}_i\cdot \varvec{u}_j\ \left( \text {inner product}\right) ,\ \angle \left( \varvec{u}_i,\varvec{u}_j\right) \ \left( \text {angle}\right) , \end{aligned}$$
(9)

where \(\left\{ \varvec{u}_i\right\} \subset \mathbb {R}^3\) is a nonzero geometric vector set. Based on these invariant values, the model \(f:\mathcal {X}\rightarrow \mathcal {Y}\) is generally set up as

$$\begin{aligned} f\left( x\right) =f_\text {b}\left( f_\text {i}\left( x\right) \right) , \end{aligned}$$
(10)

where \(f_\text {i}:\mathcal {X}\rightarrow \mathcal {Z}\) uses handcrafted rules to compute invariant values, and \(f_\text {b}:\mathcal {Z}\rightarrow \mathcal {Y}\) is the base model. Clearly, f is strongly rotation invariant. In the following discussions, \(\left\{ \varvec{x}_i\right\}\) represents a point cloud. \(\varvec{x}_{ij},\varvec{n}_{ij},\left( j=1,\cdots ,k\right)\) denote the positional and normal vectors of \(\varvec{x}_i\)’s kNN, respectively. \(\varvec{m}_i\) is the barycenter of \(\mathcal {N}\left( \varvec{x}_i\right)\). We use several operators to simplify the notation: normalize (N), orthogonalize (O), and orthonormalize (NO).

$$\begin{aligned} \text {N}\left( \varvec{x}\right) =\frac{\varvec{x}}{\left\| \varvec{x}\right\| },\text {O}\left( \varvec{x},\varvec{y}\right) =\varvec{y}-\left( \varvec{y}\cdot \text {N}\left( \varvec{x}\right) \right) \text {N}\left( \varvec{x}\right) ,\text {NO}\left( \varvec{x},\varvec{y}\right) =\text {N}\left( \text {O}\left( \varvec{x},\varvec{y}\right) \right) . \end{aligned}$$
(11)
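These operators can be sketched in a few lines of numpy, assuming nonzero and non-collinear inputs (the singular cases are discussed in Sect. 3.5.6).

```python
import numpy as np


def normalize(x):                  # N(x)
    return x / np.linalg.norm(x)


def orthogonalize(x, y):           # O(x, y): remove the component of y along x
    n = normalize(x)
    return y - (y @ n) * n


def orthonormalize(x, y):          # NO(x, y) = N(O(x, y))
    return normalize(orthogonalize(x, y))
```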

As \(f_\text {b}\) is usually a deep point cloud model with slight modification, the handcrafted rules in \(f_\text {i}\) are the core of invariant value methods. We divide existing methods into several groups according to the form of invariant values.

Fig. 5

Representative invariant values from a local values, b LRF-based values, c PPF-based values, and d global values. The solid lines are invariant values or necessary components of invariant values

3.5.1 Local values

Table 3 Some representative local values

Many methods generate invariant values in the local neighborhoods. ClusterNet (Chen et al. 2019b) introduces rigorously rotation invariant (RRI) mappings based on a kNN graph as

$$\begin{aligned} \text {RRI}\left( \varvec{x}_i,\left\{ \varvec{x}_{ij}\right\} _{j=1}^k\right) =&\left[ \left\| \varvec{x}_i\right\| ,\left\{ \left( \left\| \varvec{x}_{ij}\right\| ,\angle \left( \varvec{x}_i,\varvec{x}_{ij}\right) ,\phi _{ij}\right) \right\} _{j=1}^k\right] ,\\ \text {where }\phi _{ij}=&\min \left\{ \text {atan2}\left( a_{ijt},b_{ijt}\right) \vert 1\le t\le k,t\ne j,a_{ijt}\ge 0\right\} ,\nonumber \\ a_{ijt}=&\left( \text {NO}\left( \varvec{x}_i,\varvec{x}_{ij}\right) \times \text {NO}\left( \varvec{x}_i,\varvec{x}_{it}\right) \right) \cdot \text {N}\left( \varvec{x}_i\right) ,\nonumber \\ b_{ijt}=&\ \text {NO}\left( \varvec{x}_i,\varvec{x}_{ij}\right) \cdot \text {NO}\left( \varvec{x}_i,\varvec{x}_{it}\right) .\nonumber \end{aligned}$$
(12)

ClusterNet applies a hierarchical structure to aggregate features. Although all geometric information is retained, it mainly considers global information, weakening its capability to describe local structures. RIConv (Zhang et al. 2019b) addresses this issue by extracting local rotation invariant features (RIFs, Fig. 5a) via relative distances and angles as

$$\begin{aligned} \text {RIF}\left( \varvec{x}_{ij}\right) =\left[ \left\| \varvec{d}_{ij}^{\left( 0\right) }\right\| ,\left\| \varvec{d}_{ij}^{\left( 1\right) }\right\| ,\angle \left( \varvec{d}_{ij}^{\left( 0\right) },\varvec{d}_i^{\left( 0\right) }\right) ,\angle \left( \varvec{d}_{ij}^{\left( 1\right) },-\varvec{d}_i^{\left( 0\right) }\right) \right] , \end{aligned}$$
(13)

where \(\varvec{d}_{ij}^{\left( 0\right) }=\varvec{x}_{ij}-\varvec{x}_i,\varvec{d}_{ij}^{\left( 1\right) }=\varvec{x}_{ij}-\varvec{m}_i,\varvec{d}_i^{\left( 0\right) }=\varvec{m}_i-\varvec{x}_i\). It applies a multi-layer perceptron (MLP) to generate final features. RIF has been widely adopted by many works (Chou et al. 2021; Zhang et al. 2022; Wang and Rosen 2023; Fan et al. 2023).
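For illustration, a minimal numpy sketch of Eq. 13 for a single neighbor could read as follows; the angle helper follows Eq. 9, and degenerate cases such as \(\varvec{d}_i^{\left( 0\right) }=\varvec{0}\) are not handled.

```python
import numpy as np


def angle(u, v):
    """Angle between two nonzero vectors, clipped for numerical safety."""
    return np.arccos(np.clip(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)),
                             -1.0, 1.0))


def rif(x_i, x_ij, m_i):
    """RIF of Eq. 13 for reference point x_i, neighbor x_ij, barycenter m_i."""
    d0_ij = x_ij - x_i                 # d^(0)_ij
    d1_ij = x_ij - m_i                 # d^(1)_ij
    d0_i = m_i - x_i                   # d^(0)_i
    return np.array([np.linalg.norm(d0_ij), np.linalg.norm(d1_ij),
                     angle(d0_ij, d0_i), angle(d1_ij, -d0_i)])
```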

Later work mainly adds more reference points and invariant values to improve performances. Some representative invariant values are collected in Table 3. Readers may refer to the original papers for details.

3.5.2 LRF-based values

LRF-based values are special cases of local values. Specifically, if three orthogonal axes \(\varvec{e}_1,\varvec{e}_2,\varvec{e}_3\) can be determined in \(\mathcal {N}\left( \varvec{x}_i\right)\), then \(\varvec{x}_{ij}\cdot \varvec{e}_1,\varvec{x}_{ij}\cdot \varvec{e}_2,\varvec{x}_{ij}\cdot \varvec{e}_3\) are relative coordinates in this LRF. LRFs are adopted in many handcrafted 3D descriptors, like FPFH (Rusu et al. 2009), SHOT (Tombari et al. 2010), and RoPS (Guo et al. 2013). It should be noted that methods only using principal component analysis (PCA) to define LRFs are discussed separately in Sect. 3.6 instead of here. We divide these methods according to the number of LRFs in each neighborhood.

Some methods define a unique LRF in each neighborhood. Usually, the normal vector is selected as one axis, a normalized weighted average vector is selected as another, and their cross product is chosen as the final axis. We summarize these methods in Table 4.

Table 4 Different LRFs adopted by LRF-based values with one LRF

Besides, there are also methods with multiple LRFs in each neighborhood. A common choice of LRF is the Darboux frame defined as

$$\begin{aligned} \varvec{e}_x=\varvec{n}_i,\varvec{e}_y\left( \varvec{x}_{ij}\right) =\text {N}\left( \varvec{d}_{ij}^{\left( 0\right) }\times \varvec{e}_x\right) ,\varvec{e}_z\left( \varvec{x}_{ij}\right) =\varvec{e}_x\times \varvec{e}_y\left( \varvec{x}_{ij}\right) , \end{aligned}$$
(14)

where \(\varvec{e}_y\) and \(\varvec{e}_z\) depend on not only \(\varvec{x}_i\) but also \(\varvec{x}_{ij}\). CRIN (Lou et al. 2023) proposes another LRF by considering the original space basis. Some representative invariant values are listed in Table 5.

Table 5 Some representative invariant values with multiple LRFs

3.5.3 PPF-based values

PPFs (Drost et al. 2010) were initially proposed for 3D object recognition; they describe the relative information between two points \(\varvec{x}_1,\varvec{x}_2\) as

$$\begin{aligned} \text {PPF}\left( \varvec{x}_1,\varvec{x}_2\right) =\left[ \left\| \varvec{d}_{12}\right\| ,\angle \left( \varvec{n}_1,\varvec{d}_{12}\right) ,\angle \left( \varvec{n}_2,\varvec{d}_{12}\right) ,\angle \left( \varvec{n}_1,\varvec{n}_2\right) \right] , \end{aligned}$$
(15)

where \(\varvec{d}_{ij}=\varvec{x}_i-\varvec{x}_j\), as Fig. 5c shows. PPFs are strongly rotation invariant, making them suitable for invariant feature extraction.
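A minimal numpy sketch of Eq. 15 for two oriented points might look as follows.

```python
import numpy as np


def _angle(u, v):
    return np.arccos(np.clip(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)),
                             -1.0, 1.0))


def ppf(x1, n1, x2, n2):
    """PPF of Eq. 15 for points x1, x2 with unit normals n1, n2."""
    d12 = x1 - x2
    return np.array([np.linalg.norm(d12),
                     _angle(n1, d12), _angle(n2, d12), _angle(n1, n2)])
```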

PPFNet (Deng et al. 2018b) concatenates PPFs with coordinates and normals to improve the robustness of 3D point matching. PPF-FoldNet (Deng et al. 2018a) combines PPFNet with FoldingNet (Yang et al. 2018) to learn invariant descriptors, using only PPFs as input features. Bobkov et al. (2018) slightly modify and apply the PPFs to classification and retrieval. GMCNet (Pan et al. 2021) combines RRI (Chen et al. 2019b) and PPFs for rigorous partial point cloud registration. Using hypergraphs, Triangle-Net (Xiao and Wachs 2021) extends PPFs to three points (triangles). PaRI-Conv (Chen and Cong 2022) augments PPFs with two azimuth angles and uses them to synthesize pose-aware dynamic kernels. PPFs have been widely employed in rotation invariant point cloud matching and registration (Zhao et al. 2021; Yu et al. 2023; Zhang et al. 2023c).

3.5.4 Global values

Some methods do not require local neighborhoods to evaluate invariant values. SRINet (Sun et al. 2019b) defines point projection mapping (PPM, Fig. 5d) through projecting \(\varvec{x}_i\) on three axes \(\varvec{a}_1,\varvec{a}_2,\varvec{a}_3\) as

$$\begin{aligned} \text {PPM}\left( \varvec{x}_i\right) =\Big [\cos \angle \left( \varvec{a}_1,\varvec{x}_i\right) ,\cos \angle \left( \varvec{a}_2,\varvec{x}_i\right) ,\cos \angle \left( \varvec{a}_3,\varvec{x}_i\right) ,\left\| \varvec{x}_i\right\| \Big ], \end{aligned}$$
(16)

where \(\varvec{a}_1=\mathop {\arg \max }_{\varvec{x}\in \left\{ \varvec{x}_i\right\} }\left\| \varvec{x}\right\| , \varvec{a}_2=\mathop {\arg \min }_{\varvec{x}\in \left\{ \varvec{x}_i\right\} }\left\| \varvec{x}\right\| , \varvec{a}_3=\varvec{a}_1\times \varvec{a}_2\). Based on SRINet, Tao et al. (2021) add attention modules, and SCT (Liu et al. 2022a) adds a quaternion T-Net for better performances. Sun et al. (2023) apply SRINet on non-rigid point clouds. Some works (Xu et al. 2021b; Qin et al. 2023a) employ the sorted Gram matrix as invariant values. The Gram matrix for \(\left\{ \varvec{x}_i\right\} _1^N\) is computed as \(\left( \varvec{x}_i\cdot \varvec{x}_j\right) _{N\times N}\), each row of which is then sorted and fed into point-based networks for permutation and rotation invariance.
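A sketch of the sorted Gram matrix construction is given below (`sorted_gram` is a hypothetical helper name): inner products are rotation invariant, and sorting each row removes its dependence on the ordering of the other points; permutation invariance across rows is left to the subsequent point-based network.

```python
import numpy as np


def sorted_gram(points):           # points: (N, 3)
    """Row-wise sorted Gram matrix (x_i . x_j)_{N x N}."""
    return np.sort(points @ points.T, axis=1)
```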

3.5.5 Others

In addition to the above invariant values, other values that are hard to classify are listed here. SchNet (Schütt et al. 2017, 2018) gains rotation invariance through interatomic distances. SkeletonNet (Ke et al. 2017) uses angles and ratios between distances as invariant features for human skeletons. Liu et al. (2018) leverage relative distances for global point cloud registration. 3DTI-Net (Pan et al. 2019) utilizes a translation invariant graph filter kernel and employs the norms as invariant features. 3DMol-Net (Li et al. 2021a) extends it to molecular applications. RISA-Net (Fu et al. 2020) employs edge lengths and dihedral angles for 3D retrieval tasks. RMGNet (Furuya et al. 2020) feeds several handcrafted descriptors into GCNs for point cloud segmentation. GS-Net (Xu et al. 2020) uses eigenvalue decomposition on local distance graphs and exploits the eigenvalues as invariant features. SN-Graph (Zhang et al. 2021b) leverages 15 cosine values, 7 distances, and 7 radii as invariant values. TinvNN (Zhang et al. 2021c) performs eigenvalue decomposition on zero-centered distance matrices to obtain invariant features. ComENet (Wang et al. 2022b) exploits several rotation angles for global completeness. DuEqNet (Wang et al. 2023b) builds equivariant networks through relative distances for object detection. SGPCR (Salihu and Steinbach 2023) explores the rotation invariant convolution between two spherical Gaussians for object registration and retrieval. RadarGNN (Fent et al. 2023) employs rotation invariant bounding boxes and representations for radar-based perception. GeoTransformer (Qin et al. 2023b) further applies sinusoidal embeddings of distances and angles for robust registration.

3.5.6 Discussion

Unlike the methods above, invariant value methods are strongly rotation invariant, and their superiority has been demonstrated with many experiments (Xu et al. 2021b; Chen and Cong 2022; Sahin et al. 2022; Wang et al. 2023b). Nevertheless, there are still several concerns.

Singularity Almost every method has singularities that make invariant values meaningless, including coincident points (e.g., \(\varvec{x}_i=\varvec{m}_i\Rightarrow \varvec{d}_i^{\left( 0\right) }=\varvec{0}\) leads to undefined angles in RIConv (Zhang et al. 2019b)), collinear vectors (e.g., if the cross products in Cao et al. (2021); Chen and Cong (2022); Sahin et al. (2022) give zero output, their LRFs are not properly defined), and nonunique candidate values (e.g., if two or more points attain \(\mathop {\arg \max }_{\varvec{x}\in \left\{ \varvec{x}_i\right\} }\left\| \varvec{x}\right\|\), then \(\varvec{a}_1\) in SRINet (Sun et al. 2019b) is not determined).

Irreversibility For \(f_\text {i}:\mathcal {X}\rightarrow \mathcal {Z}\), if there exists \(f_\text {ri}:\mathcal {Z}\rightarrow \mathcal {X}\) satisfying

$$\begin{aligned} \forall x\in \mathcal {X},\exists \ g_x\in G,f_\text {ri}\left( f_\text {i}\left( x\right) \right) =g_x\cdot x, \end{aligned}$$
(17)

then \(f_\text {i}\) is reversible. Some irreversible invariant values may lose certain structural information, harming downstream task performances (Zhang et al. 2019b; Sun et al. 2019b).

Discontinuity The base model \(f_\text {b}\) is generally a continuous deep model. So if \(f_\text {i}\) is discontinuous at \(x_0\), then the model f may also be discontinuous at \(x_0\), making it hard to train with gradient-based optimization algorithms. For example, \(f_\text {i}\) in SRINet (Sun et al. 2019b) is discontinuous on point clouds whose two longest vectors are close, since it needs them to define axes.

Reflection Distances, inner products, and angles are invariant to rotations and reflections. Thus, almost all methods without cross products cannot distinguish rotations from reflections (Drost et al. 2010; Zhang et al. 2019b; Xu et al. 2021b).

3.6 PCA-based methods

PCA-based methods construct the model similarly to transformation methods, but the transformation function is an unlearnable PCA alignment, as Algorithm 1 shows. \(\varvec{X}\) is usually zero-centered to mitigate the influence of translations, and \(\varvec{\Sigma }\) is called the covariance matrix. PCA alignment can guarantee rotation invariance. For \(\varvec{X}_R=\varvec{XR}\ \left( \varvec{RR}^T=\varvec{I}\right)\), if

$$\begin{aligned} \varvec{\Sigma }_R=\varvec{X}_R^T\varvec{X}_R=\varvec{R}^T\varvec{\Sigma R}=\left( \varvec{R}^T\varvec{V}\right) \varvec{\Lambda }\left( \varvec{R}^T\varvec{V}\right) ^T\Rightarrow \varvec{V}_R=\varvec{R}^T\varvec{V}, \end{aligned}$$
(18)

then \(\varvec{Z}_R=\varvec{X}_R\varvec{V}_R=\varvec{XV}=\varvec{Z}\). There are two conditions for Eq. 18. First, the eigenvalues must be distinct, i.e., \(\lambda _1>\lambda _2>\lambda _3\). As it is rare for two or three eigenvalues to be equal, almost all methods assume this to be true. Second, the signs of all columns of \(\varvec{V}\) must be identified uniquely. If \(\varvec{V}=\left[ \varvec{v}_1,\varvec{v}_2,\varvec{v}_3\right]\) satisfies the eigendecomposition \(\varvec{\Sigma }=\varvec{V\Lambda V}^T\), then \(\varvec{V}\text {diag}\left( \varvec{s}\right) =\left[ s_1\varvec{v}_1,s_2\varvec{v}_2,s_3\varvec{v}_3\right] \left( \varvec{s}=\left[ s_1,s_2,s_3\right] ^T\in \left\{ -1,1\right\} ^3\right)\) also satisfies it. Some works substitute PCA with eigenvalue decomposition or singular value decomposition (SVD), but there is no substantial difference; in SVD, another matrix \(\varvec{U}=\varvec{XV}\sqrt{\varvec{\Lambda }}^{-1}\in \mathbb {R}^{N\times 3}\) is introduced. In this section, PCA-based methods are classified according to how the ambiguity of signs is handled (Fig. 6).

Algorithm 1

PCA Alignment
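Since the algorithm is compact, we include a minimal numpy sketch; the sign rule used here (flipping \(\varvec{v}_k\) so that most projections are nonnegative) is only one illustrative instance of the handcrafted rules collected in Table 6.

```python
import numpy as np


def pca_align(X):
    """PCA alignment of a zero-centered point cloud X: (N, 3); returns Z = XV."""
    Sigma = X.T @ X                            # covariance matrix (up to scale)
    _, V = np.linalg.eigh(Sigma)               # eigenvalues in ascending order
    V = V[:, ::-1]                             # reorder so lambda_1 > lambda_2 > lambda_3
    for k in range(3):                         # disambiguate the sign of each axis
        if np.sum(X @ V[:, k] >= 0) < len(X) / 2:
            V[:, k] = -V[:, k]
    return X @ V                               # aligned (canonical-pose) point cloud
```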

Most methods disambiguate signs through handcrafted rules, which generally involve dot products between \(\varvec{v}_k\) and other vectors. If \(\varvec{v}_k\rightarrow -\varvec{v}_k\Rightarrow s_k\rightarrow -s_k\), then \(s_k\varvec{v}_k\) remains the same. Some representative rules are listed in Table 6.

Table 6 Different disambiguation rules adopted by PCA-based methods. \(k=1,2,3\) unless otherwise specified

Some methods consider combinations of signs instead of just choosing one. Xiao et al. (2020) fuse all combinations through a self-attention module. OrthographicNet (Kasaei 2021) transforms raw points into canonical poses and generates several projection views for 3D object recognition. MolNet-3D (Liu et al. 2022c) averages the results from 4 poses to predict molecular properties. Puny et al. (2022) convert the group averaging operation to subset averaging with frames, where 4 and 8 frames are exploited for SO(3) and O(3), respectively. Li et al. (2023a) apply this approach to 3D planar reflective symmetry detection.

Some works utilize pose selectors to produce one pose from multiple candidates. PR-invNet (Yu et al. 2020a) augments 8 poses with discrete rotation groups and utilizes the pose selector to choose the final pose. Li et al. (2021b) investigate the inherent ambiguity of PCA alignment. They argue that the order of \(\varvec{e}_x,\varvec{e}_y,\varvec{e}_z\) is also ambiguous, so there are \(4\ \left( \text {sign}\right) \times 6\ \left( \text {order}\right) =24\) ambiguities in total. All poses are fused through a pose selector to create an optimal one. Besides coordinates, some works apply PCA on network weights (Xie et al. 2023) and the convex hull (Pop et al. 2023).

Fig. 6

A pipeline of PCA-based methods. Several pose candidates are first generated from the 3D input, then they are either disambiguated using handcrafted rules/pose selectors or fused together

PCA-based methods are effective and possess intrinsic strong rotation invariance. Furthermore, they are often combined with invariant value methods for better performance (Yu et al. 2020a; Zhao et al. 2022a; Chen and Cong 2022). However, sign disambiguation may introduce problems like the singularity and discontinuity discussed in Sect. 3.5.6 (Zhang et al. 2020c; Fan et al. 2020; Gandikota et al. 2021), while considering all combinations increases the computational burden (Xiao et al. 2020; Kasaei 2021). Besides, PCA-based methods are fragile to inputs with close eigenvalues, since the corresponding eigenvectors are numerically unstable, an inherent problem of eigenvalue decomposition.

3.7 Summary

In summary, different methods obtain rotation invariance in distinctive ways. Most rotation invariant methods are applied to general 3D understanding. We compare their differences in Table 7. Considering this, we summarize several characteristics of existing rotation invariant methods.

  • Data augmentation is always integrated with other methods, especially weakly rotation invariant ones (Fang et al. 2020; Deng et al. 2021b; Le 2021), to improve their invariance.

  • Multi-view methods work best with image inputs and offer no advantage on coordinate inputs, since they are weakly invariant and usually introduce heavy computational burdens (Su et al. 2015; Qi et al. 2016; Zhang et al. 2018).

  • Ringlike and cylindrical methods are the best choices in tasks like place recognition (Sun et al. 2019a; Li et al. 2022b), as achieving 2D invariance is simpler than 3D.

  • Weakly rotation invariant transformation methods are less recommended. They can be replaced by PCA-based methods that have strong invariance and excellent performances.

  • To date, strong invariance is only achieved by applying invariant value methods and PCA-based methods to coordinate inputs like point clouds and meshes.

Table 7 Comparisons of different rotation invariant methods

4 Rotation equivariant methods

Most of the rotation equivariant methods are equivariant networks on rotation groups. There are already surveys on geometrically equivariant graph neural networks (Han et al. 2022; Zhang et al. 2023b), categorizing them according to the way of message passing and aggregation. We devise a slightly different taxonomy to cover more related methods. Some milestone methods are listed in Fig. 7.

Fig. 7

Milestones of rotation equivariant methods. Best viewed in color

4.1 G-CNNs

Group equivariant convolutional neural networks (G-CNNs) were first proposed to address 2D rotations in images (Cohen and Welling 2016), and they can be extended to 3D rotations directly. The group convolution for \(\psi ,f:\mathcal {X}\rightarrow \mathbb {R}\) is defined as

$$\begin{aligned} \left[ \psi \star f\right] \left( g\right) =\int _\mathcal {X}\left[ L_g\psi \right] \left( x\right) f\left( x\right) \text {d}x, \end{aligned}$$
(19)

where \(\left[ L_g\psi \right] \left( x\right) =\psi \left( g^{-1}\cdot x\right)\). The output signal is always defined on the rotation group, so \(\mathcal {X}=G\) in all convolutional layers except the first one. Group convolutions are strongly rotation equivariant, i.e., \(\psi \star \left[ L_gf\right] =L_g\left[ \psi \star f\right]\).

It is difficult to evaluate the integral directly, so many methods investigate group convolutions on finite groups. CubeNet (Worrall and Brostow 2018) focuses on convolutions on finite groups and reduces rotation equivariance to permutation equivariance. The group convolution for \(\psi ,f:\widehat{G}\rightarrow \mathbb {R}\) satisfies

$$\begin{aligned} \left[ \psi \star f\right] \left( \hat{g}_j\right) =L_{\hat{g}_i}\left[ \psi \star f\right] \left( \hat{g}_{k\left( i,j\right) }\right) =\left[ \psi \star L_{\hat{g}_i}f\right] \left( \hat{g}_{k\left( i,j\right) }\right) , \end{aligned}$$
(20)

where \(\widehat{G}\) is a finite rotation group and \(\hat{g}_{k\left( i,j\right) }=\hat{g}_i\hat{g}_j\). Therefore, rotation \(f\rightarrow L_{\hat{g}_i}f\) is equivalent to permutation \(j\rightarrow k\left( i,j\right)\) in the group convolution, as Fig. 8 shows. Esteves et al. (2019b) put multi-view features on vertices of the icosahedron and introduce localized filters in discrete G-CNNs for efficiency. EPN (Chen et al. 2021) combines point convolutions with group convolutions for SE(3) equivariance, and has been applied on object detection (Yu et al. 2022) and place recognition (Lin et al. 2022a, 2023a). G-CNNs are employed in many tasks, like medical image analysis (Winkels and Cohen 2018, 2019; Andrearczyk and Depeursinge 2018), point cloud segmentation (Meng et al. 2019; Zhu et al. 2023), pose estimation (Li et al. 2021d), and registration (Wang et al. 2022a, 2023a; Xu et al. 2023a).

Fig. 8

The Cayley table of the tetrahedral group that satisfies \(\hat{g}_{k\left( i,j\right) }=\hat{g}_i\hat{g}_j\). Each different color represents a different rotation element. With the help of the Cayley table, it is straightforward to transform discrete rotation equivariance into permutation equivariance. Best viewed in color
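To make Eq. 20 concrete, the following sketch implements a group convolution on a finite group represented by its Cayley table and inverse lookup; the \(C_4\) example at the end is a hypothetical minimal test that checks the equivariance property.

```python
import numpy as np


def group_conv(psi, f, cayley, inv):
    """[psi * f](g_j) = sum_x psi(g_j^{-1} g_x) f(g_x), i.e., Eq. 19 with X = G-hat.

    psi, f: (n,) signals on the group; cayley[i, j] = index of g_i g_j;
    inv[i] = index of g_i^{-1}.
    """
    n = len(f)
    out = np.zeros(n)
    for j in range(n):                      # output element g_j
        for x in range(n):                  # sum over the group
            out[j] += psi[cayley[inv[j], x]] * f[x]
    return out


# Hypothetical minimal test on the cyclic group C_4: cayley[i, j] = (i + j) % 4.
idx = np.arange(4)
cayley, inv = (idx[:, None] + idx[None, :]) % 4, (-idx) % 4
psi, f = np.random.rand(4), np.random.rand(4)
shift = lambda h, i: h[cayley[inv[i]]]      # [L_{g_i} h](x) = h(g_i^{-1} x)
# Rotating the input permutes the output, matching Eq. 20.
assert np.allclose(group_conv(psi, shift(f, 1), cayley, inv),
                   shift(group_conv(psi, f, cayley, inv), 1))
```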

Some methods utilize Lie groups to construct equivariant models. LieConv (Finzi et al. 2020) lifts raw inputs \(x\in \mathcal {X}\) to group elements \(g\in G\) and orbits \(q\in \mathcal {X}/G\) that \(g\cdot o_q=x\), where \(o_q\) is the origin of q. Thus, the convolution is defined as

$$\begin{aligned} \left[ \psi \star f\right] \left( g,q\right) =\int _G\int _{\mathcal {X}/G}\psi \left( g^{-1}g',q,q'\right) f\left( g',q'\right) \text {d}q'\text {d}\mu \left( g'\right) , \end{aligned}$$
(21)

where \(\psi :G\times \mathcal {X}/G\times \mathcal {X}/G\rightarrow \mathbb {R},f:G\times \mathcal {X}/G\rightarrow \mathbb {R}\). LieTransformer (Hutchinson et al. 2021) adds attention mechanisms to LieConv. After lifting, it computes content attention and location attention, both of which are normalized for feature transformation.

G-CNNs are effective tools for handling equivariance for voxels and point clouds (Worrall and Brostow 2018; Finzi et al. 2020; Chen et al. 2021). Nonetheless, it is difficult to balance the computational burden and the approximation error when using sampling to approximate the integral. Moreover, a finite subgroup of SO(3) is one of the following groups: the cyclic group \(C_k\left( \left| C_k\right| =k\right)\), the dihedral group \(D_k\left( \left| D_k\right| =2k\right)\), the tetrahedral group \(T\left( \left| T\right| =12\right)\), the octahedral group \(O\left( \left| O\right| =24\right)\), and the icosahedral group \(I\left( \left| I\right| =60\right)\) (Artin 2013). \(C_k,D_k\) can be arbitrarily large but are unsuitable for arbitrary 3D rotations, while T, O, and I are applicable but have bounded orders. Therefore, it is hard for methods that depend on finite subgroups to extend to arbitrary rotations, like CubeNet (Worrall and Brostow 2018).

4.2 Spherical CNNs

Spherical CNNs are special cases of G-CNNs where the inputs are spherical and SO(3) signals. In this survey, existing spherical CNNs are divided into three categories: those following Cohen et al. (2018a), those following Esteves et al. (2018a), and the others (Fig. 9).

Fig. 9

A pipeline of Spherical CNNs. Most works (Cohen et al. 2018a; Esteves et al. 2018a) employ tensor products to compute spherical/SO(3) convolutions in the spectral domain, while others directly apply spherical convolutions in the spatial domain

4.2.1 Cohen et al. (2018a)

Cohen et al. (2018a) directly employ group convolutions in Eq. 19, where \(\mathcal {X}\) is either \(S^2\) or SO(3). They use the generalized Fourier transform (GFT) to convert convolutions into matrix multiplications. GFT and its inverse are computed as

$$\begin{aligned} \tilde{f}_m^l&=\int _{S^2}f\left( x\right) \overline{Y_m^l\left( x\right) }\text {d}x,\quad f\left( x\right) =\sum _{l=0}^\infty \sum _{m=-l}^l\tilde{f}_m^lY_m^l\left( x\right) , \end{aligned}$$
(22)
$$\begin{aligned} \tilde{f}_{mn}^l&=\int _{\text {SO}\left( 3\right) }f\left( g\right) \overline{D_{mn}^l\left( g\right) }\text {d}\mu \left( g\right) ,\quad f\left( g\right) =\sum _{l=0}^\infty \sum _{m,n=-l}^l\tilde{f}_{mn}^lD_{mn}^l\left( g\right) , \end{aligned}$$
(23)

where \(l\ge 0, -l\le m,n\le l\). It can be proved that \(\widetilde{\varvec{\psi }\star \varvec{f}}^l=\tilde{\varvec{f}}^l\left( \tilde{\varvec{\psi }}^{l}\right) ^H\), where \(\tilde{\varvec{f}}^l,\tilde{\varvec{\psi }}^{l}\in \mathbb {C}^{2l+1}\) for spherical signals, \(\tilde{\varvec{f}}^l,\tilde{\varvec{\psi }}^{l}\in \mathbb {C}^{\left( 2l+1\right) \times \left( 2l+1\right) }\) for SO(3) signals, and \(\widetilde{\varvec{\psi }\star \varvec{f}}^l\in \mathbb {C}^{\left( 2l+1\right) \times \left( 2l+1\right) }\). The computation can be further accelerated with the generalized fast Fourier transform. Clebsch-Gordan Nets (Kondor et al. 2018) exploit the tensor product nonlinearity to avoid repeated transform, thus improving the efficiency. The tensor product between two steerable vectors \(\tilde{\varvec{u}}^{l_1}\in \mathbb {C}^{2l_1+1},\tilde{\varvec{v}}^{l_2}\in \mathbb {C}^{2l_2+1}\) is defined as

$$\begin{aligned} \left( \tilde{\varvec{u}}^{l_1}\otimes \tilde{\varvec{v}}^{l_2}\right) ^l_m=\sum _{m_1,m_2}C^{l,m}_{l_1,m_1,l_2,m_2}\tilde{u}^{l_1}_{m_1}\tilde{v}^{l_2}_{m_2}, \end{aligned}$$
(24)

where \(C^{l,m}_{l_1,m_1,l_2,m_2}\) is the Clebsch-Gordan coefficient, \(\left| l_1-l_2\right| \le l\le l_1+l_2,-l\le m\le l\). Tensor product is strongly rotation equivariant, i.e., \(\left( \varvec{D}^{l_1}\left( g\right) \tilde{\varvec{u}}^{l_1}\otimes \varvec{D}^{l_2}\left( g\right) \tilde{\varvec{v}}^{l_2}\right) ^l=\varvec{D}^l\left( g\right) \left( \tilde{\varvec{u}}^{l_1}\otimes \tilde{\varvec{v}}^{l_2}\right) ^l.\)
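For illustration, the degree-l component of the tensor product in Eq. 24 can be computed directly from Clebsch-Gordan coefficients, here via sympy; this is a readable but slow reference sketch, whereas practical implementations precompute the coefficients.

```python
import numpy as np
from sympy.physics.quantum.cg import CG


def tensor_product(u, v, l1, l2, l):
    """Degree-l part of u^{l1} (x) v^{l2} (Eq. 24); requires |l1-l2| <= l <= l1+l2.

    u, v are steerable vectors indexed so that u[m1 + l1] is the order-m1 entry.
    """
    out = np.zeros(2 * l + 1, dtype=complex)
    for m in range(-l, l + 1):
        for m1 in range(-l1, l1 + 1):
            m2 = m - m1                       # the coefficient vanishes unless m = m1 + m2
            if -l2 <= m2 <= l2:
                c = float(CG(l1, m1, l2, m2, l, m).doit())
                out[m + l] += c * u[m1 + l1] * v[m2 + l2]
    return out
```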

Many methods are based on spherical convolutions (Cohen et al. 2018a). \(a^3\)SCNN (Liu et al. 2019a) proposes the alt-az anisotropic spherical convolution (\(a^3\)SConv), whose outputs are spherical rather than SO(3) signals. \(a^3\)SConv (\(\star _1\)) is defined as \(\left[ \psi \star _1 f\right] \left( x\right) =\left[ \psi \star f\right] \left( \zeta \left( x,0\right) \right)\), where \(\zeta :S^2\times \left[ 0,2\pi \right) \rightarrow \text {SO}\left( 3\right)\). As \(\zeta \left( x,0\right)\) cannot represent all SO(3) elements, \(a^3\)SConv is only equivariant to specific rotations. Esteves et al. (2020b) introduce spin weights and propose the spin-weighted spherical CNN. PRIN (You et al. 2020, 2021) proposes spherical voxel convolution (SVC) for signals on the unit ball \(B^3\). SVC (\(\star _2\)) is defined as \(\left[ \psi \star _2 f\right] \left( x\right) =\left[ \psi \star f\right] \left( \iota \left( x\right) \right)\), where \(\iota :B^3\rightarrow \text {SO}\left( 3\right)\). SPRIN (You et al. 2021) abandons the dense grids in PRIN by directly converting point clouds \(\left\{ x_i\right\}\) into a distribution function \(f\left( x\right) =\frac{1}{N}\sum _{i}\delta \left( x-x_i\right)\), where \(\delta\) is the Dirac delta function. Then SVC can be efficiently approximated with an unbiased estimator. Chen et al. (2023) combine spherical CNNs with Capsule Networks (Hinton et al. 2011) for unknown pose recognition.

Most methods use ray casting to generate spherical signals from 3D shapes, but other approaches are also applicable. Yang et al. (2019); Yang and Chakraborty (2020) generate spherical signals by collecting responses from point clouds. Spherical-GMM (Zhang 2021) represents point clouds with Gaussian mixture models. Besides classification and segmentation, spherical CNNs are widely used in many tasks, including omnidirectional localization (Zhang et al. 2021a), place recognition (Yin et al. 2020, 2021, 2022), and self-supervised representation learning (Spezialetti et al. 2019; Marcon et al. 2021; Lohit and Trivedi 2020; Spezialetti et al. 2020).

4.2.2 Esteves et al. (2018a)

Esteves et al. (2018a, 2020a) propose another spherical convolution that only processes spherical signals. The spherical convolution for \(\psi ,f:S^2\rightarrow \mathbb {R}\) is defined as

$$\begin{aligned} \left[ \psi *f\right] \left( x\right) =\int _{G}L_{g}\psi \left( x\right) L_{g^{-1}}f\left( \eta \right) \text {d}\mu \left( g\right) , \end{aligned}$$
(25)

where \(\eta\) is the north pole. Such spherical convolutions are strongly rotation equivariant, i.e., \(\psi *\left[ L_gf\right] =L_g\left[ \psi *f\right]\), which can be converted to multiplications with GFT as \(\widetilde{\psi *f}^l_m=2\pi \sqrt{\frac{4\pi }{2l+1}}\tilde{\psi }_0^l\tilde{f}^l_m\). As only \(\tilde{\psi }_0^l\) is involved, the only useful part is the zonal component of the filter \(\psi\).

Esteves et al. (2019a) utilize pre-trained spherical CNNs as supervision and learn equivariant representations for 2D images. Mukhaimar et al. (2022) apply them on concentric spherical voxels for robust point cloud classification. Esteves et al. (2023) scale up spherical CNNs and achieve outstanding performances on molecular benchmarks and weather forecasting tasks.

4.2.3 Others

Some spherical CNNs retain the GFT and parts of the spherical convolution. Zhang et al. (2019a) replace the SO(3) convolutional layers with PointNet-like (Qi et al. 2017a) networks. Almakady et al. (2020) use the GFT to decompose spherical signals, then exploit the norms of individual components as invariant features for volumetric texture classification. Lin et al. (2021b) combine these norms with other invariant features to boost the classification performance.

Some spherical CNNs perform convolutions in the spatial domain. SFCNN (Rao et al. 2019) applies symmetric convolutions to each point and its neighbors on spherical lattices. Yang et al. (2020) propose the geodesic icosahedral pixelization to address the irregularity problem. Fox et al. (2022) transform point clouds into concentric spherical signals and append convolutions along the radial dimension. Shakerinava and Ravanbakhsh (2021) investigate pixelizations of platonic solids for spheres and introduce equivariant maps on them. Xu et al. (2022) exploit global–local attention-based convolutions for spherical data.

4.2.4 Discussion

Spherical CNNs are effective for spherical signals. They have a solid mathematical foundation and nice properties on equivariance. Notwithstanding, preprocessing is sometimes problematic. The ray casting technique is commonly adopted to convert 3D shapes into spherical signals (Cohen et al. 2018a; Esteves et al. 2018a). However, Esteves et al. (2018a) argue that it is only suitable for star-shaped objects, from whose interior point the whole boundary is visible. Besides, projection on spheres would unavoidably distort shapes, and finer grids lead to less error but a heavier computational burden (Cohen et al. 2018a; Esteves et al. 2018a).

4.3 Irreducible representation methods

Irreducible representation methods utilize irreducible representations of SO(3), i.e., Wigner-D matrices \(\varvec{D}^l,l=0,1,\cdots\), to achieve rotation equivariance. A degree-l steerable feature \(\tilde{\varvec{u}}^l\) transforms into \(\varvec{D}^l\left( g\right) \tilde{\varvec{u}}^l\) under \(g\in \text {SO}\left( 3\right)\). In these methods, the degree-l filter \(\varvec{F}^l:\mathbb {R}^3\rightarrow \mathbb {C}^{2l+1}\) is constructed as

$$\begin{aligned} \varvec{F}^l\left( \varvec{x}\right) =\varphi ^l\left( \left\| \varvec{x}\right\| \right) \varvec{Y}^l\left( \frac{\varvec{x}}{\left\| \varvec{x}\right\| }\right) ,\quad \varvec{x}\ne \varvec{0}, \end{aligned}$$
(26)

where \(\varphi ^l:\mathbb {R}_{\ge 0}\rightarrow \mathbb {R}\) and \(\varvec{Y}^l\) is the spherical harmonic. To guarantee the continuity, \(\varvec{F}^l\left( \varvec{0}\right)\) is determined by \(\lim _{\varvec{x}\rightarrow \varvec{0}}F^l\left( \varvec{x}\right)\), which is nonzero only when \(l=0\). \(\varvec{F}^l\) is strongly rotation equivariant, i.e., \(\varvec{F}^l\left( \varvec{R}\left( g\right) \varvec{x}\right) =\varvec{D}^l\left( g\right) \varvec{F}^l\left( \varvec{x}\right)\).
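A minimal sketch of Eq. 26 using scipy's spherical harmonics (its azimuth-first `sph_harm` convention) is given below; the Gaussian radial profile is a hypothetical choice, and \(\varvec{x}=\varvec{0}\) is not handled.

```python
import numpy as np
from scipy.special import sph_harm


def filter_l(x, l, radial=lambda r: np.exp(-r ** 2)):
    """F^l(x) in C^{2l+1} (Eq. 26) for a single nonzero point x in R^3."""
    r = np.linalg.norm(x)
    azimuth = np.arctan2(x[1], x[0])
    colat = np.arccos(x[2] / r)
    # Stack the orders m = -l, ..., l of the degree-l spherical harmonic.
    Y = np.array([sph_harm(m, l, azimuth, colat) for m in range(-l, l + 1)])
    return radial(r) * Y
```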

Fig. 10

TFN (Thomas et al. 2018; Thomas 2019) layer. Each point \(\varvec{x}_i\) is associated with a tensor field \(\varvec{V}_i\). The output tensor field \(\varvec{V}'_i\) is aggregated from the tensor product between the filter features \(\varvec{F}(\varvec{x}_i-\varvec{x}_j)\) and the input tensor field \(\varvec{V}_j\). Some superscripts and subscripts are omitted for simplicity

Irreducible representation methods are mostly applied to coordinate inputs like point clouds. Tensor field networks (TFNs) (Thomas et al. 2018; Thomas 2019) are the pioneering methods using irreducible representations. All inputs and outputs of the TFN layer are tensor fields \(\widetilde{\varvec{V}}^l\in \mathbb {R}^{N\times C_l\times \left( 2\,l+1\right) }\), where N is the number of points, \(C_l\) is the feature dimension, and \(l=0,\cdots ,L\) is the rotation degree. They exploit TFN filters to generate steerable features from coordinates. Then the tensor product between these features and input tensor fields is computed as the output tensor fields, as shown in Fig. 10. TFNs and Clebsch-Gordan Nets (Kondor et al. 2018) have many similarities, including steerable features and tensor products. However, TFNs bind steerable features to points, while Clebsch-Gordan Nets exploit steerable features to describe spherical signals. N-body Networks (Kondor 2018), designed for many-body physical systems, are also based on the irreducible representations of SO(3). Cormorant (Anderson et al. 2019) modifies the nonlinearity in Clebsch-Gordan Nets (Kondor et al. 2018) to avoid the blow-up of channels. SE(3)-Transformer (Fuchs et al. 2020) decomposes the TFN layer into self-interaction and message-passing, where attention is added to the second part. TF-Onet (Chatzipantazis et al. 2023) also uses equivariant attention modules for shape reconstruction. Poulenard and Guibas (2021) propose a new nonlinearity for steerable features to improve the expressivity and reduce the computational burden. TFNs are leveraged in many applications, including 3D shape analysis (Poulenard et al. 2019), protein structure prediction (Fuchs et al. 2021), molecular dynamics simulation (Batzner et al. 2022), and self-supervised canonicalization (Sajnani et al. 2022).

Besides point clouds, irreducible representation methods are also applied to voxels. 3D Steerable CNNs (Weiler et al. 2018) reduce rotation equivariant linear maps between irreducible features to convolutions with steerable kernels \(\varvec{W}^{ll'}:\mathbb{R}^3\rightarrow \mathbb{R}^{\left(2l+1\right)\times \left(2l'+1\right)}\) that satisfy

$$\begin{aligned} \varvec{W}^{ll'}\left( \varvec{R}\left( g\right) \varvec{x}\right) =\varvec{D}^l\left( g\right) \varvec{W}^{ll'}\left( \varvec{x}\right) \varvec{D}^{l'}\left( g\right) ^{-1}. \end{aligned}$$
(27)

Eq. 27 can be solved analytically, and the solution takes the form of a TFN-type matrix function. 3D Steerable CNNs are employed in several applications, including 3D texture analysis (Andrearczyk et al. 2019), partial point cloud classification (Xu et al. 2023b), and multiphase flow demonstration (Siddani et al. 2021; Lin et al. 2021a). PDO-s3DCNNs (Shen et al. 2022) derive general steerable 3D CNNs with partial differential operators.
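Eq. 27 is easy to verify numerically. For \(l=l'=1\), \(\varvec{D}^1\left( g\right)\) coincides with the rotation matrix \(\varvec{R}\left( g\right)\) in the Cartesian basis (equivalent to the real degree-1 spherical-harmonic basis up to a permutation), and \(\varvec{W}\left( \varvec{x}\right) =\varphi \left( \left\| \varvec{x}\right\| \right) \varvec{x}\varvec{x}^\top /\left\| \varvec{x}\right\| ^2\) is one admissible kernel. A quick check under an assumed Gaussian profile:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def W(x):
    """Example degree-(1,1) steerable kernel: phi(r) * x x^T / r^2."""
    r2 = x @ x
    return np.exp(-r2) * np.outer(x, x) / r2

R = Rotation.random(random_state=0).as_matrix()
x = np.array([0.3, -1.2, 0.7])
# Eq. 27 with D^1 = R: W(R x) == R W(x) R^{-1}
print(np.allclose(W(R @ x), R @ W(x) @ R.T))  # True
```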

Irreducible representation methods have intrinsic strong rotation equivariance. Nonetheless, their theory is complex enough to limit the audience (Weiler et al. 2018; Thomas et al. 2018; Thomas 2019). Besides, tensor products may increase the number of rotation degrees and harm efficiency (Thomas et al. 2018; Thomas 2019; Kondor et al. 2018).

4.4 Equivariant value methods

Equivariant value methods are networks constructed from equivariant values, i.e., scalars and vectors. They are similar to the invariant value methods in Sect. 3.5. However, invariant values are only primitive features, while equivariant values form the basic blocks of equivariant networks.

EGNNs (Satorras et al. 2021b) add relative distances to graph convolutional layers. Then the coordinate \(\varvec{x}_i\) and feature \(\varvec{f}_i\) are updated as

$$\begin{aligned} \varvec{m}_{ij}=\phi _e&\left( \varvec{f}_i,\varvec{f}_j,\left\| \varvec{x}_i-\varvec{x}_j\right\| ^2,\varvec{a}_{ij}\right) , \end{aligned}$$
(28)
$$\begin{aligned} \varvec{x}_i+\frac{1}{N-1}\sum _{j\ne i}\left( \varvec{x}_i-\varvec{x}_j\right) \phi _x&\left( \varvec{m}_{ij}\right) \rightarrow \varvec{x}_i,\ \phi _f\left( \varvec{f}_i,\sum _{j\in \mathcal {N}\left( \varvec{x}_i\right) }\varvec{m}_{ij}\right) \rightarrow \varvec{f}_i, \end{aligned}$$
(29)

where \(\varvec{a}_{ij}\) is the edge information, and \(\phi _e,\phi _x,\phi _f\) are update functions for edges, coordinates, and node features, respectively. Clearly, the coordinates are strongly rotation equivariant, while the features are strongly rotation invariant. E-NFs (Satorras et al. 2021a) combine EGNNs with continuous-time normalizing flows (Chen et al. 2018a) to construct equivariant generative models. EquiDock (Ganea et al. 2022) and EquiBind (Stärk et al. 2022) apply graph matching networks (Li et al. 2019b) and EGNNs to rigid body protein-protein docking and drug binding structure prediction, respectively. Some methods (Hoogeboom et al. 2022; Schneuing et al. 2022; Igashov et al. 2022; Lin et al. 2022b; Guan et al. 2023) incorporate diffusion models with EGNNs for molecule generation. SEGNNs (Brandstetter et al. 2022) extend EGNNs with steerable features.
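A minimal PyTorch sketch may clarify Eqs. 28-29; it assumes a fully connected graph, omits the edge attributes \(\varvec{a}_{ij}\), and uses hypothetical MLP sizes:

```python
import torch
import torch.nn as nn

class EGNNLayer(nn.Module):
    """Minimal EGNN layer (Eqs. 28-29); a sketch, not the reference code."""
    def __init__(self, dim):
        super().__init__()
        self.phi_e = nn.Sequential(nn.Linear(2 * dim + 1, dim), nn.SiLU(),
                                   nn.Linear(dim, dim))
        self.phi_x = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(),
                                   nn.Linear(dim, 1))
        self.phi_f = nn.Sequential(nn.Linear(2 * dim, dim), nn.SiLU(),
                                   nn.Linear(dim, dim))

    def forward(self, x, f):
        # x: (N, 3) equivariant coordinates, f: (N, C) invariant features
        N = x.shape[0]
        diff = x[:, None, :] - x[None, :, :]           # (N, N, 3)
        dist2 = (diff ** 2).sum(-1, keepdim=True)      # (N, N, 1), invariant
        fi = f[:, None, :].expand(N, N, -1)
        fj = f[None, :, :].expand(N, N, -1)
        m = self.phi_e(torch.cat([fi, fj, dist2], -1))     # messages, Eq. 28
        mask = 1.0 - torch.eye(N, device=x.device).unsqueeze(-1)  # drop j == i
        x = x + (diff * self.phi_x(m) * mask).sum(1) / (N - 1)    # Eq. 29, left
        f = self.phi_f(torch.cat([f, (m * mask).sum(1)], -1))     # Eq. 29, right
        return x, f
```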

Vector Neurons (VNs) (Deng et al. 2021a) endow networks with equivariance by replacing scalars with vectors. Take the linear layer as an example: \(\varvec{v}\in \mathbb{R}^C\) is transformed into \(\varvec{Wv}+\varvec{b}\in \mathbb{R}^{C'}\) in classic networks, while \(\varvec{V}\in \mathbb{R}^{C\times 3}\) is transformed into \(\varvec{WV}\in \mathbb{R}^{C'\times 3}\) in VNs, where \(\varvec{W}\in \mathbb{R}^{C'\times C},\varvec{b}\in \mathbb{R}^{C'}\) (Fig. 11). Other layers are modified analogously. VN-Transformer (Assaad et al. 2022) derives equivariant attention mechanisms to enhance effectiveness and efficiency based on VNs. VNs are strongly rotation equivariant and have been applied to object manipulation (Simeonov et al. 2022), molecule generation (Huang et al. 2022b), point cloud registration (Zhu et al. 2022b; Lin et al. 2023b; Ao et al. 2023b), point cloud completion (Wu and Miao 2022), unsupervised point cloud segmentation (Lei et al. 2023), and point cloud canonicalization (Katzir et al. 2022; Kaba et al. 2023). Geometric vector perceptrons (GVPs) (Jing et al. 2021b) similarly operate on geometric vectors. Jing et al. (2021a) apply GVPs to structural biology tasks and reach several state-of-the-art results. PaiNN (Schütt et al. 2021) builds efficient equivariant layers to predict molecular properties. SE(3)-DDM (Liu et al. 2022b) applies PaiNN to the coordinate denoising task. TorchMD-NET (Thölke and Fabritiis 2022) designs attention-based update rules for features of different types. Directed weight neural networks (Li et al. 2022a) generalize VNs and GVPs with more operators, which can be integrated with existing GNN frameworks. Chen et al. (2022) build graph implicit functions with equivariant layers to capture geometric details. Le et al. (2022b) exploit cross products to generate new vectors in the message function.

Fig. 11 The comparison between linear layers in typical networks (left) and VNs (right) (Deng et al. 2021a). Each solid line represents a weight value. As the vectors are transformed consistently, VNs can achieve strong rotation equivariance
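The VN linear layer is simple enough to state in a few lines. The sketch below (a minimal reproduction of the idea, not the official code) mixes channels while leaving the three spatial components untouched, so rotations commute with the layer:

```python
import torch
import torch.nn as nn

class VNLinear(nn.Module):
    """Vector Neuron linear layer: V in R^{C x 3} -> W V in R^{C' x 3}."""
    def __init__(self, c_in, c_out):
        super().__init__()
        # no bias: a fixed bias vector would break rotation equivariance
        self.weight = nn.Parameter(torch.randn(c_out, c_in) / c_in ** 0.5)

    def forward(self, V):  # V: (..., C, 3)
        return torch.einsum('oc,...ci->...oi', self.weight, V)

# equivariance check with a random orthogonal matrix
V = torch.randn(8, 16, 3)
R = torch.linalg.qr(torch.randn(3, 3)).Q
layer = VNLinear(16, 32)
print(torch.allclose(layer(V @ R.T), layer(V) @ R.T, atol=1e-5))  # True
```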

Villar et al. (2021) utilize several theorems to construct equivariant functions on groups including O(n) and SO(n). GMN (Huang et al. 2022a) constructs equivariant networks similarly and proves their universal approximation. IsoGCNs (Horie et al. 2021) achieve equivariance by operating on rank-p tensors \(H^p\in \mathbb{R}^{\left| \mathcal{V}\right| \times C\times d^p}\). Using a similar approach, Finkelshtein et al. (2022) define ascending and descending layers for geometric dimension expansion and contraction, respectively. Suk et al. (2021, 2022) leverage equivariant neural networks in computational fluid dynamics. EQGAT (Le et al. 2022a) processes coordinates with attention mechanisms for better performance. Luo et al. (2022) extend message passing networks with learned orientations. DeepDFT (Jørgensen and Bhowmik 2022) employs message passing networks for fast electron density estimation.

Compared to previous methods, equivariant value methods do not introduce approximation errors, and their theories are relatively simple (Satorras et al. 2021b; Deng et al. 2021a). Although they emerged only recently, they have shown great potential in many applications (Deng et al. 2021a; Ganea et al. 2022; Stärk et al. 2022; Schütt et al. 2021).

4.5 Others

Some equivariant networks use quaternions to represent 3D rotations. REQNN (Shen et al. 2020) employs quaternions to revise classic layers into equivariant ones. Zhao et al. (2020a) propose quaternion equivariant capsule networks to disentangle geometry from poses. Quaternion CNNs (Jing et al. 2021c) utilize convolutions on quaternion arrays for gait identification. Qin et al. (2022) present quaternion product units to address rotation equivariance.
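These methods build on the quaternion rotation primitive: a unit quaternion q rotates a point p via the sandwich product \(q\,p\,q^*\), and equivariance follows from composing quaternion products. A minimal sketch (Hamilton (w, x, y, z) convention assumed), checked against the corresponding rotation matrix:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def quat_mul(a, b):
    """Hamilton product in the (w, x, y, z) convention."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def rotate(q, p):
    """Rotate point p by unit quaternion q: vector part of q p q*."""
    q_conj = q * np.array([1.0, -1.0, -1.0, -1.0])
    return quat_mul(quat_mul(q, np.concatenate([[0.0], p])), q_conj)[1:]

rot = Rotation.random(random_state=0)
q = np.roll(rot.as_quat(), 1)  # scipy stores (x, y, z, w); roll to (w, x, y, z)
p = np.array([0.3, -1.2, 0.7])
print(np.allclose(rotate(q, p), rot.as_matrix() @ p))  # True
```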

Some methods turn to gauges for rotation equivariance. Gauge equivariant CNNs (Cohen et al. 2019a) propose gauge equivariant convolutions based on gauge theory. Haan et al. (2021) adapt this structure to mesh inputs. The gauge equivariant transformer (He et al. 2021) adds attention mechanisms to gauge equivariant CNNs for better performance.

Finzi et al. (2021) derive an equivariance condition similar to that of 3D Steerable CNNs (Weiler et al. 2018) with Lie algebra representations. EqDDM (Azari and Erdogmus 2022) leverages these constraints to build an equivariant deep dynamical model for motion prediction. Melnyk et al. (2022) establish steerability constraints for spherical neurons to construct equivariant layers.

Li et al. (2019a) take a similar approach to CubeNet (Worrall and Brostow 2018) but without group convolutions, where invariance is achieved by eliminating the permutation. XEdgeConv (Weihsbach et al. 2022) directly explores symmetric kernels for discrete rotation equivariance. Park et al. (2022) design equivariant networks for domains where it is hard to describe the transformation of inputs explicitly.

4.6 Summary

Rotation equivariant methods have a broader application range than rotation invariant ones. The differences among various rotation equivariant methods are listed in Table 8. We summarize several characteristics of existing rotation equivariant methods below.

  • The approximation errors of G-CNNs (Finzi et al. 2020; Chen et al. 2021) and spherical CNNs (Cohen et al. 2018a; Esteves et al. 2018a) are inevitable and can only be reduced through fine discretization and cumbersome computation. Therefore, they are less reliable than strongly rotation equivariant methods.

  • Albeit strongly rotation equivariant, irreducible representation methods (Thomas et al. 2018; Weiler et al. 2018; Thomas 2019) have a complex theory, which poses great challenges for newcomers.

  • Equivariant value methods (Satorras et al. 2021b; Deng et al. 2021a) strike a good balance between theoretical properties and experimental performance.

Table 8 Comparisons of different rotation equivariant methods

5 Application and dataset

Rotation invariance and equivariance are seldom separate problems; they depend on the task requirements of specific settings. We give a general overview of the applications and datasets involved in related works and divide them into 3D semantic understanding and molecule-related applications.

5.1 3D semantic understanding

3D semantic understanding tasks, like classification, segmentation, and detection, evaluate the capability of DNNs on 3D shapes and scenes. Here we focus on tasks requiring rotation invariance and equivariance. We summarize these tasks and related datasets in Table 9. For aligned datasets, rotation augmentation is required to pose a sufficient challenge to rotation invariant and equivariant methods. In the following discussions, A/B and \(\frac{A}{B}\) both denote training with A augmentation and evaluating with B augmentation. We use z and SO(3) to represent azimuthal and random rotation augmentation, respectively.
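For concreteness, the two augmentation protocols can be implemented in a few lines; the sketch below is an assumed typical implementation, not taken from any particular codebase:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def rotation_augment(points, mode, rng):
    """Apply z (azimuthal) or SO(3) (uniformly random) rotation augmentation
    to an (N, 3) point cloud."""
    if mode == 'z':
        R = Rotation.from_euler('z', rng.uniform(0, 2 * np.pi)).as_matrix()
    elif mode == 'SO(3)':
        R = Rotation.random(random_state=rng).as_matrix()
    else:
        raise ValueError(mode)
    return points @ R.T

rng = np.random.default_rng(0)
pts = rng.standard_normal((1024, 3))
train_sample = rotation_augment(pts, 'z', rng)     # e.g., training in z/SO(3)
eval_sample = rotation_augment(pts, 'SO(3)', rng)  # evaluation in z/SO(3)
```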

Table 9 Tasks and datasets in general 3D understanding
Table 10 ModelNet40 (Wu et al. 2015) classification results of representative rotation invariant/equivariant methods
Table 11 ScanObjectNN (Uy et al. 2019) classification results of representative rotation invariant/equivariant methods
Table 12 ShapeNetPart (Yi et al. 2016) segmentation results of representative rotation invariant/equivariant methods

Classification. Classification is the most well-studied task in this field. ModelNet (Wu et al. 2015) is a commonly-used 3D CAD model dataset with two versions, i.e., ModelNet10 with 10 categories and ModelNet40 with all 40 categories. We list the experimental results of ModelNet40 classification in Table 10. As the table shows, there is an input type shift from images and meshes to point clouds, which can be attributed to the fact that point clouds provide the precise coordinates essential for strong rotation invariance. Besides, rotation invariant methods generally perform better than rotation equivariant ones; they are more suitable for tasks that only require the prediction of invariant targets. ShapeNetCore (Chang et al. 2015) is another popular 3D shape dataset with 55 categories. Unlike previous datasets, ScanObjectNN (Uy et al. 2019) is a real-world point cloud dataset, adding more challenges to classification. ScanObjectNN has three popular variants, i.e., OBJ_ONLY, OBJ_BG, and PB_T50_RS. Many researchers also evaluate their methods on these datasets, whose experimental results are summarized in Table 11. Fewer works explore ScanObjectNN compared to ModelNet40 (Wu et al. 2015). Besides, there is no consensus on which variant to evaluate on, and results still leave much room for improvement. Other datasets are used less frequently, like RGB-D Object (Lai et al. 2011), S3DIS (Armeni et al. 2016), and ScanNet (Dai et al. 2017). Some methods, especially those processing spherical signals, use Spherical MNIST (Cohen et al. 2018a) to evaluate their performance. Yang et al. (2020) create Spherical CIFAR-10 to experiment on photorealistic images. Andrearczyk and Depeursinge (2018); Almakady et al. (2020) exploit RFAI (Paulhac et al. 2009) for 3D texture classification. Yang and Chakraborty (2020) employ OASIS (Fotenos et al. 2005) for medical image classification.

Segmentation. Segmentation is another popular task, aiming to make fine-grained predictions. In part segmentation for small-scale objects, ShapeNetPart (Yi et al. 2016) is widely applied as the evaluation dataset, with two common metrics, i.e., instance mean IoU (ins.) and class mean IoU (cls.). As shown in Table 12, RIConv++ (Zhang et al. 2022) and PaRI-Conv (Chen and Cong 2022) set the state-of-the-art results in class mean IoU and instance mean IoU, respectively. However, we also notice that differences in evaluation metrics make direct comparisons of various methods confusing and unfair, which should be avoided in future works. Performance gaps still exist between rotation invariant methods and rotation equivariant ones, since part segmentation only requires point-wise invariant prediction. Hegde and Gangisetty (2021) employ PartNet (Mo et al. 2019) for a more thorough evaluation. Besides, Zhuang et al. (2019); Zhu et al. (2020) investigate BraTS-2018 (Menze et al. 2015) for brain tumor segmentation. In semantic segmentation for large-scale scenes, S3DIS (Armeni et al. 2016), ScanNet (Dai et al. 2017), Semantic3D (Hackel et al. 2017), and 2D-3D-S (Armeni et al. 2017) are commonly used.

Detection. Detection is a basic task but remains underexplored with respect to rotation invariance and equivariance. Some works (Yu et al. 2022; Wang et al. 2023b) incorporate equivariant networks into 3D object detectors. These methods are applied to indoor datasets like ScanNetV2 (Dai et al. 2017) and SUN RGB-D (Song et al. 2015) and outdoor datasets like KITTI (Geiger et al. 2012) and nuScenes (Caesar et al. 2020). Besides, Winkels and Cohen (2018); Andrearczyk et al. (2019, 2020) investigate the pulmonary nodule detection task with LIDC/IDRI (McNitt-Gray et al. 2007) and NLST (Team 2011).

Pose Estimation. The targets for pose estimation are pose parameters. Many aligned datasets can be adjusted for pose estimation, including ModelNet (Wu et al. 2015), ShapeNet (Chang et al. 2015), and ObjectNet3D (Xiang et al. 2016). Besides general shapes, some works focus on the pose estimation of specific objects. Xu et al. (2021a) employ Human3.6M (Ionescu et al. 2014) and MPI-INF-3DHP (Mehta et al. 2017) for human pose estimation. Chen et al. (2018b) regress hand poses on ICVL (Tang et al. 2014), NYU (Tompson et al. 2014), and MSRA (Sun et al. 2015).

Shape Registration. Registration is the task of matching among multiple inputs. 3DMatch (Zeng et al. 2017) is a well-known registration benchmark composed of 7Scenes (Shotton et al. 2013) and SUN3D (Xiao et al. 2013). Liu et al. (2018); Melzi et al. (2019) investigate registration on the Stanford 3D Scanning Repository (Curless and Levoy 1996). Melzi et al. (2019) exploit TOSCA (Bronstein et al. 2008), FAUST (Bogo et al. 2014), and TOPKIDS (Lähner et al. 2016) for deformable shape registration.

Place Recognition. Place recognition is a special case of registration in which inputs are matched against maps. KITTI (Geiger et al. 2012) includes a series of autonomous driving benchmarks, where the odometry benchmark is generally adopted to evaluate place recognition performance. Many other datasets are also leveraged for a comprehensive evaluation, including ETH (Pomerleau et al. 2012), NCLT (Carlevaris-Bianco et al. 2016), SceneCity (Zhang et al. 2016), Oxford RobotCar (Maddern et al. 2017), MulRan (Kim et al. 2020a), and KITTI-360 (Liao et al. 2022).

Reconstruction. Reconstruction is a pre-training task adopted by many self-supervised methods. Many works (Shen et al. 2020; Deng et al. 2021a; Sun et al. 2021; Zhou et al. 2022b) carry out reconstruction experiments on ShapeNetCore (Chang et al. 2015). In addition, Yu et al. (2020b) utilize ModelNet40 (Wu et al. 2015) for point cloud inpainting and completion.

Retrieval. Retrieval is the task of finding objects similar to a query object. SHREC'17 (Savva et al. 2017) is a famous retrieval challenge based on ShapeNetCore (Chang et al. 2015). Some methods (Su et al. 2015; Esteves et al. 2019b; Wei et al. 2020) also experiment on ModelNet (Wu et al. 2015).

Others. Ke et al. (2017) use NTU RGB+D (Shahroudy et al. 2016), SBU Kinect Interaction (Yun et al. 2012), and the CMU dataset (CMU 2002) for skeleton action recognition. Qin et al. (2022) apply FPHA (Garcia-Hernando et al. 2018) to hand action recognition. Besides, some methods (Liu et al. 2019b; Zhang et al. 2020c; Yang et al. 2021) exploit ModelNet40 (Wu et al. 2015) for normal estimation. Esteves et al. (2023) employ WeatherBench (Rasp et al. 2020) to evaluate large spherical CNNs (Esteves et al. 2018a) on weather forecasting.

5.2 Molecule-related application

Recently, the number of papers that employ rotation equivariant networks on molecular data has grown rapidly. Physical and chemical laws determine the relative, but not the absolute, positions of atoms. Therefore, rotation invariance and equivariance are inherently needed in molecule-related applications. As research progresses, many new tasks have been investigated; we summarize only some representative ones. Tasks and datasets are listed in Table 13.

Table 13 Tasks and datasets in molecule-related applications
Table 14 QM9 (Ramakrishnan et al. 2014) prediction mean absolute error of representative rotation invariant/equivariant methods

Prediction. The goal of prediction is to estimate molecular properties given molecular structures. QM7 (Blum and Reymond 2009; Rupp et al. 2012) is a small, pioneering dataset used by some works (Liu et al. 2022c; Kondor et al. 2018). QM9 (Ramakrishnan et al. 2014) is a commonly-used dataset including 134k molecules with geometric, energetic, electronic, and thermodynamic properties. As shown in Table 14, there are more rotation equivariant methods than rotation invariant ones in this prediction task. As research deepens, novel methods with powerful and sophisticated structures show great potential in decreasing the mean absolute error of molecular property prediction. ATOM3D (Townshend et al. 2021) is a set of benchmarks covering various tasks. Other datasets, including MD17 (Chmiela et al. 2017), ISO17 (Schütt et al. 2017), ESOL (Delaney 2004), BACE (Subramanian et al. 2016), PDB (Berman et al. 2003), and OC20 (Zitnick et al. 2020), are also applied in different prediction tasks.

Generation. In generation, the model is required to generate molecules according to certain requirements. Thomas et al. (2018) employ random deletion on QM9 (Ramakrishnan et al. 2014) and validate their model with an inpainting task. Jing et al. (2021b); Li et al. (2022a) exploit CATH 4.2 (Ingraham et al. 2019) and TS50 (Li et al. 2014) for computational protein design. Du et al. (2021) employ subsets of GEOM (Axelrod and Gómez-Bombarelli 2022) for conformation generation tasks. Satorras et al. (2021a) utilize LJ-13 (Köhler et al. 2020) for 3D state generation.

Others. Jing et al. (2021b); Li et al. (2022a) apply CASP (Cheng et al. 2019) to model quality assessment. Poulenard et al. (2019) leverage PDB (Berman et al. 2000) for RNA segmentation. Ganea et al. (2022) exploit DB5.5 (Vreven et al. 2015) and DIPS (Townshend et al. 2019) for rigid protein docking.

6 Future direction

Here we point out several future research directions inspired by unsolved problems in current methods and tasks.

6.1 Method

The pros and cons of existing methods have been summarized in Sects. 3 and 4. Future methods should perform better and avoid previous drawbacks by possessing the following properties.

  • Strong rotation invariance and equivariance. This survey is the first to include weakly invariant and equivariant methods in the discussion of rotation invariance and equivariance. Nonetheless, we argue that these methods should be used only when necessary. They introduce unnecessary uncertainty and cannot deliver consistent results for the same input under different poses.

  • Concise mathematical background. The theories of many existing methods are too verbose and complicated. They should be simplified, especially when they have little connection with the implementation. Novel methods should avoid exploring general but unrelated theories.

  • High computational efficiency. Due to high latency, many well-performing methods cannot be employed in practical applications. As research progresses toward large-scale and complex data, new work should consider such application scenarios and be as efficient as possible.

  • Reliable integrability. Many successful DNNs have been developed for numerous applications where rotation invariance and equivariance are not considered; therefore, they are only suitable for aligned data. If newly developed methods can be integrated with these models straightforwardly, the composite models would benefit from both.

6.2 Theoretical analysis

Most of the existing theoretical analysis addresses strong invariance and equivariance. Some methods propose mathematical frameworks to construct equivariant networks (Kondor and Trivedi 2018; Cohen et al. 2018b, 2019b; Esteves 2020; Aronsson 2021; Gerken et al. 2021; Winter et al. 2022). However, the discussion on universal approximation is quite limited (Dym and Maron 2021), and most equivariant networks do not have solid mathematical foundations.

6.3 Benchmark

The research on rotation invariance and equivariance is still immature and lacks reliable and comprehensive benchmarks. Except for some well-studied tasks, most applications have yet to be intensively investigated. The evaluation metric (Eq. 3) has yet to be commonly adopted, especially for weakly rotation invariant and equivariant methods. Existing metrics cannot reflect the strength of invariance and equivariance.

7 Conclusion

In this survey, we give a comprehensive overview of rotation invariant and equivariant methods in 3D deep learning. We first discuss the limitations of DNNs trained with canonical poses, which motivate the research on rotation invariant and equivariant methods. Then, we define weak/strong invariance and equivariance and provide a unified theoretical framework for analysis. Overall, existing methods are divided into rotation invariant and rotation equivariant ones, which are further subclassified according to their principles. At each level, representative works are reviewed and discussed, and the relevant applications and datasets are sorted out. Finally, we pose some open problems and future research directions based on the challenges and difficulties in current research. We hope this survey can serve as an effective tool for future research on rotation invariant and equivariant methods.