PCT: Point cloud transformer

The irregular domain and lack of ordering make it challenging to design deep neural networks for point cloud processing. This paper presents a novel framework named Point Cloud Transformer (PCT) for point cloud learning. PCT is based on Transformer, which has achieved huge success in natural language processing and displays great potential in image processing. It is inherently permutation invariant for processing a sequence of points, making it well-suited for point cloud learning. To better capture local context within the point cloud, we enhance input embedding with the support of farthest point sampling and nearest neighbor search. Extensive experiments demonstrate that PCT achieves state-of-the-art performance on shape classification, part segmentation, semantic segmentation, and normal estimation tasks.


Introduction
Extracting semantics directly from a point cloud is an urgent requirement in applications such as robotics, autonomous driving and augmented reality. Unlike 2D images, point clouds are disordered and unstructured, making it challenging to design neural networks to process them. Qi et al. [21] pioneered PointNet for feature learning on point clouds, using multi-layer perceptrons (MLPs), max-pooling and rigid transformations to ensure invariance under permutations and rotation. Inspired by the strong progress made by convolutional neural networks (CNNs) in image processing, many recent works [24,17,1,31] have considered defining convolution operators that can aggregate local features for point clouds. These methods either reorder the input point sequence or voxelize the point cloud to obtain a canonical domain for convolutions.
Recently, Transformer [26], the dominant framework in natural language processing, has been applied to image vision tasks, giving better performance than popular convolutional neural networks [7,30]. Transformer is an encoder-decoder structure that contains three main modules for input (word) embedding, positional (order) encoding, and self-attention. The self-attention module is the core component, generating a refined attention feature for its input feature based on global context. First, self-attention takes the sum of the input embedding and the positional encoding as input, and computes three vectors for each word: query, key and value, through trained linear layers. Then, the attention weight between any two words can be obtained by matching (dot-producting) their query and key vectors. Finally, the attention feature is defined as the weighted sum of all value vectors using the attention weights. Obviously, the output attention feature of each word is related to all input features, making it capable of learning the global context. All operations of Transformer are parallelizable and order-independent. In theory, it can replace the convolution operation in a convolutional neural network and has better versatility. For a more detailed introduction to self-attention, please refer to Section 3.2.
Inspired by Transformer's success in vision and NLP tasks, we propose a novel framework, PCT, for point cloud learning based on the principles of the traditional Transformer. The key idea of PCT is to use the inherent order invariance of Transformer to avoid the need to define an order on point cloud data, and to conduct feature learning through the attention mechanism. As shown in Figure 1, the distribution of attention weights is highly related to part semantics, and it does not seriously attenuate with spatial distance.
Point clouds and natural language are rather different kinds of data, so our PCT framework must make several adjustments for this. These include:

• Coordinate-based input embedding module. In Transformer, a positional encoding module is applied to represent the word order in natural language. This can distinguish the same word in different positions and reflect the positional relationships between words. However, point clouds have no fixed order. In our PCT framework, we merge the raw positional encoding and the input embedding into a coordinate-based input embedding module. It can generate distinguishable features, since each point has unique coordinates which represent its spatial position.
• Optimized offset-attention module. Our proposed offset-attention module is an effective upgrade of the original self-attention: it replaces the attention feature with the offset between the input of the self-attention module and the attention feature. This has two advantages. First, the absolute coordinates of the same object can be completely different under rigid transformations, so relative coordinates are generally more robust. Second, the Laplacian matrix (the offset between the degree matrix and the adjacency matrix) has proven very effective in graph convolution learning [3]. From this perspective, we regard the point cloud as a graph whose 'float' adjacency matrix is the attention map. In our work, the attention map is scaled so that each row sums to 1, so the degree matrix can be understood as the identity matrix. The offset-attention optimization process can therefore be approximately understood as a Laplacian process, which will be discussed in detail in Section 3.3. We also conduct thorough comparative experiments between offset-attention and self-attention, introduced in Section 4, to demonstrate its effectiveness.
• Neighbor embedding module. Every word in a sentence carries basic semantic information, but the independent input coordinates of points are only weakly related to the semantic content. The attention mechanism is effective in capturing global features, but it may ignore local geometric information, which is also essential for point cloud learning. To address this, we use a neighbor embedding strategy to improve upon point embedding. It also assists the attention module by computing attention between local groups of points carrying semantic information instead of individual points.
With the above adjustments, PCT becomes more suitable for point cloud feature learning and achieves state-of-the-art performance on shape classification, part segmentation and normal estimation tasks.
The main contributions of this paper are summarized as follows: 1. We propose a novel Transformer-based framework named PCT for point cloud learning, which is inherently suitable for unstructured, disordered point cloud data with an irregular domain.
2. We propose offset-attention with an implicit Laplace operator and normalization refinement, which is inherently permutation-invariant and more suitable for point cloud learning than the original self-attention module in Transformer.

3. Extensive experiments demonstrate that PCT with explicit local context enhancement achieves state-of-the-art performance on shape classification, part segmentation and normal estimation tasks.

Transformer in NLP
Bahdanau et al. [2] proposed a neural machine translation method with an attention mechanism, in which the attention weight is computed through the hidden state of an RNN. Self-attention was proposed by Lin et al. [18] to visualize and interpret sentence embeddings. Building on these, Vaswani et al. [26] proposed Transformer for machine translation; it is based solely on self-attention, without any recurrence or convolution operators. Devlin et al. [6] proposed the bidirectional transformer (BERT) approach, which is one of the most powerful models in the NLP field. More recently, language learning networks such as XLNet [36], Transformer-XL [5] and BioBERT [15] have further extended the Transformer framework.
However, in natural language processing the input is ordered and each word has basic semantic meaning, whereas point clouds are unordered and individual points generally have no semantic meaning.

Transformer for vision
Many frameworks have introduced attention into vision tasks. Wang et al. [27] proposed a residual attention approach with stacked attention modules for image classification. Hu et al. [10] presented a novel spatial encoding unit, the SE block, whose idea was derived from the attention mechanism. Zhang et al. [38] designed SAGAN, which uses self-attention for image generation. There has also been an increasing trend to employ Transformer as a module to optimize neural networks. Wu et al. [30] proposed visual transformers that apply Transformer to token-based images from feature maps for vision tasks. Recently, Dosovitskiy et al. [7] proposed an image recognition network, ViT, based on patch encoding and Transformer, showing that with sufficient training data, Transformer provides better performance than a traditional convolutional neural network. Carion et al. [4] presented an end-to-end detection transformer that takes CNN features as input and generates bounding boxes with a Transformer encoder-decoder.
Inspired by the local patch structures used in ViT and the basic semantic information carried by words in language, we present a neighbor embedding module that aggregates features from a point's local neighborhood, capturing local information and obtaining semantic information.

Point-based deep learning
PointNet [21] pioneered point cloud learning. Subsequently, Qi et al. proposed PointNet++ [22], which uses query-ball grouping and hierarchical PointNet to capture local structures. Several subsequent works considered how to define convolution operations on point clouds. One main approach is to convert a point cloud into a regular voxel array to allow convolution operations. Tchapmi et al. [24] proposed SEGCloud for pointwise segmentation. It maps convolution features of 3D voxels to point clouds using trilinear interpolation and keeps global consistency through fully connected conditional random fields. Atzmon et al. [1] presented the PCNN framework with extension and restriction operators to map between point-based and voxel-based representations; volumetric convolution is performed on voxels for point feature extraction. MCCNN by Hermosilla et al. [8] allows non-uniformly sampled point clouds; convolution is treated as a Monte Carlo integration problem. Similarly, in PointConv proposed by Wu et al. [31], 3D convolution is performed through Monte Carlo estimation and importance sampling.
A different approach redefines convolution to operate directly on irregular point cloud data. Li et al. [17] introduced a point cloud convolution network, PointCNN, in which a χ-transformation is trained to determine a 1D point order for convolution. Tatarchenko et al. [23] proposed tangent convolution, which can learn surface geometric features from projected virtual tangent images. SPG, proposed by Landrieu et al. [13], divides the scanned scene into similar elements and establishes a superpoint graph structure to learn contextual relationships between object parts. Pan et al. [35] used a parallel framework to extend CNNs from the conventional domain to a curved two-dimensional manifold; however, it requires dense 3D gridded data as input, so it is unsuitable for 3D point clouds. Wang et al. [29] designed an EdgeConv operator for dynamic graphs, allowing point cloud learning by recovering local topology.
Various other methods also employ attention and Transformer. Yan et al. [34] proposed PointASNL to deal with noise in point cloud processing, using a self-attention mechanism to update features for local groups of points. Hertz et al. [9] proposed PointGMM for shape interpolation with both multi-layer perceptron (MLP) splits and attentional splits.
Unlike the above methods, our PCT is based on Transformer rather than using self-attention as an auxiliary module. While the framework by Wang et al. [28] uses Transformer to optimize point cloud registration, our PCT is a more general framework which can be used for various point cloud tasks.

Transformer for Point Cloud Representation
In this section, we first show how the point cloud representation learned by our PCT can be applied to various point cloud processing tasks, including point cloud classification, part segmentation and normal estimation. Thereafter, we detail the design of PCT. We first introduce a naïve version of PCT obtained by directly applying the original Transformer [26] to point clouds. We then explain the full PCT with its special attention mechanism and neighbor aggregation, which provide enhanced local information.

Point Cloud Processing with PCT
Encoder. The overall architecture of PCT is presented in Figure 2. PCT aims to transform (encode) the input points into a new higher-dimensional feature space which can characterize the semantic affinities between points, as a basis for various point cloud processing tasks. The encoder of PCT starts by embedding the input coordinates into a new feature space. The embedded features are then fed into 4 stacked attention modules to learn a semantically rich and discriminative representation for each point, followed by a linear layer to generate the output feature. Overall, the encoder of PCT shares almost the same design philosophy as the original Transformer, except that the positional embedding is discarded, since the point coordinates already contain this information. We refer the reader to [26] for details of the original NLP Transformer. Formally, given an input point cloud P ∈ R^{N×d} with N points, each having a d-dimensional feature description, a d_e-dimensional embedded feature F_e ∈ R^{N×d_e} is first learned via the Input Embedding module. The point-wise d_o-dimensional feature representation F_o ∈ R^{N×d_o} output by PCT is then formed by concatenating the attention output of each attention layer through the feature dimension, followed by a linear transformation:

F_1 = AT^1(F_e),
F_i = AT^i(F_{i-1}), i = 2, 3, 4,
F_o = concat(F_1, F_2, F_3, F_4) · W_o,

where AT^i represents the i-th attention layer, each having the same output dimension as its input, and W_o is the weight matrix of the linear layer. Various implementations of input embedding and attention will be explained later.
To extract an effective global feature vector F g representing the point cloud, we choose to concatenate the outputs from two pooling operators: a max-pooling (MP) and an average-pooling (AP) on the learned point-wise feature representation [29].
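The encoder wiring described above can be sketched in a few lines of NumPy. This is a hypothetical illustration only: the real attention layers and LBR blocks are replaced by a placeholder random linear map, and the channel sizes are illustrative rather than the exact ones in Figure 2.

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_layer(F):
    """Stand-in for one attention layer: output has the same shape as input.
    A real AT^i would be self- or offset-attention; here a fixed random
    linear map illustrates only the wiring, not the computation."""
    W = rng.standard_normal((F.shape[1], F.shape[1])) * 0.1
    return F @ W

N, d_e = 1024, 128
F_e = rng.standard_normal((N, d_e))          # embedded point features

# Four stacked attention layers: F_1 = AT^1(F_e), F_i = AT^i(F_{i-1})
feats = []
F = F_e
for _ in range(4):
    F = attention_layer(F)
    feats.append(F)

# Concatenate along the feature dimension, then apply the linear layer W_o
F_cat = np.concatenate(feats, axis=1)        # (N, 4 * d_e)
W_o = rng.standard_normal((4 * d_e, 4 * d_e)) * 0.1
F_o = F_cat @ W_o                            # point-wise features

# Global feature: concatenate max-pooling and average-pooling over points
F_g = np.concatenate([F_o.max(axis=0), F_o.mean(axis=0)])
print(F_o.shape, F_g.shape)
```

Because both pooling operators reduce over the point dimension, F_g is independent of the input point order, matching the permutation-invariance goal.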
Classification. The details of the classification network using PCT are shown in Figure 2. To classify a point cloud P into N_c object categories (e.g. desk, table, chair), we feed the global feature F_g to the classification decoder, which comprises two cascaded feed-forward neural networks, LBRs (combining Linear, BatchNorm (BN) and ReLU layers), each with a dropout probability of 0.5, finalized by a Linear layer predicting the final classification scores C ∈ R^{N_c}. The class label of the point cloud is determined as the class with the maximal score.
Segmentation. For the task of segmenting the point cloud into N_s parts (e.g. table top, table legs; a part need not be contiguous), we must predict a part label for each point. We first concatenate the global feature F_g with the point-wise features in F_o. To learn a common model for various kinds of objects, we also encode the one-hot object category vector as a 64-dimensional feature and concatenate it with the global feature, following most other point cloud segmentation networks [22]. As shown in Figure 2, the architecture of the segmentation network decoder is almost the same as that of the classification network, except that dropout is only performed on the first LBR. We then predict the final point-wise segmentation scores S ∈ R^{N×N_s} for the input point cloud; the part label of each point is determined as the one with the maximal score.
Normal estimation. For the task of normal estimation, we use the same architecture as for segmentation, setting N_s = 3 and omitting the object category encoding, and regard the output point-wise scores as the predicted normals.

Naïve PCT
The simplest way to modify Transformer [26] for point cloud use is to treat the entire point cloud as a sentence and each point as a word, an approach we now explain. This naïve PCT is achieved by implementing a coordinate-based point embedding and instantiating the attention layer with the self-attention introduced in [26].
First, we consider a naïve point embedding, which ignores interactions between points. Like word embedding in NLP, point embedding aims to place points closer in the embedding space if they are more semantically similar. Specifically, we embed a point cloud P into a d_e-dimensional space F_e ∈ R^{N×d_e}, using a shared neural network comprising two cascaded LBRs, each with a d_e-dimensional output. We empirically set d_e = 128, a relatively small value, for computational efficiency. We simply use a point's 3D coordinates as its input feature description (i.e. d_p = 3), as doing so still outperforms other methods, but additional point-wise input information, such as point normals, could also be used.
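A sketch of this naïve point embedding, in NumPy and illustrative only: BatchNorm is omitted from the LBRs and the weights are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)

def lbr(x, W, b):
    """Simplified LBR: Linear + ReLU (BatchNorm omitted for brevity)."""
    return np.maximum(x @ W + b, 0.0)

N, d_p, d_e = 1024, 3, 128
P = rng.standard_normal((N, d_p))            # raw xyz coordinates

# Two cascaded LBRs shared across all points (a per-point MLP)
W1, b1 = rng.standard_normal((d_p, d_e)) * 0.1, np.zeros(d_e)
W2, b2 = rng.standard_normal((d_e, d_e)) * 0.1, np.zeros(d_e)
F_e = lbr(lbr(P, W1, b1), W2, b2)            # embedded features (N, 128)

# "Shared" makes the embedding permutation-equivariant:
# shuffling the points shuffles the features identically
perm = rng.permutation(N)
assert np.allclose(lbr(lbr(P[perm], W1, b1), W2, b2), F_e[perm])
```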
For the naïve implementation of PCT, we adopt self-attention (SA) as introduced in the original Transformer [26]. Self-attention, also called intra-attention, is a mechanism that calculates semantic affinities between different items within a sequence of data. The architecture of the SA layer is depicted in Figure 3 by switching to the dotted data flows. Following the terminology in [26], let Q, K, V be the query, key and value matrices, respectively, generated by linear transformations of the input features F_in ∈ R^{N×d_e} as follows:

(Q, K, V) = F_in · (W_q, W_k, W_v), with Q, K ∈ R^{N×d_a} and V ∈ R^{N×d_e},

where W_q, W_k and W_v are the shared learnable linear transformations, and d_a is the dimension of the query and key vectors. Note that d_a may not equal d_e; in this work, we set d_a = d_e/4 for computational efficiency. First, the query and key matrices are used to calculate the attention weights via the matrix dot-product:

Ã = (α̃)_{i,j} = Q · K^T.

These weights are then normalized (denoted SS in Figure 3) to give A = (α)_{i,j}, by scaling by 1/√d_a and applying softmax along each row:

ᾱ_{i,j} = α̃_{i,j} / √d_a,
α_{i,j} = softmax_j(ᾱ_{i,j}) = exp(ᾱ_{i,j}) / Σ_k exp(ᾱ_{i,k}).

The self-attention output features F_sa are the weighted sums of the value vectors using the corresponding attention weights:

F_sa = A · V.

As the query, key and value matrices are determined by the shared linear transformations and the input feature F_in, they are all order-independent. Moreover, softmax and weighted sum are both permutation-independent operators. Therefore, the whole self-attention process is permutation-invariant, making it well-suited to the disordered, irregular domain presented by point clouds.
Finally, the self-attention feature F_sa and the input feature F_in are further used to provide the output feature F_out for the whole SA layer through an LBR network:

F_out = SA(F_in) = LBR(F_sa) + F_in.
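The SA computation can be sketched in NumPy as follows. This is a minimal illustration with random weights; the final LBR network is omitted. The permutation-invariance argument is also checked numerically at the end.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

N, d_e = 256, 128
d_a = d_e // 4                               # d_a = d_e / 4 as in the paper
F_in = rng.standard_normal((N, d_e))
W_q = rng.standard_normal((d_e, d_a)) * 0.1
W_k = rng.standard_normal((d_e, d_a)) * 0.1
W_v = rng.standard_normal((d_e, d_e)) * 0.1

Q, K, V = F_in @ W_q, F_in @ W_k, F_in @ W_v
A = softmax(Q @ K.T / np.sqrt(d_a), axis=1)  # scale, then row-wise softmax
F_sa = A @ V                                 # weighted sum of value vectors

# Permuting the input rows permutes the output rows identically,
# so the per-point attention features are order-independent
perm = rng.permutation(N)
F_p = F_in[perm]
A_p = softmax((F_p @ W_q) @ (F_p @ W_k).T / np.sqrt(d_a), axis=1)
assert np.allclose(A_p @ (F_p @ W_v), F_sa[perm], atol=1e-8)
```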

Offset-Attention
Graph convolution networks [3] show the benefit of using a Laplacian matrix L = D − E to replace the adjacency matrix E, where D is the diagonal degree matrix. Similarly, we find that we can obtain better network performance if, when applying Transformer to point clouds, we replace the original self-attention (SA) module with an offset-attention (OA) module to enhance our PCT. As shown in Figure 3, the offset-attention layer calculates the offset (difference) between the self-attention features and the input features by element-wise subtraction. This offset feeds the LBR network in place of the SA feature used in the naïve version:

F_out = OA(F_in) = LBR(F_in − F_sa) + F_in.

F_in − F_sa is analogous to a discrete Laplacian operator, as we now show. From the definitions of Q, K, V and F_sa above, the following holds:

F_in − F_sa = F_in − A·V = F_in − A·F_in·W_v ≈ F_in − A·F_in = (I − A)·F_in ≈ L·F_in.

Here, W_v is ignored since it is a weight matrix of the Linear layer. I is an identity matrix comparable to the diagonal degree matrix D of the Laplacian matrix, and A is the attention matrix comparable to the adjacency matrix E.
In our enhanced version of PCT, we also refine the normalization. Instead of scaling by 1/√d_a and applying softmax along the second dimension as in the traditional Transformer, we use the softmax operator on the first dimension and an l1-norm on the second dimension to normalize the attention map:

ᾱ_{i,j} = softmax_i(α̃_{i,j}) = exp(α̃_{i,j}) / Σ_k exp(α̃_{k,j}),
α_{i,j} = ᾱ_{i,j} / Σ_k ᾱ_{i,k}.

This sharpens the attention weights and reduces the influence of noise, which is beneficial for downstream tasks. Figure 1 shows example offset-attention maps. It can be seen that the attention maps for different query points vary considerably, but are generally semantically meaningful. We refer to this refined PCT, i.e. with point embedding and OA layer, as simple PCT (SPCT) in the experiments.
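A NumPy sketch of offset-attention with this dual normalization follows (random weights, LBR omitted; purely illustrative). Note that after the l1 step each row of A sums to 1, which is what lets the identity matrix play the role of the degree matrix in the Laplacian analogy.

```python
import numpy as np

rng = np.random.default_rng(3)

N, d_e = 256, 128
d_a = d_e // 4
F_in = rng.standard_normal((N, d_e))
W_q = rng.standard_normal((d_e, d_a)) * 0.1
W_k = rng.standard_normal((d_e, d_a)) * 0.1
W_v = rng.standard_normal((d_e, d_e)) * 0.1

Q, K, V = F_in @ W_q, F_in @ W_k, F_in @ W_v
E = Q @ K.T                                   # raw attention logits

# Offset-attention normalization: softmax over the first dimension,
# then l1-normalize along the second dimension
A = np.exp(E - E.max(axis=0, keepdims=True))
A = A / A.sum(axis=0, keepdims=True)          # softmax over dim 0
A = A / A.sum(axis=1, keepdims=True)          # l1-norm over dim 1

F_sa = A @ V
offset = F_in - F_sa                          # fed to the LBR instead of F_sa
# Ignoring W_v, offset ≈ (I - A) F_in: a Laplacian-like operator, with the
# identity I standing in for the degree matrix since rows of A sum to 1.
print(offset.shape)
```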

Neighbor Embedding for Augmented Local Feature Representation
PCT with point embedding is an effective network for extracting global features. However, it ignores local neighborhood information, which is also essential in point cloud learning. We draw upon the ideas of PointNet++ [22] and DGCNN [29] to design a local neighbor aggregation strategy, neighbor embedding, which optimizes the point embedding to augment PCT's ability to extract local features. As shown in Figure 4, the neighbor embedding module comprises two LBR layers and two SG (sampling and grouping) layers. The LBR layers act as the basic point embedding of Section 3.2. We use two cascaded SG layers to gradually enlarge the receptive field during feature aggregation, as is done in CNNs. Each SG layer aggregates features from the local neighbors of each sampled point, found by k-NN search using Euclidean distance.
More specifically, assume that an SG layer takes a point cloud P with N points and corresponding features F as input, and outputs a sampled point cloud P_s with N_s points and corresponding aggregated features F_s. First, we adopt the farthest point sampling (FPS) algorithm [22] to downsample P to P_s. Then, for each sampled point p ∈ P_s, let knn(p, P) be its k nearest neighbors in P. We then compute the output feature F_s as follows:

ΔF(p) = concat_{q ∈ knn(p, P)} (F(q) − F(p)),
F̃(p) = concat(ΔF(p), RP(F(p), k)),
F_s(p) = MP(LBR(LBR(F̃(p)))),

where F(p) is the input feature of point p, F_s(p) is the output feature of sampled point p, MP is the max-pooling operator, and RP(x, k) is the operator that repeats a vector x k times to form a matrix. The idea of concatenating the feature of a sampled point with those of its neighbors is drawn from EdgeConv [29].
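The SG layer can be sketched as follows; this is an illustrative NumPy implementation in which the two LBR layers applied before max-pooling are omitted, so the output channel count is simply twice the input's.

```python
import numpy as np

rng = np.random.default_rng(4)

def farthest_point_sampling(P, n_samples):
    """Greedy FPS: iteratively pick the point farthest from those chosen."""
    idx = [0]
    d = np.linalg.norm(P - P[0], axis=1)
    for _ in range(n_samples - 1):
        nxt = int(d.argmax())
        idx.append(nxt)
        d = np.minimum(d, np.linalg.norm(P - P[nxt], axis=1))
    return np.array(idx)

def sg_layer(P, F, n_samples, k):
    """Sample with FPS, group with k-NN, concatenate (neighbor - center)
    differences with the repeated center feature RP(F(p), k), then max-pool
    per group. The LBR layers of the full SG module are omitted."""
    s_idx = farthest_point_sampling(P, n_samples)
    out = []
    for i in s_idx:
        d = np.linalg.norm(P - P[i], axis=1)
        knn = np.argsort(d)[:k]                        # k nearest neighbors
        center = np.repeat(F[i][None, :], k, axis=0)   # RP(F(p), k)
        grouped = np.concatenate([F[knn] - center, center], axis=1)
        out.append(grouped.max(axis=0))                # max-pool over group
    return P[s_idx], np.stack(out)

N, d = 1024, 64
P = rng.standard_normal((N, 3))
F = rng.standard_normal((N, d))
P_s, F_s = sg_layer(P, F, n_samples=512, k=32)
print(P_s.shape, F_s.shape)  # (512, 3) (512, 128)
```

A production implementation would batch the k-NN search (e.g. with a KD-tree) rather than loop over sampled points.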
We use different architectures for the tasks of point cloud classification, segmentation and normal estimation. For point cloud classification, we only need to predict a global class for all points, so the point cloud size is decreased to 512 and then 256 points within the two SG layers.
For point cloud segmentation or normal estimation, we need to determine point-wise part labels or normals, so the process above is used only for local feature extraction without reducing the point cloud size, which can be achieved by setting the output at each stage to still be of size N.

Experiments
We now evaluate the performance of naïve PCT (NPCT, with point embedding and self-attention), simple PCT (SPCT, with point embedding and offset-attention) and full PCT (with neighbor embedding and offset-attention) on two public datasets, ModelNet40 [32] and ShapeNet [37], giving a comprehensive comparison with other methods. The same soft cross-entropy loss function as [29] and the stochastic gradient descent (SGD) optimizer with momentum 0.9 were adopted for training in each case. Other training parameters, including the learning rate, batch size and input format, were particular to each specific dataset and are given later.

Classification on ModelNet40 dataset
ModelNet40 [32] contains 12,311 CAD models in 40 object categories; it is widely used in point cloud shape classification and surface normal estimation benchmarking. For a fair comparison, we used the official split, with 9,843 objects for training and 2,468 for evaluation. The same sampling strategy as used in PointNet [21] was adopted to uniformly sample each object to 1,024 points. During training, a random translation in [−0.2, 0.2], a random anisotropic scaling in [0.67, 1.5] and a random input dropout were applied to augment the input data. During testing, no data augmentation or voting methods were used. For all three models, the mini-batch size was 32, 250 training epochs were used, and the initial learning rate was 0.01, with a cosine annealing schedule to adjust the learning rate at every epoch.
Experimental results are shown in Table 1. Compared to PointNet and NPCT, SPCT makes a 2.8% and 1.0% improvement respectively. PCT achieves the best result of 93.2% overall accuracy. Note that our network currently does not consider normals as inputs, which could in principle further improve performance.

Normal estimation on ModelNet40 dataset
Surface normal estimation aims to determine the normal direction at each point. Estimating surface normals has wide applications, e.g. in rendering. The task is challenging because it requires the approach to fully understand the underlying shape.

Segmentation task on ShapeNet dataset
Point cloud segmentation is a challenging task which aims to divide a 3D model into multiple meaningful parts. We performed an experimental evaluation on the ShapeNet Parts dataset [37], which contains 16,880 3D models with a training-to-testing split of 14,006 to 2,874. It has 16 object categories and 50 part labels; each instance contains no fewer than two parts. Following PointNet [21], all models were downsampled to 2,048 points, retaining point-wise part annotations. During training, random translation in [−0.2, 0.2] and random anisotropic scaling in [0.67, 1.5] were applied to augment the input data. During testing, we used a multi-scale testing strategy, with scales in [0.7, 1.4] with a step of 0.1. For all three models, the batch size, training epochs and learning rates were set the same as for the normal estimation task.
Table 3 shows the class-wise segmentation results. The evaluation metric used is part-average Intersection-over-Union, given both overall and for each object category. The results show that our SPCT makes an improvement of 2.1% and 0.6% over PointNet and NPCT respectively. PCT achieves the best results, with 86.4% part-average Intersection-over-Union. Figure 5 shows further segmentation examples produced by PointNet, NPCT, SPCT and PCT.

Semantic segmentation task on S3DIS dataset
S3DIS is an indoor scene dataset for point cloud semantic segmentation. It contains 6 areas and 271 rooms, and each point in the dataset is assigned to one of 13 categories. For a fair comparison, we use the same data processing method as [21]. Table 4 shows that our PCT achieves superior performance compared to previous methods.

Computational requirements analysis
We now consider the computational requirements of NPCT, SPCT, PCT and several other methods by comparing the floating point operations (FLOPs) and number of parameters (Params) required, in Table 5. SPCT has the lowest memory requirements, with only 1.36M parameters, and also puts a low load on the processor, with only 1.82 GFLOPs, yet delivers highly accurate results. These characteristics make it suitable for deployment on mobile devices. PCT has the best performance, yet modest computational and memory requirements. If we pursue higher performance and disregard the amount of computation and parameters, we can add a further neighbor embedding layer in the input embedding module. The results of this 3-layer embedding PCT are shown in Tables 6 and 7.

Conclusion
In this paper, we have proposed a permutation-invariant point cloud transformer, which is suitable for learning on unstructured point clouds with irregular domains. The proposed offset-attention and normalization mechanisms help to make our PCT effective. Experiments show that PCT has good semantic feature learning capability, and achieves state-of-the-art performance on several tasks, in particular shape classification, part segmentation and normal estimation.
Transformer has already revealed powerful capabilities given large amounts of training data. At present, the available point cloud datasets are very limited compared to image datasets. In the future, we will train PCT on larger datasets and study its advantages and disadvantages with respect to other popular frameworks. Moreover, the encoder-decoder structure of Transformer supports more complex tasks, such as point cloud generation and completion; we will extend PCT to such applications.

Figure 1. Attention map and part segmentation generated by PCT. First three columns: point-wise attention maps for different query points (indicated by $), yellow to blue indicating increasing attention weight. Last column: part segmentation results.

Figure 2. PCT architecture. The encoder mainly comprises an Input Embedding module and four stacked Attention modules. The decoder mainly comprises multiple Linear layers. Numbers above each module indicate its output channels. MA-Pool concatenates Max-Pool and Average-Pool. LBR combines Linear, BatchNorm and ReLU layers. LBRD means LBR followed by a Dropout layer.

Figure 3. Architecture of Offset-Attention. Numbers above tensors are the number of points N and the feature channels D/D_a, with switches showing alternatives of Self-Attention or Offset-Attention: dotted lines indicate Self-Attention branches.

Figure 4. Left: Neighbor Embedding architecture; Middle: SG module with N_in input points, D_in input channels, k neighbors, N_out output sampled points and D_out output channels; Top-right: example of sampling (colored balls represent sampled points); Bottom-right: example of grouping with k-NN neighbors. Number above LBR: number of output channels. Number above SG: number of sampled points and its output channels.

Table 1. Comparison with state-of-the-art methods on the ModelNet40 classification dataset. Accuracy means overall accuracy. All results quoted are taken from the cited papers. P = points, N = normals.

Table 3. Comparison on the ShapeNet part segmentation dataset. pIoU means part-average Intersection-over-Union. All results quoted are taken from the cited papers.

Table 6. Comparison on the ModelNet40 classification dataset. PCT-2L means PCT with 2-layer neighbor embedding and PCT-3L means PCT with 3-layer neighbor embedding. Accuracy means overall accuracy. P = points.

Table 7. Comparison on the ShapeNet part segmentation dataset. pIoU means part-average Intersection-over-Union. PCT-2L means PCT with 2-layer neighbor embedding and PCT-3L means PCT with 3-layer neighbor embedding.