A full-set tooth segmentation model based on improved PointNet++

Segmentation of a complete set of teeth from three-dimensional (3D) intra-oral scanner images is a crucial step in tooth identification procedures. In large-scale disasters with many victims, teeth are often the preferred and reliable source for victim identification due to their hard and non-deformable characteristics. In this paper we present a study on the automatic segmentation of a complete set of teeth from intra-oral scanner images. We propose a tooth segmentation method based on an improved PointNet++ architecture. To address the problem of inadequate segmentation capability of the teeth-gingival boundary of PointNet++, we introduce a single-point preliminary feature extraction (SPFE) module to better preserve the subtle details that may be overlooked by the original PointNet++ model. In addition, a weighted-sum local feature aggregation (WSLFA) mechanism is proposed to replace the max pooling in PointNet++ to better perform feature aggregation. The experimental results on 52 testing datasets using the network trained on 160 annotated 3D intra-oral scanner images demonstrate that our improved PointNet++ method achieves a segmentation accuracy of 97.68%, and performs well under different dental conditions.


Introduction
Three-dimensional (3D) intra-oral scanning (IOS) is a compact optical scanning technology that allows clinicians to use digital intra-oral scanners to obtain relevant information about teeth, mucosa, and the associated soft and hard tissues, generating a 3D model of the oral cavity. It is commonly used to assist in oral examinations, teeth alignment, restoration, and treatment. Compared to cone beam computerized tomography (CBCT), 3D IOS has many advantages, such as no radiation exposure and easy acquisition. With the development of medical technology and people's increasing attention to their oral health, IOS technology is being widely used by orthodontists to significantly improve treatment efficiency in modern dentistry.
The tooth part segmented from the intra-oral scanner images serves as a personalized structure that can be used for personal identification. Tooth identification is of great significance in the identification of victims of natural disasters or crimes because teeth, as one of the hardest tissues in the human body, are not easily deformed, are highly individualized, and can be well preserved after severe disasters or violent crimes. The percentage of victims identified using tooth identification methods in some large-scale disasters ranges from 60.63% to 100% [1]. As soft tissues such as gums are prone to deformation and decay, tooth identification requires tooth segmentation technology to accurately segment the entire set of teeth from the intra-oral scanner images. Therefore, we study the accurate segmentation of the entire set of teeth from the intra-oral scanner images for tooth identification purposes.
Automated segmentation of teeth from intra-oral scanner images is a challenging task due to the complex boundary between teeth and gingiva, as well as the significant variations in tooth shapes and appearances among different subjects, such as missing or misaligned teeth. Early tooth segmentation methods often relied on hand-crafted features, including curvature-based methods [2], skeleton-based methods [3], and harmonic field-based methods [4]. However, these methods lack robustness and are difficult to adapt to the diverse tooth arrangements of different individuals, often requiring human interaction to complete the segmentation. With the development of 3D deep learning techniques, many deep learning-based tooth segmentation methods have been proposed. One approach is to transform the unorganized 3D intra-oral scanner images (point cloud or mesh data) into two-dimensional (2D) images [5] or octree grids [6], and then use a 2D or 3D convolutional neural network (CNN) for segmentation. However, these methods generate additional computational load and cause some information loss due to the conversion of data. Another approach is to directly apply deep learning networks to point cloud or mesh models for segmentation. PointNet [7] and PointNet++ [8] are representative methods for point cloud segmentation that use a multi-layer perceptron (MLP) and max pooling for feature extraction. To extract features at different scales, PointNet++ also employs a multi-scale local feature extraction strategy. However, PointNet++ divides the point cloud model into overlapping local regions and extracts only the most distinctive features within each region using max pooling. This approach may not accurately capture important features at the gingival boundary of each individual tooth, leading to a coarse segmentation result. Lian et al. [9] designed MeshSegNet for tooth segmentation based on the mesh structure of 3D intra-oral scanner images. This model uses a graph neural network (GNN) to process the mesh structure, which operates on the graph representation of the mesh. However, MeshSegNet does not have the encoder-decoder structure of PointNet++, which means that the resolution of the input mesh model is not compressed throughout the network, leading to a higher number of parameters compared with PointNet++. Simplification of the input mesh model is usually necessary for this network to be used. Some studies attempt to simplify the structure of networks such as CNNs and Transformers. Li et al. [10] used dynamic networks to reduce computational redundancy by automatically adjusting their architectures for different inputs, and they made further improvements to dynamic networks by pre-defining dense weight slices of varying importance in a dynamic super-net using nested residual learning.
In this paper, we propose an improved network structure based on PointNet++ for the full-set tooth segmentation of 3D intra-oral scanner images. To address the problem of inadequate segmentation capability of the teeth-gingival boundary of PointNet++, a single-point preliminary feature extraction (SPFE) module is added to better preserve the subtle details that may be overlooked by the original PointNet++ model. In addition, inspired by Li et al. [10] using dynamic weights to adjust network architectures, we use dynamic weights to aggregate features and propose a weighted-sum local feature aggregation (WSLFA) mechanism to replace the max pooling in PointNet++, thus enabling better feature aggregation. The proposed method can achieve an accuracy of 97.68% for tooth segmentation.

Point cloud deep learning
A point cloud is a collection of points in space used to represent a 3D shape. Due to the unordered and unstructured nature of point clouds, it is difficult to directly apply standard CNNs in the task of tooth segmentation. The PointNet series utilizes symmetric operations to handle the disorder and lack of structure of point clouds for classification and segmentation tasks. Specifically, PointNet [7] did pioneering work by using an MLP to extract features from each point and aggregating features using max pooling. Since the same MLP is applied to every point independently and max pooling is a symmetric operation, the aggregated feature does not depend on the order of the points, which provides the permutation invariance that point clouds require. PointNet++ [8] divides the point cloud into hierarchical groups and uses the same MLP and max pooling as PointNet does to extract features at different levels. Features learned from multiple scales and layers are combined to obtain better robustness. Other methods attempt to apply convolution on point clouds, such as PointCNN [11], which uses an MLP to learn a transformation matrix, normalizes the point cloud with this matrix, and then extracts features using a CNN. In addition, some graph-based methods treat each point in a point cloud as a vertex in a graph and establish edges between these vertices to create a graph structure. For example, edge conditioned convolution (ECC) [12] performs convolution-like operations on graph-structured data in spatial domains. DGCNN [13] constructs directed graphs in both the original point cloud and feature space, and dynamically updates features after each layer in the network. EdgeConv, which was proposed in DGCNN, captures local geometric structures and is dynamically implemented in each layer of the network. Furthermore, LDGCNN [14] removes the transformation network from DGCNN and links hierarchical features learned from different layers in DGCNN to improve performance and reduce model size.
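The permutation invariance obtained from a shared per-point function followed by max pooling can be illustrated with a minimal sketch. The function `pointwise_feature` below is a hand-written stand-in for a learned shared MLP and is purely an assumption for illustration; it is not the feature extractor of PointNet or of the proposed model.

```python
import random

def pointwise_feature(point):
    """Toy stand-in for a shared MLP: the same function is applied to every point."""
    x, y, z = point
    return [x + y, y * z, abs(x - z)]

def max_pool(features):
    """Channel-wise maximum over all points, a symmetric (order-independent) operation."""
    return [max(channel) for channel in zip(*features)]

def global_feature(points):
    """Per-point features followed by max pooling, as in the PointNet family."""
    return max_pool([pointwise_feature(p) for p in points])

random.seed(0)
cloud = [(random.random(), random.random(), random.random()) for _ in range(100)]
shuffled = cloud[:]
random.shuffle(shuffled)

# The pooled feature is identical for any ordering of the same points.
same = global_feature(cloud) == global_feature(shuffled)
```

Because each point is processed independently and the maximum ignores order, reordering the input leaves the pooled feature unchanged.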

3D intra-oral scanner images segmentation
The traditional method for segmenting 3D intra-oral scanner images usually involves pre-defining geometric standards to separate teeth from the intra-oral scanner images. For example, Zou et al. [4] used a harmonic field defined on a triangular mesh to iteratively annotate teeth on the tooth surface model. Kumar et al. [2] adopted curvature to segment teeth. Wu et al. [3] defined the morphologic skeleton of the scanned tooth mesh and used region growing operations to segment teeth from the intra-oral scanner images iteratively. Although these methods are intuitive, they typically depend on expert prior knowledge, require tedious manual operations, and are sensitive to changes in surface appearance. To fully automate tooth segmentation and improve segmentation robustness, an increasing number of deep learning methods are being applied to precise segmentation of teeth from 3D intra-oral scanner images. Xu et al. [5] developed a two-stage tooth segmentation model that includes teeth-gingival segmentation and inter-teeth segmentation. The method first extracts 600-dimensional geometric features (coordinates, curvature, principal component analysis (PCA), etc.) for each facet of intra-oral scanner images and packs them into a 20 × 30 image, and then performs segmentation using the two-stage CNN network. However, this method ignores the disorder and different packing orders of the hand-designed geometric features, which affects the segmentation results. Tian et al. [6] applied sparse octree methods to voxelize unordered 3D meshes and then used a 3D CNN for tooth segmentation, but voxelization can cause loss of model information. Lian et al. [9] designed MeshSegNet, which uses the characteristics of mesh models to combine PointNet with graphs and a multi-scale graph-constrained learning module for simulating CNN multi-scale feature extraction. Li et al. [15] established a multi-scale bilateral enhancement network and adopted a bilateral enhancement module for multi-scale feature extraction. However, these two methods produce a large number of model parameters. As highly accurate intra-oral scanner images may have a large number of mesh grids, simplification of the input mesh model is usually necessary. Other scholars have used instance segmentation methods to segment individual teeth to avoid the problem of an uncertain number of semantic classes caused by different numbers of teeth. For example, Zanjani et al. [16] presented Mask-MCNet, which for the first time applies instance segmentation to 3D intra-oral scanner images. The network first predicts 3D bounding boxes of teeth, and then performs segmentation of the points that belong to each individual tooth instance. Tian et al. [17] introduced a point cloud-based 3D tooth instance segmentation method and used an instance-aware module based on attention mechanisms to extract local and global features to better distinguish different tooth instances. Cui et al. [18] proposed TSegNet, which represents tooth segmentation as two sub-problems, tooth centroid prediction and individual tooth segmentation, in order to segment 3D tooth models quickly and accurately. These works segment individual teeth from intra-oral scanner images, and the segmented models are primarily used for orthodontics, dental diagnosis, etc. Multi-modal learning, such as utilizing visual content from videos in unsupervised machine translation [19], has also been applied to tooth segmentation. Jang et al. [20] used both 2D and 3D images for tooth segmentation and developed a hierarchical multi-step model that first generates regions of interest from 2D images and then performs segmentation on 3D models.
In order to facilitate the use of teeth for identification in forensic medicine, our paper aims to segment the full set of teeth from the intra-oral scanner images to retain the holistic identification features. Therefore, we do not conduct experiments on the segmentation of a single tooth, but instead conduct experiments on the segmentation of the full set of teeth.

Full-set tooth segmentation model based on improved PointNet++
The network structure of the proposed model is illustrated in Fig. 1. Similar to PointNet++, our network has an encoder-decoder structure. In the encoder part, the input intra-oral scanner images are gradually down-sampled and local features are extracted, with the block responsible for down-sampling and local feature extraction called the set abstraction (SA) layer. In the decoder part, up-sampling is performed to restore the original model resolution, with the block responsible for up-sampling and feature backpropagation called the feature propagation (FP) layer. In this paper, we propose the following two improvements based on PointNet++:
1) A single-point preliminary feature extraction (SPFE) module is added to address the problem that PointNet++ extracts local region features directly and thereby ignores subtle details, allowing detailed information to be better preserved.
2) A weighted-sum local feature aggregation (WSLFA) mechanism is proposed to better balance the fusion of various useful information in the local region, so that the features of teeth-gingival boundaries, which are more useful for segmentation, are better preserved.
Let N be the number of points in the input intra-oral scanner images. Before down-sampling, an SPFE module is applied to extract an N × 64 feature matrix. The encoder part includes three SA layers. As demonstrated in Fig. 2, each SA layer first constructs local regions by down-sampling $N_i$ center points from the output of the previous layer and constructing a spherical neighborhood of radius r around each center point. Then, for each point in the local region, the network learns a feature vector and a weight. These feature vectors and weights are used to compute a weighted sum of the features of all points in the region, yielding a single aggregated feature that represents the entire spherical region. The $N_i$ feature vectors obtained from one SA layer are sent to the next SA layer for further down-sampling and feature extraction. The decoder part includes three FP layers that gradually restore the original number of points of the input model. As presented in Fig. 2, each FP layer first interpolates to restore the number of points of the previous SA layer, and then makes a skip connection with that SA layer. Next, an MLP is applied to learn a new feature vector that is sent to the next FP layer. Finally, after the original number of points is restored, an MLP compresses the feature dimension to two categories (teeth and gingiva), outputs the probability of each category, and predicts the category label for each point.

Network input and single-point preliminary feature extraction
The input of the network is the point cloud data of the intra-oral scanner images, which can be represented as an N × 9 matrix. To visualize the results of SPFE, the 64-dimensional features extracted are reduced to one-dimensional features by PCA. Figure 3(a) and Fig. 3(b) show two feature maps of different intra-oral scanner images after dimension reduction. The feature maps indicate that the SPFE module retains more detailed information of the teeth-gingival boundary, dental groove and dental gap, thus improving the network's ability to extract features from point clouds. Moreover, the SPFE module makes preliminary distinctions between different regions of teeth (in Fig. 3, the front teeth are red and the upper jaw is blue), which is beneficial for subsequent segmentation.
Figure 4(a) demonstrates the features extracted by the SA1 layer after using the SPFE module, while Fig. 4(b) shows the features extracted by the SA1 layer directly using the original 9-dimensional vector. Similarly, these 128-dimensional features are visualized after PCA dimensional reduction to one dimension. It is observed that the features of teeth and gingiva after using the SPFE module are more distinguishable compared with directly using the original 9-dimensional vector, so the SPFE module can enhance the effectiveness of subsequent SA1 layer feature extraction.

Local region construction
In the SA layer, local region construction is first performed to divide the point cloud into overlapping local regions as presented in Fig. 5, preparing for subsequent feature extraction. The input point cloud coordinates are down-sampled to obtain the center point of each local region; then, a sphere with a certain radius is constructed around each of these points. The number of sampled points $N_i$ and the sphere radius r in each SA layer are adjustable parameters. The sampling algorithm used is farthest point sampling (FPS) [6], which spreads the sampled points approximately uniformly over the point cloud. Each sampled center point and its spherical neighborhood constitute a local region, and representative features are extracted for each region.
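The two steps of local region construction, farthest point sampling followed by a spherical neighborhood query, can be sketched as follows. This is a minimal pure-Python illustration under simplifying assumptions: the helper names `farthest_point_sampling` and `ball_query` are hypothetical, the starting point is fixed at index 0, and no cap is placed on the number of neighbors per region, which practical implementations usually impose.

```python
import math

def farthest_point_sampling(points, n_samples, start=0):
    """Return indices of n_samples points chosen greedily to be far apart."""
    chosen = [start]
    # min_dist[i] is the distance from point i to its nearest chosen center so far.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < n_samples:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return chosen

def ball_query(points, center, radius):
    """Indices of all points lying within `radius` of `center`."""
    return [i for i, p in enumerate(points) if math.dist(p, center) <= radius]

# Toy cloud: ten points on a line, three regions of radius 2.
points = [(float(i), 0.0, 0.0) for i in range(10)]
centers = farthest_point_sampling(points, 3)
regions = [ball_query(points, points[c], 2.0) for c in centers]
```

In this toy case the centers are the two ends and the middle of the line, and the regions around indices 0 and 4 share point 2, which shows how neighboring regions can overlap.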

Local feature extraction and weighted-sum local feature aggregation
The local feature extraction and aggregation module is illustrated in Fig. 6. Let N be the number of input points.
where $p_{ij} - p_i$ represents the neighborhood point coordinates minus the corresponding center point coordinates; that is, in each spherical neighborhood, the coordinates of the points are expressed relative to the center point.
After extracting the features $f_{ij}$, it is necessary to aggregate these point features into global features representing the local regions. The PointNet series adopts max pooling; however, max pooling captures only the most distinctive feature in the region and cannot retain more internal details of the region. Therefore, we propose an adaptive method, called weighted-sum local feature aggregation (WSLFA), which learns the weight of each feature in a sub-network and then performs weighted summation, thereby preserving the internal details of the region. Our method adaptively adjusts the weights during the training process, weighting the features of each part based on their contribution to segmentation. By comparison, PointNet++ does not distinguish the contribution of each part's features, which can cause effective features to be ignored and result in poor segmentation results.
Our method first uses the coordinates of each point $p_{ij}$ in the local region and its learned new feature $f_{ij}$ to learn a weight vector $\alpha_{ij}$ of the same dimension as $f_{ij}$. The quantities entering this computation include $p_{ij} - p_i$, the coordinate difference between the neighbor point and the corresponding center point, and $f_i^{mean}$, the mean of all features $f_{ij}$ in the region, that is:

$$f_i^{mean} = \frac{1}{k} \sum_{j=1}^{k} f_{ij}$$

The global feature $f_i$ of the region is obtained by weighted summation of the point features $f_{ij}$ and weight vectors $\alpha_{ij}$, expressed as:

$$f_i = \sum_{j=1}^{k} \alpha_{ij} \odot f_{ij}$$

where $\odot$ denotes the Hadamard product of a weight vector and a point feature.
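The difference between max pooling and a weighted sum with per-point weight vectors can be seen in a small numerical sketch. The weights below are hand-picked for illustration only; in the proposed model they are learned by a sub-network, and whether they are normalized is not stated here, so no normalization is assumed.

```python
def max_pool(features):
    """Channel-wise maximum over the points of one local region."""
    return [max(channel) for channel in zip(*features)]

def weighted_sum(features, weights):
    """Sum over points of the element-wise product of weight vector and feature."""
    dim = len(features[0])
    out = [0.0] * dim
    for f, a in zip(features, weights):
        for c in range(dim):
            out[c] += a[c] * f[c]
    return out

# One region with k = 3 points and 2-dimensional features.
features = [[1.0, 4.0], [3.0, 2.0], [2.0, 0.0]]
# Illustrative weight vectors, one per point, same dimension as the features.
weights = [[0.5, 0.0], [0.25, 1.0], [0.25, 0.5]]

pooled = max_pool(features)                 # keeps one value per channel
aggregated = weighted_sum(features, weights)  # every point contributes
```

Max pooling keeps a single value per channel, whereas the weighted sum lets every point in the region contribute in proportion to its weight.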
The weight vectors learned by the aforementioned process in the SA1 layer are illustrated in Fig. 7. The 128-dimensional weight vectors are shown after PCA dimensionality reduction to one dimension, and Fig. 7(a

Feature backpropagation
In the feature backpropagation stage, the locally aggregated features are gradually restored to the original size of the point cloud for segmentation prediction. This includes three steps: interpolation, skip connection, and MLP. The first step is to restore the output point number of the (l − 1)-th SA layer from the l-th SA layer through interpolation. Let the original point set be $P^l$ and the restored point set be $P^{l-1}$. Each point in $P^l$ contains a 3D coordinate $p_i^l$ and a feature vector $f_i^l$, and the restored coordinate $p_i^{l-1}$ is the same as the coordinate of the (l − 1)-th SA layer. The restored feature $f_i^{l-1}$ can be represented as the weighted average of the features of its three nearest original points. The feature $f_i^{l-1}$ obtained after interpolation is concatenated with the feature obtained from the (l − 1)-th SA layer through a skip connection, and the concatenated features are then compressed by an MLP to reduce the feature dimension. These operations are repeated until the original number of points N is restored. Finally, an MLP outputs the N × 2 segmentation prediction score matrix, which gives the predicted probabilities of the teeth and gingiva categories, and the category with the maximum predicted probability is selected as the final segmentation label for each point.
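The interpolation step can be sketched as a weighted average over the three nearest coarse points. The weighting scheme below, inverse squared distance as in the original PointNet++, is an assumption, since the exact weights are not given in the text; the function name `interpolate_feature` and the stabilizing constant `eps` are likewise illustrative.

```python
import math

def interpolate_feature(query, coarse_points, coarse_features, k=3, eps=1e-8):
    """Inverse-squared-distance weighted average of the k nearest coarse features."""
    nearest = sorted((math.dist(query, p), i) for i, p in enumerate(coarse_points))[:k]
    weights = [1.0 / (d * d + eps) for d, _ in nearest]
    total = sum(weights)
    dim = len(coarse_features[0])
    return [
        sum(w * coarse_features[i][c] for w, (_, i) in zip(weights, nearest)) / total
        for c in range(dim)
    ]

# Four coarse points with 1-dimensional features; the last one is far away.
coarse_points = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (5.0, 5.0, 5.0)]
coarse_features = [[3.0], [6.0], [9.0], [100.0]]

# Equidistant from the first three points: the result is their plain average.
centered = interpolate_feature((0.0, 0.0, 0.0), coarse_points, coarse_features)
# Coinciding with the first point: the result is dominated by that point's feature.
on_point = interpolate_feature((1.0, 0.0, 0.0), coarse_points, coarse_features)
```

The interpolated feature would then be concatenated with the skip-connection feature of the matching SA layer before the MLP, which is omitted from this sketch.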

Loss function
The loss function used in our model is the negative log-likelihood (NLL) loss. The model outputs a probability distribution over two classes (teeth and gingiva) for each point, and this distribution is used to measure the difference between the predicted results and the true labels. Specifically, the loss takes the negative log of the probability assigned to the true class label of each point and averages these values within a batch. The NLL loss for one point can be expressed as:

$$L = -\left[a \log p_1 + (1 - a) \log p_2\right]$$

where $p_1$ and $p_2$ represent the predicted probabilities of the point being teeth or gingiva, respectively, and a takes the value 0 or 1: a = 1 means the true label of the point is teeth and a = 0 means the true label is gingiva. When the prediction and the true label are not consistent, the probability assigned to the true class is small, which gives a large negative log probability and thus increases the NLL loss value. Therefore, by minimizing the NLL loss value, the model learns to predict the labels of the input samples more accurately.
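The behavior of the per-point NLL loss and its batch average can be checked with a short sketch. The function names are illustrative, and the probabilities are assumed to be strictly positive so that the logarithm is defined.

```python
import math

def nll_point(p_teeth, p_gingiva, a):
    """NLL for one point; a = 1 if the true label is teeth, a = 0 if gingiva."""
    return -(a * math.log(p_teeth) + (1 - a) * math.log(p_gingiva))

def nll_batch(probs, labels):
    """Average of the per-point NLL values over a batch of points."""
    return sum(nll_point(p1, p2, a) for (p1, p2), a in zip(probs, labels)) / len(labels)

# The same prediction (90% teeth) scored against both possible true labels.
loss_if_teeth = nll_point(0.9, 0.1, 1)    # prediction agrees with the label
loss_if_gingiva = nll_point(0.9, 0.1, 0)  # prediction contradicts the label

batch_loss = nll_batch([(0.9, 0.1), (0.2, 0.8)], [1, 0])
```

A confident prediction that contradicts the true label is penalized far more heavily than one that agrees with it, which is what drives the predicted probabilities toward the labels during training.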

Datasets
Our experimental data consist of 212 3D intra-oral scanner images that are manually labeled as either teeth or gingiva. A total of 160 of these examples are used for training, while the remaining 52 are used for testing. Specifically, we select 13 intra-oral scanner images with poor dental conditions (missing teeth, uneven dentition, etc.) as shown in Fig. 8 and use a total of 52 testing datasets to discuss the generalization and robustness of our method. Each model is sampled to contain 32,768 points, with each point containing 3D positional information (x, y, z) and a corresponding 3D normal vector. Additionally, the training data are augmented with the following operations: (1) random rotations and (2) random translations in coordinates. Each training example undergoes these two operations before participating in network training.

Figure 8 Intra-oral scanner images with poor dental condition
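The two augmentation operations can be sketched as follows. The rotation axis, angle range, and translation range are not specified in the text, so this sketch assumes a rotation about the z axis by a uniformly random angle and a shift of at most 0.1 per coordinate; these choices and the function names are illustrative only.

```python
import math
import random

def rotate_z(v, angle):
    """Rotate a 3D vector about the z axis by `angle` radians."""
    x, y, z = v
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def augment(points, normals, rng, max_shift=0.1):
    """Apply one random rotation and one random translation to a scan."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    shift = tuple(rng.uniform(-max_shift, max_shift) for _ in range(3))
    new_points = [
        tuple(a + b for a, b in zip(rotate_z(p, angle), shift)) for p in points
    ]
    # Normals are directions: they are rotated with the scan but never translated.
    new_normals = [rotate_z(n, angle) for n in normals]
    return new_points, new_normals

rng = random.Random(0)
points = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
normals = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
new_points, new_normals = augment(points, normals, rng)
```

Both operations are rigid, so distances between points and the unit length of the normals are preserved, and the per-point labels remain valid.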

Implementation details
The network is implemented using PyTorch and runs on a Tesla V100 GPU under the Ubuntu operating system. The Adam optimizer is used during training, with the NLL loss function and an initial learning rate of 0.001. The learning rate is multiplied by 0.7 every 20 epochs, with a minimum learning rate of 0.00001. The batch size is set to 4 during training, and the network is trained for a total of 100 epochs.
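The learning rate schedule can be written as a small function. This sketch assumes that the decay is applied as a step function at epochs 20, 40, 60, and 80, counting epochs from zero; the function name and this reading of the schedule are assumptions.

```python
def learning_rate(epoch, base_lr=0.001, gamma=0.7, step=20, min_lr=0.00001):
    """Step decay: multiply by `gamma` every `step` epochs, clipped at `min_lr`."""
    return max(base_lr * gamma ** (epoch // step), min_lr)

# Learning rate at a few epochs of the 100-epoch run.
schedule = {epoch: learning_rate(epoch) for epoch in (0, 19, 20, 40, 99)}
```

Under this reading, the learning rate at the end of training is about 0.00024, so the lower bound of 0.00001 would not be reached within 100 epochs.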

Experimental results
This section includes two experiments. Section 4.3.1 tests the effectiveness of our method on datasets with different dental conditions and compares it with other methods. Section 4.3.2 tests the effectiveness of our method under different sampling points compared with PointNet++.

Comparison of experimental results with other methods
Table 1 presents the experimental results of our method and other methods (PointNet, PointNet++ and PointCNN) on the whole dataset, while Table 2 shows the experimental results on the dataset with only poor tooth conditions. The training loss curve is demonstrated in Fig. 9. The number of points sampled for each layer in the model is 1024/512/256, and the radius of the spherical neighborhood for each layer is 0.05, 0.1, and 0.2 (normalized), respectively. Compared with PointNet++, our method achieves higher segmentation accuracy and mean intersection over union (mIoU), and the loss curve decreases more rapidly. Our method performs comparably well on datasets with different dental conditions, demonstrating its robustness. The visualization of the segmentation results with different dental conditions is presented in Fig. 10; our method achieves more accurate segmentation of the teeth-gingival boundary with smoother boundary curves regardless of the condition of the teeth. Due to the effects of SPFE and WSLFA, our method substantially reduces jagged edges and mis-segmentation compared with the original PointNet++.
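The two reported metrics can be computed from per-point predictions as in the following sketch. It assumes that mIoU is the unweighted mean of the per-class IoU over the two classes (0 for gingiva, 1 for teeth) on a single scan; how the values are averaged across scans is not specified in the text, and the label encoding is an assumption.

```python
def accuracy(pred, true):
    """Fraction of points whose predicted label matches the true label."""
    return sum(p == t for p, t in zip(pred, true)) / len(true)

def mean_iou(pred, true, classes=(0, 1)):
    """Unweighted mean over classes of intersection over union."""
    ious = []
    for c in classes:
        inter = sum(p == c and t == c for p, t in zip(pred, true))
        union = sum(p == c or t == c for p, t in zip(pred, true))
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious)

# Toy scan with 8 points: 0 = gingiva, 1 = teeth.
true = [1, 1, 1, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 0, 0, 0, 0, 1]

acc = accuracy(pred, true)   # 6 of 8 points correct
miou = mean_iou(pred, true)  # mean of 4/6 (gingiva) and 2/4 (teeth)
```

In this toy case the accuracy is 0.75 while the mIoU is about 0.58, which illustrates why mIoU is the stricter measure of boundary errors.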
Although our method has achieved good segmentation results on most of the teeth-gingival boundaries, it struggles to predict the wisdom teeth. This is because the degree of wisdom tooth eruption varies among individuals, and the intra-oral scanner images are blurrier and may be incomplete in the area of wisdom teeth and nearby gingiva. Therefore, our method sometimes mistakenly identifies a portion of the gingiva as wisdom teeth, as shown in Fig. 11. To avoid this issue, branch networks can be used to first predict the center points of each tooth, and then perform semantic segmentation, which is the next direction of our work. In addition, the intra-oral scanner images only include the crown portion of the teeth and cannot reveal the root portion beneath the gingiva; using CBCT and intra-oral scanner images for multimodal learning can compensate for this deficiency.

Segmentation performance comparison under different sampling points
Due to limitations in computational load, some devices find it difficult to use a large number of sampling points for segmentation, which results in a decrease in segmentation accuracy. To test the robustness of our method under different sampling points, we change the number of points and compare our method with PointNet++. The experimental results show that our method is less sensitive to changes in the number of sampled points compared with PointNet++. Figure 12 illustrates the comparison of mIoU for tooth segmentation using PointNet++ and our method under different sampled point numbers. The numbers of sampled points in the first, second, and third layers are 4096/1024/512, 1024/512/256, 512/256/128, and 256/128/64, respectively. When the number of sampled points decreases, PointNet++ shows a faster decline in mIoU, while our method is less sensitive to this variation, because SPFE and WSLFA can make better use of detailed information. When the number of sampled points decreases from 4096/1024/512 to 256/128/64, the mIoU of our method decreases by less than 2%. Figure 13 demonstrates the segmentation results of an intra-oral scanner image under different sampling points using our method and PointNet++. It can be observed that our method outperforms PointNet++ and has better robustness to the decrease in sampling points, especially on the teeth-gingival boundary.

Ablation study
We conduct ablation experiments on the SPFE and the WSLFA mechanism, as displayed in Table 3. The number of sampling points used by the model is 1024/512/256 for each layer, and the radius of each layer's spherical neighborhood is 0.05, 0.1, and 0.2 (normalized). Models 1, 2, 3, and 4 represent the model with neither SPFE nor WSLFA, the model with only WSLFA, the model with only SPFE, and the model with both SPFE and WSLFA, respectively. Models 1 and 3 use max pooling instead of WSLFA. Comparing Model 1 with Model 2 and Model 3 with Model 4 shows that adding WSLFA improves the segmentation results of Models 1 and 3, with the mIoU increasing by approximately 1%, because WSLFA makes better use of the information inside each local region. Comparing Models 1 and

Conclusion
In this paper, an improved PointNet++ based method is proposed for full-set tooth segmentation of 3D intra-oral scanner images. The method first extracts preliminary features from individual points, retaining detailed features as much as possible. Then, multi-scale local regions are constructed, and a weighted-sum local feature aggregation mechanism is proposed to better integrate various useful information in local regions. These two methods effectively address the problem of imprecise tooth segmentation and achieve good segmentation results in experiments on clinical data. For future research, adaptive adjustment of the feature aggregation radius will be considered to better adapt to the complex teeth-gingival boundaries and further improve the accuracy of the method, and branch networks can be used to improve the accuracy of wisdom tooth segmentation. In addition, post-processing methods such as conditional random fields can be added to refine the boundary curve and improve its smoothness. Based on the above tooth segmentation work, we will conduct research on identity recognition using the segmented tooth parts, with the aim of utilizing the tooth model to recognize identity.

Figure 1 Network structure of the improved PointNet++. SA represents set abstraction, FP represents feature propagation, and MLP denotes multi-layer perceptron

The input is an N × 9 matrix. Here, N represents the number of points in the point cloud model, and each point is represented by a 9-dimensional vector, including 3D coordinates, 3D normal vectors, and 3D zero-mean coordinates. In PointNet++, the input point cloud model directly enters the SA layer for down-sampling and extracting features representing local regions. This approach ignores the detailed information of the point cloud, which reduces the accuracy of PointNet++ in tooth segmentation. Therefore, we add an SPFE module before the SA layer. In this module, the 9-dimensional vector of each point is sent to the MLP for preliminary feature extraction, and a new 64-dimensional feature vector is obtained, which then enters the SA layer for local feature extraction. Because 64-dimensional features are extracted from each point, more detailed information can be mined. Every point in our method contributes to the segmentation, whereas in PointNet++ only the points that are sampled and lie in the designated local areas are used; thus our method addresses the problem of inaccurate segmentation in detailed areas.
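The per-point nature of the SPFE module can be sketched with a tiny shared MLP that maps each 9-dimensional point vector to 64 dimensions. Only the input and output sizes come from the text; the number of layers, the hidden width of 32, the ReLU activation, and the random untrained weights are all assumptions made for illustration.

```python
import random

def make_layer(n_in, n_out, rng):
    """Create one fully connected layer with small random weights and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def apply_layer(layer, x):
    """Linear map followed by ReLU."""
    weights, biases = layer
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]

def spfe(points, layers):
    """Apply the same MLP to every point independently (no neighborhood is used)."""
    out = []
    for x in points:
        for layer in layers:
            x = apply_layer(layer, x)
        out.append(x)
    return out

rng = random.Random(0)
layers = [make_layer(9, 32, rng), make_layer(32, 64, rng)]  # hidden width is assumed
cloud = [[rng.uniform(-1.0, 1.0) for _ in range(9)] for _ in range(5)]
features = spfe(cloud, layers)  # 5 points, each with a 64-dimensional feature
```

Since the mapping acts on each point separately, every input point receives its own feature before any down-sampling takes place, and reordering the points simply reorders the output.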

Figure 3 Visualization of feature maps of the SPFE module

After local region construction, the set of center points $P_{center} = \{p_1, p_2, \ldots, p_N\}$ contains a total of N sampled center point coordinates, where each center point $p_i$ has a neighborhood point set $P_i^{local} = \{p_{i1}, p_{i2}, \ldots, p_{ik}\}$ consisting of k neighboring point coordinates. Each point $p_{ij}$ has a feature vector $f_{ij}$ extracted from the previous layer, where the coordinate dimension is d and the feature dimension is C. The input size of this module is N × k × (d + C). Local feature extraction first extracts C-dimensional new

Figure 5 Local region construction. FPS represents farthest point sampling

) and Fig. 7(b) show two different intra-oral scanner images. Weights at the teeth-gingival boundary are larger (green),

Figure 7 Weight vectors learned in the SA1 layer using the weighted-sum local feature aggregation (WSLFA) mechanism

Figure 10 Visualization of the segmentation results

Figure 13 Segmentation results of an intra-oral scanner image under different sampling points using our method and PointNet++

Table 1
Experimental results of PointNet, PointCNN and PointNet++ on the whole dataset

Table 2
Experimental results of PointNet, PointCNN and PointNet++ on the dataset with poor dental condition

Figure 9 Loss curves of our method and PointNet++

Table 3
Ablation experiments