1 Introduction

Understanding indoor scenes is one of the essential tasks for computer vision and intelligence. The rapid development of depth sensors and three-dimensional (3D) scanners, such as RGB-D cameras and LiDAR, has increased interest in 3D indoor scene comprehension for a variety of applications, such as robotics [1], navigation [2] and augmented/virtual reality [3, 4]. The objective of 3D indoor scene understanding is to discern the geometric and semantic context of each interior scene component. There are a variety of 3D data formats, including depth images, meshes, voxels, and point clouds. Among them, point clouds are the most common non-discretized data representation in 3D applications and can be acquired directly by 3D scanners or reconstructed from stereo or multi-view images.

Point cloud segmentation, which attempts to decompose indoor scenes into meaningful parts and label each point, is a fundamental and indispensable step in understanding 3D indoor scenes. Point clouds provide the original spatial information, making them the preferred data format for segmenting indoor scenes. Segmentation of indoor scene point clouds can be divided into semantic segmentation and instance segmentation. Semantic segmentation assigns each point a scene-level object category label. Instance segmentation is more difficult and additionally requires individual object identification and localization. Unlike outdoor point cloud segmentation, which must address dynamic objects, indoor point cloud segmentation commonly handles cluttered man-made objects with regularly designed shapes. Indoor point cloud data are usually captured by consumer-level sensors with short ranges, while outdoor point clouds are commonly collected by LiDAR. Indoor point cloud segmentation faces several challenges. First, point cloud data are typically large in volume, with quality varying across sensors, which makes them difficult to process efficiently and annotate accurately. Second, indoor scenes are typically cluttered with severe occlusions, and it is challenging to accurately segment objects that are hidden or close together. Third, unlike the regular data structures of two-dimensional (2D) images, point cloud data are sparse and unorganized, making it difficult to apply sophisticated 2D segmentation methods directly to 3D point clouds. Moreover, annotating 3D data is time-consuming and labor-intensive, which limits fully supervised learning; existing indoor point cloud datasets are limited in size and suffer from long-tailed distributions.

Much effort has been devoted to the task of point cloud segmentation. Traditional geometry-based solutions for point cloud segmentation mainly include clustering-based, model-based, and graph-based methods [5]. The majority of these methods rely on hand-crafted features with heuristic geometric constraints. Deep learning has made significant progress in 2D vision [6–8], leading to advances in point cloud segmentation. In recent years, point cloud based deep neural networks [9] have demonstrated the ability to extract more powerful features and provide more reliable geometric cues for better understanding 3D scenes. Learning from 3D data has become a reality with the availability of public datasets such as ShapeNet, ModelNet, PartNet, ScanNet, Semantic3D, and KITTI. Recently, weakly supervised learning for point cloud segmentation has become a popular research topic, because it attempts to learn features from limited annotated data.

This paper provides a comprehensive review of point cloud segmentation for indoor 3D scene understanding, especially methods based on deep learning. We will introduce the primary datasets and methods used for indoor scene point cloud segmentation, analyze the current research trends in this area, and discuss future directions for development. The structure of this paper is as follows. Section 2 begins by introducing 3D indoor datasets that are used for understanding 3D scenes. Section 3 presents a brief review of geometry-based point cloud segmentation methods. Section 4 reviews indirect learning approaches with structured data. Section 5 provides a comprehensive survey of existing point cloud based learning frameworks employed for 3D scene segmentation. Section 6 introduces recent learning-based segmentation methods with multimodal data. Section 7 summarizes the performance of indoor point cloud segmentation using different methods. Section 8 discusses open questions and future research directions. Section 9 concludes the paper.

2 3D indoor scene point cloud datasets

The emergence of 3D datasets has led to the development of deep learning-based segmentation methods, which play a crucial role in advancing the field and promoting progress in research and applications. Public benchmarks have proven to be highly effective in facilitating framework evaluation and comparison. By providing real-world data with ground truth annotations, these benchmarks offer a foundation for researchers to test their algorithms and enable fair comparisons between different approaches. The two most commonly used 3D indoor scene point cloud datasets are ScanNet [10] and S3DIS [11].

ScanNet. ScanNet [10] is an RGB-D video dataset encompassing more than 2.5 million views across more than 1500 scans. This dataset is captured by RGB-D cameras and extensively annotated with essential information such as 3D camera poses, surface reconstructions, and instance-level semantic segmentations. It has led to remarkable advancements in state-of-the-art performance across various 3D scene understanding tasks, including object detection, semantic segmentation, instance segmentation, and computer-aided design (CAD) model retrieval. ScanNet v2, the revised release, contains 1,513 scans annotated with high surface coverage. For the semantic segmentation task, ScanNet v2 provides labels for 20 classes of 3D voxelized objects. Each class corresponds to a specific furniture category or room layout element, allowing for a more granular understanding and analysis of the captured indoor scenes. This makes ScanNet v2 one of the most active online evaluation benchmarks tailored for indoor scene semantic segmentation. Apart from the semantic segmentation benchmark, ScanNet v2 also provides benchmarks for instance segmentation and scene type classification.

ScanNet200. ScanNet200 [12] was developed on the basis of ScanNet v2 to overcome its limited set of 20 class labels. It significantly expands the number of classes to 200, an order of magnitude more than the previous version. This annotation enables a better capture and understanding of real-world indoor scenes with a more diverse range of objects. The new benchmark also allows for a more comprehensive analysis of performance across different object class distributions by splitting the 200 classes into three sets: the “head” set comprises the 66 most frequent categories, the “common” set consists of 68 less frequent categories, and the “tail” set contains the remaining categories.

S3DIS. The Stanford large-scale 3D indoor spaces dataset [11], known as S3DIS, acquired through the Matterport scanner, is another highly popular dataset that has been extensively employed in point cloud segmentation. This dataset comprises 272 room scenes divided into 6 distinct areas. Each point within the scene is assigned a semantic label corresponding to one of the 13 pre-defined categories, such as walls, tables, chairs, cabinets, and others. This dataset is specifically curated for large-scale indoor semantic segmentation.

Cornell RGBD. This dataset [13] provides 52 labeled indoor scenes comprising point clouds with RGB values, consisting of 24 labeled office scenes and 28 labeled home scenes. The point cloud data are generated from the original RGB-D images via the RGBD-SLAM method. The dataset contains approximately 550 views with 2495 labeled segments across 27 object classes, providing a valuable resource for research and development in indoor scene understanding.

Washington RGB-D dataset. This dataset [14] includes 14 indoor scene point clouds, which are obtained via RGB-D image registration and stitching. It provides annotations of 9 semantic category labels, such as sofas, teacups, and hats.

3 Geometry-based segmentation

Geometry-based solutions for understanding indoor scenes can be classified into clustering/region-growing methods and model-fitting methods. Most of these methods use handcrafted features with heuristic geometric constraints. The intuition behind geometry-based methods is that man-made environments normally consist of many geometric structures.

Clustering or region growing. These approaches assume that points in close proximity to each other are more likely to belong to the same object or surface. By considering the geometric properties of these neighboring points, such as spatial coordinates and surface normals, these methods can identify regions that share similarities in these properties. Mattausch et al. [15] proposed a method for segmenting indoor scenes by identifying repeated objects from multi-room indoor scanning data. To represent the indoor scenes, they employed a collection of nearly planar patches. These patches were clustered based on a patch similarity matrix, which was constructed using shape geometrical descriptors. Using this approach, the researchers aimed to effectively segment indoor scenes by exploiting the inherent repeated object structures. Hu et al. [16] partitioned point clouds into surface patches using the dynamic region growing method to generate initial segmentation. By leveraging this intermediate data representation, the model can better account for shape variations and enhance its ability to classify objects.
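To make the region-growing idea concrete, the sketch below (a minimal illustration, not a reproduction of any cited method) grows locally coherent regions over a k-nearest-neighbor graph using normal similarity; the neighborhood size and angle threshold are illustrative assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def region_growing(points, normals, k=16, angle_thresh_deg=15.0):
    """Group points into locally coherent regions by growing over a k-NN graph.

    points  : (N, 3) array of xyz coordinates
    normals : (N, 3) array of unit surface normals
    Returns an (N,) array of region ids (-1 means unassigned).
    """
    tree = cKDTree(points)
    _, knn = tree.query(points, k=k)          # indices of k nearest neighbors
    cos_thresh = np.cos(np.deg2rad(angle_thresh_deg))

    labels = -np.ones(len(points), dtype=int)
    region_id = 0
    for seed in np.argsort(points[:, 2]):     # simple seed order; any order works
        if labels[seed] != -1:
            continue
        stack = [seed]
        labels[seed] = region_id
        while stack:
            cur = stack.pop()
            for nb in knn[cur]:
                # add a neighbor if its normal agrees with the current point's normal
                if labels[nb] == -1 and abs(np.dot(normals[cur], normals[nb])) > cos_thresh:
                    labels[nb] = region_id
                    stack.append(nb)
        region_id += 1
    return labels
```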

Model fitting. Model fitting is proposed as a more efficient and robust strategy, particularly in scenarios where noise and outliers are present. Nan et al. [17] introduced a search-classify pipeline for scene modeling, utilizing pre-trained object categories to aid in the process. Similarly, Li et al. [18] proposed an object-retrieval approach that replaces scanned data with objects sourced from 3D shape databases. In another approach, Shi et al. [19] trained classifiers for both objects and object groups, which allows the decomposition of indoor sub-scenes. These methods rely primarily on the availability and diversity of current CAD datasets, which limits their effectiveness. An alternative strategy involves primitive-based approaches, where indoor scenes are decomposed into a collection of geometric primitives. With this strategy, researchers aim to capture the essential geometric information of a scene without relying heavily on extensive CAD datasets. The most widely used primitive fitting method is random sample consensus (RANSAC) [20]. Monszpart et al. [21] processed large-scale indoor point clouds via RANSAC-based plane fitting. Sun et al. [22] developed a graph-cut segmentation method to group primitives found in indoor sub-scenes; the primitives, such as planes and cylinders, were extracted by RANSAC and oriented by PCA. Yu et al. [23] further used a patch relation classifier to group planar patches and achieve instance segmentation.
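As a minimal illustration of RANSAC-based primitive fitting, the following sketch fits a single dominant plane in plain NumPy; the iteration count and distance threshold are illustrative assumptions, and in practice multiple primitives are extracted by repeatedly removing inliers and refitting.

```python
import numpy as np

def ransac_plane(points, n_iters=1000, dist_thresh=0.02, rng=None):
    """Fit a single dominant plane with RANSAC.

    points : (N, 3) array. Returns (normal, d, inlier_mask) for the plane
    n.x + d = 0 with the largest inlier set under the distance threshold.
    """
    rng = np.random.default_rng(rng)
    best_inliers = np.zeros(len(points), dtype=bool)
    best_plane = None
    for _ in range(n_iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-9:                       # degenerate (collinear) sample
            continue
        n /= norm
        d = -np.dot(n, p0)
        dist = np.abs(points @ n + d)         # point-to-plane distances
        inliers = dist < dist_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers, best_plane = inliers, (n, d)
    return best_plane[0], best_plane[1], best_inliers
```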

These geometry-based methods cannot directly assign object classes or instance labels to each point, and post-processing is normally required to produce the final segmentation. Recently, several approaches have used geometry-based segmentation as a pre-processing step to generate mid-level scene representations as inputs to deep learning frameworks. For instance, Huang et al. [24] clustered on-surface voxels to provide a compact representation of 3D scenes. Landrieu and Simonovsky [25] partitioned the scan data into superpoints, which are geometrically homogeneous elements. Cheng et al. [26] encoded a superpoint-level representation with a non-local operation at the neighborhood level. Deng et al. [27] proposed an iterative algorithm to generate superpoints by combining geometry-based and color-based region growing. Similarly, geometry-based superpoints have proven useful for handling large-scale point clouds and acting as priors in weakly supervised learning [28–31]. We will review these deep learning networks in detail in the following sections.

4 Learning-based segmentation with structured data

Unlike geometry-based methods that use hand-crafted features, learning-based approaches automatically extract latent features. However, point cloud feature learning remains difficult due to its unstructured and unordered data format. Since 3D scenes can be represented in several forms, such as multi-view images and voxels, it is natural to transform point clouds into structured data formats.

4.1 Image-based methods

The success and evolution of convolutional neural networks (CNNs) in the image domain have influenced the development of point cloud feature learning. It is common to transform 3D data into a structured multi-view representation [32, 33]. The idea is to use virtual cameras to capture scene point clouds from different angles, resulting in multi-view RGB images along with corresponding depth images. A CNN is then applied to extract features from the RGB images, and the results of downstream tasks are projected back into 3D space. For instance, MVCNN [32] virtually scans a 3D object from 12 angles to obtain rendered images, and features are extracted from each image via CNNs. These features are fused into a descriptor, which is used for object classification. SnapNet [33] also applies CNNs to multi-view images from point clouds. Unlike MVCNN, SnapNet selects a set of appropriate snapshots of the point cloud to generate an RGB image and corresponding depth map. A CNN processes the RGB-D inputs and produces pixel labels that are back-projected to the corresponding 3D points. However, these networks have some limitations for segmenting indoor scenes. Semantic segmentation based on multiple views has drawbacks, such as the need to choose viewpoints and the number of scans. Moreover, the process of projection and back-projection inevitably leads to some loss of information, especially structural features.

4.2 Voxel-based methods

Converting point clouds into voxel-based representations is another approach for addressing the challenges of regularizing 3D data [34–37]. VoxNet [34] generates 3D bounding boxes on point cloud segments and transforms them into volumetric grids to represent spatial occupancy. A 3D CNN is used to predict labels directly from the occupancy grid. SEGCloud [36] utilizes a voxel-based CNN to segment indoor scenes. The idea is to pre-process the input point cloud by dividing the indoor scene into voxels. A 3D fully convolutional network is then employed to generate voxel labels, which are assigned to the corresponding points using trilinear interpolation. VV-Net [35] utilizes a radial basis function-based variational autoencoder network for point cloud processing. Unlike binary voxel-based representations, VV-Net provides richer representations of point clouds. The voxelization of scanned point clouds faces certain challenges. There is a risk of losing fidelity and fine-grained details at low resolutions, while the computational and memory requirements become excessively demanding at high resolutions. Although there have been efforts to mitigate these problems, such as reducing memory consumption and computation [38, 39], voxel-based representations still struggle to handle large-scale scene segmentation in general. Recently, OctFormer [40] attempted to partition input point clouds into local windows using octrees, and used sparse octree attention to enhance segmentation performance.
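The following sketch illustrates the basic voxelization step shared by these methods, converting a point cloud into a binary occupancy grid with an assumed voxel size; real pipelines such as SEGCloud additionally attach features to voxels and map predicted voxel labels back to points (e.g., by trilinear interpolation).

```python
import numpy as np

def voxelize_occupancy(points, voxel_size=0.05):
    """Convert a point cloud into a binary occupancy grid.

    points : (N, 3) array. Returns (grid, origin), where grid[i, j, k] is True
    if at least one point falls into that voxel of edge length `voxel_size`.
    """
    origin = points.min(axis=0)
    idx = np.floor((points - origin) / voxel_size).astype(int)   # voxel index per point
    dims = idx.max(axis=0) + 1
    grid = np.zeros(dims, dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid, origin
```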

5 Learning-based segmentation with point clouds

Recent studies have explored the direct application of deep learning techniques, as an alternative to multi-view and voxel-based approaches, to raw scanned point clouds. While point cloud data can be obtained directly from scanning devices, their irregular data format presents challenges to traditional CNNs. To address this issue, PointNet [41] emerged as a pioneering approach in point-based learning. One limitation of PointNet as a benchmark for point cloud learning is that it does not exploit the structural information within local neighborhoods of points. In response, subsequent studies have made advances by enhancing the sampling methods and feature extraction techniques. For instance, some works have improved the sampling method by incorporating farthest point sampling [42] or random sampling [43]. These modifications aim to enhance both the feature extraction capabilities and computational efficiency of the network. These advances built upon the foundation of PointNet address its limitations by incorporating structural information from local neighboring points and refining the feature extraction process. Other researchers take an alternative approach to PointNet and design specialized convolution operations on point clouds [44–49]. As a result, these architectures offer improved performance and capabilities for downstream 3D scene understanding tasks such as point cloud semantic segmentation and point cloud instance segmentation. A brief timeline of learning-based point cloud segmentation methods is shown in Fig. 1. These methods include SGPN [50], PAT [51], PointWeb [52], GSPN [53], ASIS [54], KPConv [55], 3D-BoNet [56], 3D-MPA [57], PointGroup [58], MPRM [59], PGCNet [60], WSSPK [61], DyCo3D [62], PSD [63], SSTNet [64], Stratified Transformer [65], HybridCR [66], SoftGroup [67], SegGroup [68], SQN [69], 3D-WSIS [70], and Mask3D [71].

Figure 1 A brief timeline of learning-based point cloud segmentation methods

5.1 Point cloud semantic segmentation

Point cloud semantic segmentation, a fundamental task in 3D indoor scene understanding, aims to partition a scene into multiple subsets by assigning each point a specific category label according to its semantic meaning. Semantic segmentation methods can be categorized according to the amount of supervision they rely on: depending on the availability of annotated data, they can be divided into fully supervised methods and weakly supervised methods.

5.1.1 Fully supervised semantic segmentation

Deep learning-based point cloud semantic segmentation requires large-scale data for training and typically relies on dense annotations. Current fully annotated public datasets make fully supervised point cloud learning possible.

PointNet. Qi et al. [41] introduced the PointNet network architecture. This network comprises three key components: the multi-layer perceptron (MLP) module, the max pooling structure, and the feature fusion structure. The MLP module enables the extraction of point cloud features through weight sharing. The max pooling structure, which employs a symmetric function, selects the maximum feature value within a group of points and serves as the global feature representation. This design addresses the problem of irregularity in the data. The feature fusion structure combines the local features and the global features obtained from the maximum pooling operation. These merged features are utilized as input, and the MLP predicts labels for each point. Moreover, PointNet incorporates the T-Net structure, which facilitates the learning of an efficient rotation matrix. PointNet has demonstrated effectiveness in tasks such as semantic segmentation and object classification, making it a fundamental network architecture in this area.
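The sketch below condenses these key ideas (shared per-point MLPs, symmetric max pooling, and local-global feature fusion) into a toy PyTorch module; the layer widths are illustrative and the T-Net is omitted for brevity, so it is not a faithful reimplementation of PointNet.

```python
import torch
import torch.nn as nn

class TinyPointNetSeg(nn.Module):
    """Simplified PointNet-style segmentation network (T-Net omitted for brevity)."""

    def __init__(self, num_classes, in_dim=3):
        super().__init__()
        # shared per-point MLPs implemented as 1x1 convolutions (weight sharing)
        self.point_mlp = nn.Sequential(
            nn.Conv1d(in_dim, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.ReLU(),
        )
        # per-point classifier applied after fusing local and global features
        self.head = nn.Sequential(
            nn.Conv1d(1024 + 1024, 256, 1), nn.ReLU(),
            nn.Conv1d(256, num_classes, 1),
        )

    def forward(self, xyz):                    # xyz: (B, 3, N)
        local_feat = self.point_mlp(xyz)       # (B, 1024, N) per-point features
        global_feat = local_feat.max(dim=2, keepdim=True).values   # symmetric max pooling
        global_feat = global_feat.expand(-1, -1, xyz.shape[2])     # broadcast to every point
        fused = torch.cat([local_feat, global_feat], dim=1)        # local + global fusion
        return self.head(fused)                # (B, num_classes, N) per-point logits
```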

PointNet++. PointNet++ [42] introduces a set of abstraction structures consisting of sampling layers, grouping layers, and PointNet layers. This hierarchical design enables the extraction of multi-scale features from point clouds. By stacking multiple layers of this feature extraction structure, PointNet++ can be applied for tasks such as point cloud classification and segmentation.
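A core building block of the sampling layer is farthest point sampling, which selects well-spread centroids for the grouping layer; a minimal NumPy sketch is given below.

```python
import numpy as np

def farthest_point_sampling(points, n_samples):
    """Iteratively pick the point farthest from the already-selected set.

    points : (N, 3) array. Returns indices of `n_samples` well-spread points,
    as used by set abstraction layers to choose region centroids.
    """
    n = len(points)
    selected = np.zeros(n_samples, dtype=int)
    dist = np.full(n, np.inf)
    selected[0] = 0                             # arbitrary first seed
    for i in range(1, n_samples):
        # update each point's distance to its nearest selected centroid
        dist = np.minimum(dist, np.linalg.norm(points - points[selected[i - 1]], axis=1))
        selected[i] = int(np.argmax(dist))      # farthest remaining point
    return selected
```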

PointCNN. PointCNN [44] transforms the input points into a latent representation. This transformation, known as x-conv, is implemented using MLPs. This transformation allows for the application of traditional convolution, which is effective in capturing local and global patterns in regular data domains.

GCN-based methods. Recent studies have explored the application of a graph convolutional network (GCN) to point clouds, recognizing that the points and their neighboring points can form a graph structure [25, 46, 72]. The objective is to extract local geometric structure information while preserving permutation invariance. This is achieved by constructing a spatial or spectral adjacency graph using the features of vertices and their neighbors. DGCNN [46] employs MLPs to aggregate edge features, which consist of nodes and their spatial neighbors. The features of the nodes are then updated based on the edge features. RGCNN [45] considers the features of data points in a point cloud as graph signals and uses spectral-based graph convolution for point cloud classification and segmentation. The spectral-based graph convolution operation is defined using the approximation of the Chebyshev polynomial. Furthermore, the Laplacian matrix of the graph is updated at each layer of the network based on the learned depth features. This allows for the extraction of local structural information while accounting for the unordered nature of the data. DGCNN and RGCNN demonstrate different approaches to the use of GCN. DGCNN focuses on edge feature aggregation and node feature updates, while RGCNN uses spectral-based graph convolution and updates the Laplacian matrix based on learned depth features. SPG [25] is a deep learning framework specifically designed for the task of semantic segmentation in large-scale point clouds with millions of points. The framework introduces the concept of a superpoint graph (SPG), which effectively captures the inherent organization of 3D point clouds. By dividing the scanned scene into geometrically homogeneous elements, SPG provides a compact representation that captures the contextual relationships between different object parts within the point cloud. Leveraging this rich representation, GCN is employed to learn and infer semantic segmentation labels. The combination of the SPG structure and GCN enables the capture of contextual relationships, resulting in accurate semantic segmentation of complex and voluminous point cloud data. PointWeb [52] designs an adaptive feature extraction module to find the interaction between densely connected neighbors. Unlike most point-based deep learning methods, PGCNet [60] incorporates geometric information as a prior and uses surface patches for data representation. The idea behind this method is that man-made objects can be decomposed into a set of geometric primitives. The PGCNet framework first extracts surface patches from indoor scene point clouds using the region growing method. With surface patches and their geometric features as input, a GCN-based network is designed to explore patch features and contextual information. Specifically, a dynamic graph U-Net module, which employs dynamic edge convolution, aggregates hierarchical feature embeddings. Taking advantage of the surface patch representation, PGCNet can achieve competitive semantic segmentation performance with much less training.
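To illustrate the graph-based feature aggregation used by DGCNN-style methods, the sketch below implements a single edge convolution layer in PyTorch: edge features [x_i, x_j − x_i] over a k-NN graph are transformed by an MLP and max-aggregated; the layer sizes and neighborhood count are illustrative assumptions.

```python
import torch
import torch.nn as nn

def knn_indices(x, k):
    """x: (B, N, C) point features. Returns (B, N, k) indices of nearest neighbors."""
    dist = torch.cdist(x, x)                    # pairwise distances in feature space
    return dist.topk(k + 1, largest=False).indices[..., 1:]   # drop self

class EdgeConv(nn.Module):
    """DGCNN-style edge convolution: h([x_i, x_j - x_i]) max-aggregated over neighbors."""

    def __init__(self, in_dim, out_dim, k=16):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(nn.Linear(2 * in_dim, out_dim), nn.ReLU())

    def forward(self, x):                       # x: (B, N, C)
        idx = knn_indices(x, self.k)            # (B, N, k)
        neighbors = torch.gather(
            x.unsqueeze(1).expand(-1, x.shape[1], -1, -1), 2,
            idx.unsqueeze(-1).expand(-1, -1, -1, x.shape[2]))      # (B, N, k, C)
        center = x.unsqueeze(2).expand_as(neighbors)
        edge = torch.cat([center, neighbors - center], dim=-1)     # edge features
        return self.mlp(edge).max(dim=2).values                    # (B, N, out_dim)
```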

Transformer-based methods. The Transformer technique has revolutionized natural language processing (NLP) and 2D vision [73, 74], inspiring the application of attention-based networks in 3D space. PCT [47] extracts point cloud features through the attention mechanism. It addresses the unordered nature of point clouds by merging the spatial position encoding and the input embedding to represent each point. PAT [51] uses a parameter-efficient group self-attention operation and Gumbel–Softmax-based sampling to replace multi-head self-attention and farthest point sampling. Point Transformer [48] directly incorporates local attention between each point and its neighboring points, effectively addressing the memory cost problem. Point Transformer V2 [49] further improves the previous version by replacing the original attention with group vector attention based on grouped weight encoding. Stratified Transformer [65] improves long-range context capture through stratified sampling. Refer to Ref. [48] for an illustration of the Transformer-based structure. All of these Transformer-based networks can serve as powerful backbones for various point cloud understanding tasks.
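The sketch below conveys the flavor of local attention on point clouds: each point attends to its k nearest neighbors with a learned relative positional encoding. For brevity it uses scalar dot-product attention, whereas Point Transformer employs vector attention, so this is a simplified illustration rather than a reimplementation of any cited method.

```python
import torch
import torch.nn as nn

class LocalPointAttention(nn.Module):
    """Minimal local self-attention: each point attends only to its k nearest neighbors."""

    def __init__(self, dim, k=16):
        super().__init__()
        self.k = k
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.pos = nn.Linear(3, dim)            # simple learned relative positional encoding

    def forward(self, xyz, feat):               # xyz: (B, N, 3), feat: (B, N, C)
        idx = torch.cdist(xyz, xyz).topk(self.k, largest=False).indices    # (B, N, k)
        gather = lambda t: torch.gather(
            t.unsqueeze(1).expand(-1, t.shape[1], -1, -1), 2,
            idx.unsqueeze(-1).expand(-1, -1, -1, t.shape[-1]))
        q = self.to_q(feat).unsqueeze(2)                       # (B, N, 1, C)
        k = gather(self.to_k(feat))                            # (B, N, k, C)
        v = gather(self.to_v(feat)) + self.pos(gather(xyz) - xyz.unsqueeze(2))
        attn = torch.softmax((q * k).sum(-1) / k.shape[-1] ** 0.5, dim=-1)  # (B, N, k)
        return (attn.unsqueeze(-1) * v).sum(dim=2)             # (B, N, C)
```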

In recent years, Transformer-based backbones have proven more effective at exploiting features than other structures, albeit at increased computational cost. The identification of efficient and powerful learning networks for point cloud semantic segmentation therefore remains worthy of further exploration.

5.1.2 Weakly supervised semantic segmentation

Despite tremendous progress, there are still limitations to the widespread adoption of fully supervised semantic segmentation methods. Extensive and precise annotations are required for fully supervised training, while current point cloud data are still scarce and difficult to annotate. To cope with limited annotated data, researchers have explored weakly supervised learning for semantic segmentation.

One strategy is to utilize only a small fraction of labeled points. Xu and Lee [75] proposed a weakly supervised network for semantic point cloud segmentation. This is achieved by three carefully-designed branches. The Siamese branch enhances the training samples by encouraging consistency between the original predictions and corresponding augmented predictions. The inexact branch suppresses the activation of negative categories for any given point. Using spatial and color constraints, the smooth branch ensures that spatially connected points with similar colors have the same labels. SQN [69] encodes a set of hierarchical latent representations and retrieves subsets according to spatial positions. These representations are fed into MLPs to predict semantic labels. DAT [76] incorporates dual adaptive transformations using an adversarial strategy to leverage local consistency and structural smoothness. These methods require adaptive and high-quality sampling, which is difficult when the input data scales vary.
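The common ingredients of these sparse-label methods, partial supervision on the few labeled points plus a consistency term between original and augmented predictions, can be sketched as a simple loss function; this is an illustrative simplification rather than the exact loss of any cited work.

```python
import torch
import torch.nn.functional as F

def weak_supervision_loss(logits, logits_aug, labels, labeled_mask):
    """Combine partial supervision with a Siamese-style consistency term.

    logits, logits_aug : (N, C) predictions for the original and augmented clouds
    labels             : (N,) class ids (ignored where labeled_mask is False)
    labeled_mask       : (N,) bool, True for the small fraction of annotated points
    """
    # cross-entropy only on the sparse labeled points
    ce = F.cross_entropy(logits[labeled_mask], labels[labeled_mask])
    # encourage the two branches to agree on every point (labeled or not)
    consistency = F.kl_div(
        F.log_softmax(logits_aug, dim=-1),
        F.softmax(logits, dim=-1).detach(),
        reduction="batchmean",
    )
    return ce + consistency
```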

Another line of methods groups input point clouds into sub-clouds or superpoints. Based on a classification network trained with sub-cloud-level labels, Wei et al. [59] presented a multi-path region mining network to produce point-level pseudo labels for fully supervised training. By constructing an SPG, SSPC-Net [29] achieves point cloud semantic segmentation through a semi-supervised graph neural network (GNN). The features of superpoints are extracted as input for the generation and propagation of pseudo labels, and a coupled attention mechanism is employed to enhance the extraction of discriminative contextual features. Deng et al. [27] used superpoints as both a constraint and a guide for pseudo-label propagation. Their framework consists of a superpoint generation module, a pseudo-label optimization module that identifies pseudo labels with low confidence, a superpoint feature aggregation module for feature extraction, and an edge prediction module for edge constraints. Refer to Ref. [27] for an illustration of the superpoint-based structure. Although these methods achieve segmentation with far fewer annotations, they depend heavily on the grouping quality.

As an alternative to the above methods, learning from unlabeled points can act as a pre-training pretext. Drawing inspiration from recent developments in contrastive learning for self-supervised tasks, Jiang et al. [77] introduced guided point contrastive learning, which improves feature representation in a semi-supervised network. Augmented point clouds generated from the input point clouds are fed into an unsupervised branch for backbone network training, and the backbone network, classifier, and projector are shared with the supervised branch to produce semantic scores. By incorporating self-supervised learning, Zhang et al. [61] proposed a two-component network for weakly supervised point cloud semantic segmentation. Prior knowledge is learned from large-scale unlabeled points via a self-supervised network and, together with a sparse label propagation mechanism, is transferred to a weakly supervised network for label prediction. Zhang et al. [63] proposed a perturbed self-distillation framework for point cloud semantic segmentation. The core of this framework is to maintain consistency between the perturbed branch and the original branch, bridging the information between labeled and unlabeled data; the consistency constraints establish a graph topology among all the points. In addition, the semantic context of the labeled points is used to supervise the overall understanding of the point clouds. One Thing One Click [30] performs semantic segmentation with only one annotated point per object. A self-training approach with label propagation is integrated into this framework: with such sparse supervision, semantic and geometric similarities are learned to generate and update pseudo labels. HybridCR [66] uses hybrid contrastive regularization to find similarities in local neighborhoods and global contexts. PointMatch [78] learns consistent representations from sparse annotations by improving the quality of pseudo labels, which is achieved by introducing superpoint information. Recently, Liu et al. [79] combined active learning with self-training, selecting the points to be annotated to enhance segmentation performance. While pre-training is promising, it still requires large amounts of data, and fine-tuning models from other tasks can be difficult.

5.2 Point cloud instance segmentation

Instance segmentation involves identifying and labeling each object as a separate instance. Image-based instance segmentation can be divided into two distinct categories: detection-based methods and detection-free methods. Detection-based approaches first predict the localization of each object to generate proposals and then obtain per-pixel instance masks [8]. For instance, YOLO [80] predicts semantic classes and target bounding boxes for different image grids to complete image segmentation. Detection-free methods rely on semantic segmentation results and then use clustering techniques to obtain instance labels. In particular, PFN [81] designs a framework that trains three sub-tasks, i.e., semantic segmentation, instance location, and instance count for each category; the final instance-level segmentation results are obtained by clustering. With the rise of deep learning on 3D data and the availability of large-scale annotated point cloud datasets, deep learning-based segmentation of 3D point cloud instances has received increasing attention.

5.2.1 Fully supervised instance segmentation

Fully supervised point cloud instance segmentation requires point-level instance labels. Similar to image-based instance segmentation, fully supervised methods can also be categorized into detection-based and detection-free methods. Refer to Refs. [53] and [50] for details.

Detection-based methods first predict 3D bounding boxes and then produce point-level instance masks. GSPN [53] adopts an analysis-by-synthesis strategy and produces object proposals. A region-based PointNet is designed to refine proposals and generate instance segmentation. 3D-BoNet [56] performs end-to-end regression of 3D bounding boxes and predicts point-level masks for all instances. It consists of a backbone network, followed by two parallel branches. One branch is dedicated to bounding box regression, while the other branch focuses on point mask prediction. GICN [82] approximates the instance center of each object as a Gaussian distribution. This Gaussian distribution is then sampled to generate candidates, which are subsequently used to generate corresponding bounding boxes and instance masks.

Detection-free methods first predict point-level semantic labels and then group points into instances. SGPN [50] is an early deep learning framework for point cloud instance segmentation. Using PointNet++ as the backbone, SGPN predicts group proposals based on a similarity matrix. ASIS [54] produces point-level instance labels via joint training with semantic supervision. Similarly, JSNet [83] and JSIS3D [84] also benefit from training instance and semantic segmentation simultaneously. Liang et al. [85] used a GNN with attention-based neighbor search to obtain discriminative features under both semantic and instance supervision; mean-shift post-processing was then employed to cluster embeddings for the final predictions. SoftGroup [67] is a two-step framework that consists of bottom-up grouping and top-down refinement. Given the input point clouds, a soft grouping module produces instance proposals on the basis of semantic scores and offset vectors. While most detection-free methods require post-processing, such as center voting or non-maximum suppression, Mask3D [71] utilizes a Transformer-based module to directly predict instance masks. Semantic and geometric information is encoded into point features through a stacked Transformer decoder, which provides an instance heatmap that indicates the similarities among the point clouds. Recently, SPFormer [86] was developed to directly predict instances in an end-to-end manner based on superpoint cross-attention; superpoint features are aggregated from point clouds and used as input to the Transformer decoder.
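The bottom-up grouping step shared by many detection-free methods can be sketched as clustering offset-shifted points per semantic class; the choice of DBSCAN and its parameters here are illustrative assumptions rather than the exact procedure of any cited method.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def group_instances(xyz, offsets, sem_labels, eps=0.05, min_points=50):
    """Cluster offset-shifted points into instances, separately per semantic class.

    xyz        : (N, 3) point coordinates
    offsets    : (N, 3) predicted vectors pointing towards instance centers
    sem_labels : (N,) predicted semantic class per point
    Returns an (N,) array of instance ids (-1 for unclustered points).
    """
    shifted = xyz + offsets                     # points contract towards their centers
    instance_ids = -np.ones(len(xyz), dtype=int)
    next_id = 0
    for cls in np.unique(sem_labels):
        mask = sem_labels == cls
        clusters = DBSCAN(eps=eps, min_samples=min_points).fit_predict(shifted[mask])
        clusters[clusters >= 0] += next_id      # shift cluster ids into a global range
        instance_ids[mask] = clusters
        next_id = instance_ids.max() + 1
    return instance_ids
```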

In recent years, detection-based methods, which attempt to perform the instance segmentation task by a separate detection step, have received less attention than detection-free approaches that aim for an end-to-end solution. Moreover, different backbones with varying levels of annotation have been explored. However, the accuracy of instance segmentation is still low, and the generality of existing methods lacks strong empirical evidence.

5.2.2 Weakly supervised instance segmentation

While fully supervised point cloud instance segmentation suffers from performance degradation when dense annotations are unavailable, weakly supervised frameworks attempt to classify points into objects with only a small number of labels.

Liao et al. [87] proposed a semi-supervised framework for point cloud instance segmentation that uses bounding boxes for supervision. The input point clouds were decomposed into subsets by bounding box proposals, and semantic information and consistency constraints were used to produce the final instance masks. Hou et al. [88] designed a pre-training method that could gracefully fine-tune an instance segmentation network. To further enhance feature exploitation, spatial information was integrated into contrastive pre-training. Tang et al. [70] grouped point clouds into superpoints and explored inter-superpoint spatial and semantic relationships. The final instance segmentation was completed by clustering with volume constraints. To address the issue of ambiguous labels caused by intersections among bounding boxes, WISGP [31] divided points into two distinct sets: a univocal set of points with clear instance labels and an equivocal set of points with uncertain instance assignments. Geometric representations such as polygon meshes and superpoints were employed to propagate univocal labels to connected equivocal points, pseudo labels were assigned to the remaining equivocal points by an instance segmentation network, and the model was retrained with all labeled points to produce the final instance segmentation results. One Thing One Click++ [89] expanded the earlier self-training framework to weakly supervised 3D instance segmentation. A 3D U-Net and a Relation Net were employed to aggregate features and learn pairwise similarities, and the initial pseudo labels generated from the annotations were iteratively updated to refine the final outputs. To further alleviate the dependency on annotations, FreePoint [90] explored unsupervised point cloud instance segmentation. A multi-cut algorithm was used to group point clouds into coarse instance masks based on point features consisting of coordinates, colors, normals, and self-supervised deep features. This grouping generated pseudo labels for weakly supervised network training, and the framework can also be integrated into supervised instance segmentation as an unsupervised pre-training pretext. The aforementioned weakly supervised methods have achieved significant improvements in recent years, but they still have difficulty handling unbalanced data.

6 Learning-based segmentation with multi-modality

Recent advances in foundation models for 2D vision and NLP have inspired the exploration of multi-modality methods in 3D models [12, 91–98]. For instance, Peng et al. [97] proposed a zero-shot approach that co-embeds point features with images and text. Rozenberszki et al. [12] presented a language-grounded method that discovers the joint embedding space of point features and text features. Liu et al. [91] transferred knowledge from 2D to 3D for part segmentation. Wang et al. [92] trained a multi-modal model that learns from vision, language, and geometry to improve 3D semantic scene understanding. Xue et al. [93] introduced a unified representation of images, text, and 3D point clouds by aligning them during pre-training. Ding et al. [94] distilled knowledge from vision-language models for 3D scene understanding tasks. Zeng et al. [95] aligned 3D representations to open-world vocabularies via a cross-modal contrastive objective. Zhang et al. [98] performed text-scene paired semantic understanding with language-assisted learning. These methods utilize rich information from vision and text, enabling a more comprehensive representation of the indoor scene. However, they require substantial computational resources, and pre-training depends heavily on limited multi-modal datasets. How to facilitate and adapt multi-modalities with point clouds for better scene understanding is worth exploring.

7 Performance evaluation

7.1 Evaluation metrics

The widely adopted evaluation metrics for indoor point cloud semantic segmentation include overall accuracy (OA), mean intersection over union (mIoU), and mean accuracy (mAcc).
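A minimal sketch of how these semantic segmentation metrics are typically computed from a confusion matrix is given below; for brevity it averages over all classes, whereas benchmark implementations usually exclude classes absent from the ground truth.

```python
import numpy as np

def semantic_metrics(pred, gt, num_classes):
    """Compute OA, mAcc, and mIoU from per-point predictions and ground truth.

    pred, gt : (N,) integer class ids.
    """
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (gt, pred), 1)              # conf[i, j]: gt class i predicted as j
    tp = np.diag(conf).astype(float)
    oa = tp.sum() / conf.sum()                                   # overall accuracy
    acc_per_class = tp / np.maximum(conf.sum(axis=1), 1)         # per-class accuracy
    iou_per_class = tp / np.maximum(conf.sum(axis=1) + conf.sum(axis=0) - tp, 1)
    return oa, acc_per_class.mean(), iou_per_class.mean()
```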

The standard evaluation metric for indoor point cloud instance segmentation is the mean average precision (mAP) averaged over IoU thresholds from 0.5 to 0.95. In particular, mAP@50 is the AP score at an IoU threshold of 0.5. Additionally, mean precision (mPrec) and mean recall (mRec) are frequently used criteria on the S3DIS dataset.

7.2 Results on public datasets

Semantic segmentation results. Tables 1 and 2 present the indoor point cloud semantic segmentation results of different methods on S3DIS Area 5 and ScanNet v2, respectively. We can observe that the state-of-the-art methods outperform the pioneering work of PointNet [41] by more than 20% mIoU. Transformer-based methods [48, 49, 65] have become dominant in recent years, following their great success in NLP and image understanding. Meanwhile, several weakly supervised methods show the possibility of achieving semantic segmentation with less data, reaching more than 65% mIoU on S3DIS Area 5 and 70% on ScanNet. These results are encouraging, although there is still a gap between fully supervised and weakly supervised approaches. It is desirable to further improve the ability to extract features from limited annotated data.

Table 1 Semantic segmentation results of different methods on S3DIS Area 5
Table 2 Semantic segmentation results of different methods on ScanNet v2 validation set and test set

Instance segmentation results. Tables 3 and 4 present indoor point cloud instance segmentation results of different methods on S3DIS Area 5 and ScanNet v2, respectively. Detection-free methods have received more attention than detection-based methods because they attempt to complete the instance segmentation task in an end-to-end manner. Several networks [31, 68, 70, 87, 89] have started to learn instance information from limited annotation. These results clearly show that there is still room for improvement in point cloud instance segmentation using weakly supervised learning.

Table 3 Instance segmentation results of different methods on S3DIS Area 5
Table 4 Instance segmentation results of different methods on ScanNet v2 validation set and test set

8 Discussion

Point cloud segmentation is a crucial task in 3D indoor scene understanding. With the availability of 3D datasets, deep learning-based segmentation methods have gained significant attention and have contributed to the progress in this field. However, obtaining accurate segmentation results often requires dense annotations, which is a laborious and costly process. In order to mitigate the reliance on extensive annotations and enable learning from limited labeled data, the research focus has shifted towards weakly supervised approaches in recent years. By exploring weakly supervised frameworks, researchers aim to achieve satisfactory segmentation results while minimizing the annotation efforts and costs involved. Despite the rapid development of point cloud segmentation, existing frameworks still face several challenges.

8.1 Datasets and representations

The size of annotated point cloud data is still limited compared to that of image datasets. Although acquiring point clouds has become affordable, annotating them is still time-consuming. Since both fully supervised learning and pre-training [99, 100] require large amounts of data, larger datasets with more diverse scenes are desired to advance learning-based point cloud segmentation. Therefore, an efficient and user-friendly annotation method for large datasets is needed; this might be achieved by unsupervised approaches with geometric priors. Recently developed datasets, such as ScanNet200 [12], have drawn increasing attention to imbalanced learning [101] in point cloud segmentation.

Existing point cloud segmentation methods use different data formats, including point clouds, RGB-D images, voxels, and geometric primitives. Each data format has its advantages and drawbacks in different 3D scene understanding tasks. With point-based networks, we can now directly process point clouds for training and inference. Obviously, not all points are needed for scene perception. For indoor scene point cloud data, finding a better representation is still a promising research direction.

8.2 Data efficiency and multi-modality

Data-efficient learning frameworks are highly desirable because they alleviate the burden of collecting extensive dense annotations for training. Although current weakly supervised point cloud segmentation methods can achieve performance competitive with fully supervised learning, there are still gaps to be filled. More importantly, the generality and robustness of these data-efficient methods are not yet convincing, as they are mainly evaluated on public datasets of limited size rather than on open-world scenes. Therefore, further exploration of generalist models is a trend for the future.

One promising route is to integrate other modalities, such as images and natural language. Previous works [37, 102, 103] have explored combining 2D images and 3D point clouds for a better understanding of scenes. Recent developments in foundation models for 2D vision and NLP have served as sources of inspiration for investigating multi-modalities in 3D data [12, 91–98]. While these methods achieve impressive results in different 3D tasks, adapting knowledge from other modalities to indoor point cloud segmentation is still challenging. In addition, collecting adequate multi-modal pre-training data can be costly. How to facilitate and adapt multi-modalities with point clouds for a better understanding of indoor scenes is worth exploring.

8.3 Learning methods for dynamic scene segmentation

Current learning-based indoor point cloud segmentation methods are mostly designed for static scenes. In real-world scenarios, however, indoor objects can be moved around, which static segmentation cannot capture. Moreover, annotating such dynamic scenes is even more costly than annotating static 3D point clouds. 4D representation learning has become the core of dynamic feature exploitation. Recent work [104, 105] has explored 4D feature extraction and distillation to improve downstream tasks such as scene segmentation. Transferring such information to varying scales of indoor scenes is still challenging. The development of learning methods for dynamic scene segmentation is an interesting prospect for further investigation.

9 Conclusion

Point cloud segmentation plays a key role in 3D vision and intelligence. This paper aims to provide a concise overview of point cloud segmentation techniques for understanding 3D indoor scenes. First, we present public 3D point cloud datasets, which are the foundation of point cloud segmentation research, especially for deep learning-based methods. Second, we review representative approaches for indoor scene point cloud segmentation, including geometry-based and learning-based methods. Geometry-based approaches extract geometric information and can be combined with learning-based methods. Learning-based methods can be divided into structured data-based and point-based methods. We mainly consider point-based semantic and instance segmentation frameworks, including fully supervised networks and weakly supervised networks. Finally, we discuss the open problems in the field and outline future research directions. We expect that this review can provide insights into the field of indoor scene point cloud segmentation and stimulate new research.