1 Introduction

Extracting 3D skeletal structures from tree point clouds remains a significant challenge in computer graphics and computer vision. These skeletal structures are crucial for digital tree modelling, with diverse applications including biomass estimation [1,2,3], growth modelling [4,5,6], forestry management [7,8,9], urban microclimate simulations [10], and agri-tech applications, such as robotic pruning [11, 12] and fruit picking [13]. Cardenas et al. [14] recently surveyed existing 3D tree skeletonization methods, classifying them into three categories: thinning, clustering, and spanning tree refinement.

Thinning methods [15,16,17,18,19,20] operate by contracting the surface points onto the medial axis. Following this contraction, a simplification procedure is applied to extract the skeletal structure. While thinning methods are effective when the input point cloud is adequately sampled, they struggle with noise and occlusion.

Clustering methods [21,22,23] seek to group points into bins that share the same branch cross-section. These groups are formed using a neighbourhood function, typically implemented with k-nearest neighbours (KNN) within a specified search radius. However, when neighbouring branches lie close together, clusters can form across them and produce erroneous connections. Furthermore, if the value of k or the search radius is too small, the resulting skeleton can be disconnected in multiple places.

Spanning tree refinement methods [24,25,26] work by connecting neighbouring surface points and producing a spanning tree, which is then refined through global and local optimizations to remove noisy branches and refine existing estimates. Because the initial skeleton is constructed on the surface points rather than on the medial axis, retrieving a geometrically accurate skeleton requires difficult optimizations.

A deep-learning-based skeletonization was proposed in TreePartNet [27]. This method uses two networks, one to detect branching points and another to detect cylindrical representations. It requires a sufficiently sampled point cloud as it relies on the ability to detect junctions accurately and embed local point groups. In our prior work, we proposed Smart-Tree [28], which utilizes a sparse sub-manifold CNN [29,30,31] to predict the position of the medial axis accurately.

In tree point cloud skeletonization, challenges arise from subtractive and additive noise sources. Subtractive noise results from factors such as self-occlusions and reconstruction inaccuracies, while additive noise can be attributed to sensor noise, calibration errors, poor illumination, depth discontinuities, atmospheric conditions, and movement in the branches and leaves [32].

This research presents a method for evaluating skeletonization algorithms using synthetic point clouds. 3D Perlin noise is employed to simulate subtractive noise, while Gaussian noise is used to emulate additive noise. By applying noise to these synthetic point clouds, we offer a controlled environment for assessing the robustness of skeletonization algorithms. The primary contributions of this work include:

  • The introduction of a labelled synthetic tree point cloud dataset.

  • The development of a method to systematically evaluate skeletonization algorithms across distinct noise levels.

  • An enhanced assessment for our Smart-Tree algorithm.

  • Empirical evidence highlighting the efficacy of a learned approach in approximating the medial axis.

2 Datasets

To facilitate quantitative evaluation, it is essential to have a point cloud dataset with ground truth skeletons. However, manual annotation is challenging and labour-intensive. To address this, we have developed a synthetic dataset featuring ten diverse tree species. This is an improvement over our initial dataset, which contained only six species [28]. Additionally, we have conducted a qualitative assessment using real-world data. Our future work will prioritize efficient real-world data annotation.

2.1 Real-world dataset

We tested our method on a tree in the Christchurch Botanic Gardens, New Zealand. To generate our 3D reconstructions, we utilize a NeRF [33] framework that learns an implicit representation of a scene light field from a set of input views. For our objectives, we extend NeRF to produce explicit 3D point clouds. Notably, this NeRF reconstruction approach is an excellent choice for reconstructing tree structures, as it can effectively use many images. Consequently, it can produce accurate reconstructions without relying on commonly used constraints that are ill-fitted to retaining high-frequency structures. This provides an advantage over traditional multi-view stereo approaches, which struggle with thin structures, such as twigs and leaves [34].

To apply our method to this data, we train our network on our synthetic dataset to segment away leaves. We then apply our skeletonization algorithm to the remaining points.

2.2 Synthetic dataset

Our synthetic tree models were created using SpeedTree [35]. This dataset encompasses trees of diverse shapes, sizes, and complexities. It includes ten distinct tree species from the SpeedTree Cinema library, as shown in Fig. 1. There are twenty unique variations of each species, resulting in a total of 200 tree mesh models. Table 1 provides the statistics of the synthetic dataset. To transform the meshes into point clouds, the following steps are applied:

  1. Twigs and leaves are removed. Twigs are removed by eliminating branches with an initial radius smaller than 2 cm or a length under 8 cm.

  2. Branch meshes intersecting other branch meshes were removed, as well as the predecessors.

  3. The mesh was point sampled at a rate of 1 point per square centimetre.
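The sampling step can be illustrated with a minimal sketch. It assumes the mesh is a list of vertices in metres plus triangular faces, and it uses area-weighted uniform sampling with stochastic rounding; the function names (`triangle_area`, `sample_triangle`, `sample_mesh`) and the unit convention are illustrative assumptions, not details taken from the dataset pipeline.

```python
import math
import random


def triangle_area(a, b, c):
    """Area of a 3D triangle: half the norm of the cross product of two edges."""
    ux, uy, uz = b[0] - a[0], b[1] - a[1], b[2] - a[2]
    vx, vy, vz = c[0] - a[0], c[1] - a[1], c[2] - a[2]
    cx = uy * vz - uz * vy
    cy = uz * vx - ux * vz
    cz = ux * vy - uy * vx
    return 0.5 * math.sqrt(cx * cx + cy * cy + cz * cz)


def sample_triangle(a, b, c, rng):
    """Uniform random point inside a triangle (square-root barycentric trick)."""
    s = math.sqrt(rng.random())
    r2 = rng.random()
    w0, w1, w2 = 1.0 - s, s * (1.0 - r2), s * r2
    return tuple(w0 * a[i] + w1 * b[i] + w2 * c[i] for i in range(3))


def sample_mesh(vertices, faces, points_per_cm2=1.0, seed=0):
    """Sample each face in proportion to its area (vertices assumed in metres)."""
    rng = random.Random(seed)
    points = []
    for i, j, k in faces:
        a, b, c = vertices[i], vertices[j], vertices[k]
        area_cm2 = triangle_area(a, b, c) * 1e4  # square metres to square centimetres
        expected = area_cm2 * points_per_cm2
        n = int(expected)
        # Stochastic rounding keeps the average density unbiased for small faces.
        if rng.random() < expected - n:
            n += 1
        points.extend(sample_triangle(a, b, c, rng) for _ in range(n))
    return points
```

For example, a 10 cm by 10 cm square split into two triangles yields roughly 100 points at the default density.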

Fig. 1

SpeedTree models: a Apple, b Tibetan Cherry, c Chinaberry, d Dracaena, e Ginkgo Biloba, f London Plane, g Japanese Maple, h Scots Pine, i Colorado Blue Spruce, j Walnut Sapling

Table 1 Summary of synthetic tree point cloud dataset

3 Methods

Our skeletonization method comprises several stages as shown in Fig. 2. We use labelled synthetic point clouds to train a sparse convolutional neural network to predict each input point’s radius and direction toward the medial axis (ground truth skeleton). Using the radius and medial direction predictions, we map surface points to the estimated medial axis positions and construct a constrained neighbourhood graph. This frequently results in multiple connected subgraphs due to gaps from self-occlusion by branches and leaves. We process each sub-graph independently using our subgraph algorithm, which employs a greedy approach to find paths from the root to terminal points. The skeletal structures from each subgraph are combined to form the final skeleton.

The neural network predictions help to avoid ambiguities with unknown branch radii and separate points that would be close in proximity but from different branches.
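The mapping from surface points to estimated medial axis positions can be sketched as follows. The sketch assumes each point comes with a predicted logarithmic radius and a predicted direction, and moves the point by one radius along the unit direction; the function name `project_to_medial_axis` and the list-of-tuples data layout are hypothetical.

```python
import math


def project_to_medial_axis(points, log_radii, directions):
    """Shift each surface point by its predicted radius along its predicted direction."""
    medial = []
    for p, log_r, d in zip(points, log_radii, directions):
        radius = math.exp(log_r)  # undo the logarithm applied to the radius
        norm = math.sqrt(sum(c * c for c in d)) or 1.0  # guard against a zero vector
        medial.append(tuple(p[i] + radius * d[i] / norm for i in range(3)))
    return medial
```

A point on the surface of a 5 cm radius branch whose direction points at the axis lands on the axis itself, so points from opposite sides of one branch collapse together while points from a nearby branch move away from them.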

Fig. 2

Overall pipeline methodology

3.1 Neural network

Our network takes an input set of N arbitrary points \(\left\{ P_i | i = 1,\ldots , N \right\} \), where each point \(P_i\) is a vector of its (xyz) coordinates plus additional features such as colour (rgb). The points are voxelized at a resolution of 1 cm. For each voxelized point, the network then learns an associated radius \(\left\{ R_i | i = 1,\ldots , N \right\} \), where \(R_i\) is the corresponding branch radius, and a direction vector \(\left\{ D_i | i = 1,\ldots , N \right\} \), where \(D_i\) is a normalized direction vector pointing towards the medial axis.
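As a rough illustration of the voxelization step, the sketch below bins points into 1 cm voxels and averages the features of points that share a voxel. The averaging rule and the name `voxelize` are assumptions made for the example; the text does not state how features within a voxel are combined.

```python
import math
from collections import defaultdict


def voxelize(points, features, voxel_size=0.01):
    """Group points by integer voxel index and average their features per voxel."""
    buckets = defaultdict(list)
    for p, f in zip(points, features):
        key = tuple(math.floor(c / voxel_size) for c in p)
        buckets[key].append(f)
    coords = sorted(buckets)
    # zip(*...) transposes the feature tuples so each column is averaged separately.
    feats = [tuple(sum(col) / len(col) for col in zip(*buckets[k])) for k in coords]
    return coords, feats
```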

The network is implemented as a submanifold sparse CNN using SpConv [36] and PyTorch [37]. We use regular sparse convolutions on the encoder blocks for a wider exchange of features and submanifold convolutions elsewhere for more efficient computation due to avoiding feature dilation. The encoder blocks use a stride of 2. The encoder and decoder blocks use a kernel size of \(2\times 2\times 2\) whereas the other convolutions use a kernel size of \(3\times 3\times 3\) except for the first sub-manifold convolution, which uses a kernel size of \(1\times 1\times 1\).

The architecture comprises a U-Net backbone [38] with residual connections [39], followed by two smaller fully connected networks to extract the radii and directions. The U-Net architecture, with its feature extraction and precise localization capabilities, is well-suited for predicting radius and direction. Its structure combines a contracting path for feature extraction and an expansive path for localization, enhanced by skip connections that preserve detail. This makes U-Net highly effective for tasks requiring accurate interpretation of complex shapes. Its success in medical imaging [40] demonstrates its proficiency in accurately extracting spatial features, aligning with our project’s needs. The Residual Block in our network architecture consists of a convolutional branch with two submanifold convolution layers and an identity branch. The identity branch is activated when the input and output channels are the same, facilitating direct feature transfer. A ReLU activation function and batch normalization follow each convolutional layer. When branch-foliage segmentation is required, a fully connected class block is added, which has a final softmax activation layer.

A high-level overview of the network architecture is shown in Fig. 3.

Fig. 3

Network architecture diagram

A block sampling scheme ensures the network can process larger trees. During training, for each point cloud, we randomly sample (at each epoch) a \(4\,m^{3}\) block and mask the outer regions of the block to avoid inaccurate predictions from the edges. Apart from this, we also employ various other augmentations to improve the generalizability and robustness of the network. Specifically:

  • Scale Augmentation: We randomly scale the input point cloud within the range of 0.9–1.1 to introduce variations in the size.

  • Point Dropout: With a probability of \(0.2\), certain points from the input are randomly dropped out. This introduces sparsity in the data and tests the resilience of the network.

  • Gaussian Noise: Random Gaussian noise with a mean of \(0.0\) and a standard deviation of \(1.0\) is added to the input point cloud. The probability of this noise application is \(1.0\), and its magnitude is \(0.01\). This improves the network’s ability to handle noisy inputs.
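The three augmentations can be sketched in a few lines. This is one possible reading of the settings above: the scale factor is drawn once per point cloud, each point is dropped independently with probability 0.2, and the noise is a standard normal sample multiplied by the magnitude 0.01. The function name `augment` and these interpretations are assumptions.

```python
import random


def augment(points, rng, scale_range=(0.9, 1.1), drop_prob=0.2, noise_magnitude=0.01):
    """Apply scale, point dropout, and Gaussian jitter to a list of (x, y, z) points."""
    scale = rng.uniform(*scale_range)  # one scale factor for the whole cloud
    augmented = []
    for p in points:
        if rng.random() < drop_prob:  # drop this point
            continue
        # Standard normal sample (mean 0.0, std 1.0) scaled by the noise magnitude.
        augmented.append(tuple(scale * c + noise_magnitude * rng.gauss(0.0, 1.0) for c in p))
    return augmented
```

Passing a seeded `random.Random` instance makes an augmentation reproducible, which is convenient when debugging a training run.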

During inference, we tile the blocks, overlapping the masked regions to avoid inaccurate predictions from the edges.

To accommodate the variation in branch radii, which spans several orders of magnitude [41], we estimate a logarithmic radius. Because the error is then measured in log space, it penalizes relative rather than absolute differences in radius. Our loss function, as presented in Eq. 1, consists of two components: the L1-loss for the radius and the cosine similarity for direction loss. We employ the Adam optimizer with a batch size of 8 and an initial learning rate of 0.01. If the validation loss fails to improve over 10 consecutive epochs, we reduce the learning rate by a factor of 10.

$$\begin{aligned} \text {Loss} = \quad \underbrace{\sum _{i = 1}^{N} |\ln (R_i) - \hat{R}_i|}_{\text {Radius Loss}} \quad + \quad \underbrace{\sum _{i = 1}^{N} \frac{D_i \cdot \hat{D}_i}{||D_i||_2\cdot ||\hat{D}_i||_2}}_{\text {Direction Loss}} \end{aligned}$$
(1)
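A small numerical sketch of the two loss terms may help. The radius term follows Eq. 1 directly, comparing the logarithm of the true radius with the predicted logarithmic radius. For the direction term, the sketch uses one minus the cosine similarity so that perfectly aligned directions contribute zero; that sign convention is an assumption about the intended behaviour, since Eq. 1 as printed sums the cosine similarity itself. All function names are hypothetical, and directions are assumed to be non-zero.

```python
import math


def radius_loss(true_radii, pred_log_radii):
    """L1 distance between ln(true radius) and the predicted log-radius."""
    return sum(abs(math.log(r) - r_hat) for r, r_hat in zip(true_radii, pred_log_radii))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero 3D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def direction_loss(true_dirs, pred_dirs):
    """Assumed convention: 1 - cos, so aligned vectors give 0 and opposite give 2."""
    return sum(1.0 - cosine_similarity(d, d_hat) for d, d_hat in zip(true_dirs, pred_dirs))


def total_loss(true_radii, pred_log_radii, true_dirs, pred_dirs):
    """Sum of the radius and direction terms over all points."""
    return radius_loss(true_radii, pred_log_radii) + direction_loss(true_dirs, pred_dirs)
```

Because the radius term works in log space, predicting 2 cm for a 1 cm twig costs the same as predicting 20 cm for a 10 cm limb (both give ln 2).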

3.2 Subgraph algorithm

Fig. 4

a \(B_0\) farthest point, b \(B_0\) trace path, c \(B_0\) allocated points, d \(B_1\) farthest (unallocated) point, e \(B_1\) trace path and allocated points, f branch skeletons

Due to self-occlusion and noise inherent in the point cloud, we often encounter multiple connected components, as depicted in Fig. 2. We refer to each of these connected components as a sub-graph. These sub-graphs are processed sequentially. Figure 16 illustrates the output skeletons for each sub-graph derived from real data. For each sub-graph:

  1. A distance tree is created based on the distance from the root node (the lowest point in each sub-graph, shown in red in Fig. 4a) to each point in the sub-graph.

  2. We assign each point a distance based on a Single Source Shortest Path (SSSP) algorithm. A greedy algorithm extracts paths individually until all points are marked as allocated (steps a to f).

  3. We select a path to the furthest unallocated point and trace its path back to either the root (Fig. 4b) or an allocated point (Fig. 4d).

  4. We add this path to a skeleton tree (Fig. 4f).

  5. We mark points as allocated that lie within the predicted radius of the path (Fig. 4c).

  6. We repeat this process until all points are allocated (Fig. 4d, e).
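The steps above can be sketched for a single sub-graph as follows. This is an illustrative reading rather than the reference implementation: the graph is assumed to be an adjacency dictionary of `(neighbour, edge_length)` pairs over medial axis points, Dijkstra's algorithm stands in for the SSSP step, the root is taken as the node with the smallest z coordinate, and allocation uses a brute-force distance check. The names `shortest_paths` and `extract_branches` are hypothetical.

```python
import heapq
import math


def shortest_paths(graph, root):
    """Dijkstra SSSP: distance and parent of every node reachable from root."""
    dist = {root: 0.0}
    parent = {root: None}
    heap = [(0.0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, parent


def extract_branches(points, radii, graph):
    """Greedily peel off paths from the farthest unallocated point back to the tree."""
    root = min(graph, key=lambda n: points[n][2])  # lowest point in the sub-graph
    dist, parent = shortest_paths(graph, root)
    unallocated = set(dist)
    allocated = set()
    branches = []
    while unallocated:
        tip = max(unallocated, key=lambda n: dist[n])
        # Trace back until the root or an already allocated point is reached.
        path, node = [], tip
        while node is not None:
            path.append(node)
            if node in allocated:
                break
            node = parent[node]
        path.reverse()
        branches.append(path)
        # Allocate every point lying within the predicted radius of a path node.
        for n in list(unallocated):
            if any(math.dist(points[n], points[m]) <= radii[m] for m in path):
                unallocated.discard(n)
                allocated.add(n)
    return branches
```

Each returned branch is a list of node indices ordered from its attachment point (the root or a node on an earlier branch) to its tip, so the first branch is the longest root-to-tip path and later branches hang off it.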


4 Experiments

We assessed our method’s resilience to real-world tree point-cloud artefacts, like noise and missing points, using augmentations (see Sect. 4.1). We compared our approach with the semantic Laplacian-based contraction algorithm [20], measuring robustness via the metrics in Sect. 4.2. Our tests involved 20 synthetic dataset trees. Furthermore, we tested our approach on real-world data.

4.1 Point cloud augmentations

To simulate real-world point cloud noise, we adopted two noise profiles: subtractive and additive. The subtractive profile utilizes 3D Perlin noise [42] for its capacity to generate coherent, smooth patterns suitable for mimicking localized point dropouts. In contrast, the additive profile is based on Gaussian noise.

Using Taichi [43,44,45], we developed a GPU-accelerated Perlin noise generator, available at https://github.com/uc-vision/taichi_perlin. Both profiles can be adjusted in intensity for sensitivity analysis. In our experiments, for the additive profile, we varied the Gaussian noise magnitude. For the subtractive profile, we altered the point dropout percentages. Detailed specifications and visual outputs of these profiles can be found in Table 2 and Fig. 5.
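The two noise profiles can be mimicked with a small CPU sketch. It does not use the Taichi generator linked above: a simple smooth value noise (trilinear interpolation of pseudo-random lattice values) stands in for true Perlin gradient noise, and points with the lowest noise values are removed until the requested dropout fraction is reached. The cell size, the quantile-based dropout rule, and all function names are assumptions made for illustration.

```python
import math
import random


def _lattice_value(ix, iy, iz, seed):
    """Deterministic pseudo-random value in [0, 1) for one lattice corner."""
    return random.Random(hash((ix, iy, iz, seed))).random()


def _smooth(t):
    """Smoothstep fade so the field has no creases at cell borders."""
    return t * t * (3.0 - 2.0 * t)


def value_noise(p, cell, seed):
    """Smooth 3D noise in [0, 1]: trilinear blend of the 8 surrounding corners."""
    x, y, z = (c / cell for c in p)
    ix, iy, iz = math.floor(x), math.floor(y), math.floor(z)
    fx, fy, fz = _smooth(x - ix), _smooth(y - iy), _smooth(z - iz)
    total = 0.0
    for dx, wx in ((0, 1.0 - fx), (1, fx)):
        for dy, wy in ((0, 1.0 - fy), (1, fy)):
            for dz, wz in ((0, 1.0 - fz), (1, fz)):
                total += wx * wy * wz * _lattice_value(ix + dx, iy + dy, iz + dz, seed)
    return total


def subtractive_noise(points, drop_fraction, cell=0.5, seed=0):
    """Drop the given fraction of points where the coherent noise field is lowest."""
    values = [value_noise(p, cell, seed) for p in points]
    order = sorted(range(len(points)), key=values.__getitem__)
    dropped = set(order[:round(drop_fraction * len(points))])
    return [p for i, p in enumerate(points) if i not in dropped]


def additive_noise(points, sigma, seed=0):
    """Jitter every coordinate with zero-mean Gaussian noise of std sigma."""
    rng = random.Random(seed)
    return [tuple(c + rng.gauss(0.0, sigma) for c in p) for p in points]
```

Because the noise field varies smoothly in space, the removed points cluster into contiguous patches instead of being scattered uniformly, which is the property that makes coherent noise a reasonable stand-in for occlusion.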

Table 2 Noise generator configurations

Fig. 5

Noise applied to synthetic apple point clouds. Top: subtractive Perlin noise with magnitudes a 0.0, b 0.10, c 0.30, d 0.50. Bottom: Additive Gaussian noise with magnitudes e 0.000, f 0.005, g 0.015, h 0.025

4.2 Metrics

In the field of tree point cloud skeletonization, one of the significant challenges identified in the literature is the selection of appropriate metrics for the quantitative evaluation [14]. In response to this, we propose an enhanced approach for evaluating the robustness of each method, incorporating a modified set of point cloud reconstruction metrics.

We use the following metrics to assess our approach: f-score, precision, recall, and AUC over a range of radius thresholds. For the following metrics, we consider \(p^* \in \mathcal {S}^* \) points along the ground truth skeleton and \(p \in \mathcal {S} \) estimated medial axis points. \( p_r \) is the radius at each point. We use a threshold \({t}\), expressed as a fraction of the ground truth radius, which sets the maximum distance at which a point counts as a match. We test this over the range of 0.0–1.0. The f-score is the harmonic mean of the precision and recall.

Skeletonization Precision: To calculate the precision, we first get the nearest points from the medial axis points \(p_i \in \mathcal {S}\) to the ground truth skeleton \(p_j^* \in {\mathcal {S}}^*\), using a distance metric of the Euclidean distance relative to the ground truth radius \(r_j^*\). The operator \(\llbracket . \rrbracket \) is the Iverson bracket, which evaluates to 1 when the condition is true and to 0 otherwise.

$$\begin{aligned} d_{ij} = ||p_i - p_j^*|| \end{aligned}$$
(2)
$$\begin{aligned} P(t) = \frac{100}{|\mathcal {S}|} \sum _{i \in \mathcal {S}} \llbracket d_{ij} < t \ r_j^* \wedge \mathop {\forall }_{k \in \mathcal {S}} d_{ij} \le d_{kj} \rrbracket \end{aligned}$$
(3)

Skeletonization Recall: To calculate the recall, we first get the nearest points from the ground truth skeleton \(p_j^* \in {\mathcal {S}}^*\) to the output medial axis points \(p_i \in \mathcal {S}\). We then calculate which points fall inside the thresholded ground truth radius. This gives us a measurement of the completeness of the output skeleton.

$$R(t) = \frac{100}{|{\mathcal {S}}^*|} \sum _{j \in {\mathcal {S}}^*} \llbracket d_{ij} < t \ r_j^* \wedge \mathop {\forall }_{k \in {\mathcal {S}}^*} d_{ij} \le d_{ik} \rrbracket$$
(4)
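The metrics can be illustrated with a brute-force sketch that follows the prose description: precision matches each estimated point to its nearest ground truth point, recall matches each ground truth point to its nearest estimated point, and a match counts when the distance is below \(t\) times the ground truth radius. Plain Euclidean distance is used for the nearest-neighbour search, and the AUC is a trapezoidal integral normalized by the threshold range; both choices, like the function names, are assumptions of this sketch.

```python
import math


def _nearest(query, candidates):
    """Index of the candidate closest to query (brute force)."""
    return min(range(len(candidates)), key=lambda k: math.dist(query, candidates[k]))


def precision(est, gt, gt_radii, t):
    """Percentage of estimated points within t * radius of their nearest GT point."""
    hits = 0
    for p in est:
        j = _nearest(p, gt)
        if math.dist(p, gt[j]) < t * gt_radii[j]:
            hits += 1
    return 100.0 * hits / len(est)


def recall(est, gt, gt_radii, t):
    """Percentage of GT points whose nearest estimated point is within t * radius."""
    hits = 0
    for j, q in enumerate(gt):
        i = _nearest(q, est)
        if math.dist(q, est[i]) < t * gt_radii[j]:
            hits += 1
    return 100.0 * hits / len(gt)


def f_score(p, r):
    """Harmonic mean of precision and recall, defined as 0 when both are 0."""
    return 0.0 if p + r == 0 else 2.0 * p * r / (p + r)


def auc(thresholds, values):
    """Trapezoidal area under the curve, normalized by the threshold range."""
    area = sum(
        0.5 * (values[k] + values[k + 1]) * (thresholds[k + 1] - thresholds[k])
        for k in range(len(values) - 1)
    )
    return area / (thresholds[-1] - thresholds[0])
```

For a skeleton with thousands of points, the brute-force search would normally be replaced by a spatial index such as a k-d tree; the definitions of the metrics stay the same.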

5 Results

In this study, we conducted a comparative analysis of our Smart-Tree (ST) method [28] against the Semantic Laplacian-based Contraction (SLBC) algorithm [20]. Previously, we had evaluated ST alongside the AdTree algorithm [26]. However, in the current analysis, we excluded AdTree due to the SLBC algorithm’s distinct output format. Unlike AdTree, which produces skeletal lines, SLBC outputs point data, presenting challenges in applying the same evaluation metric.

The choice to compare ST with SLBC was driven by their methodological similarities, as both are contraction-based approaches. This comparison aims to highlight the distinct advantages and limitations of our ST method in relation to SLBC.

Looking ahead, we plan to broaden our comparative framework to encompass deep-learning-based methods. An example of such an approach is TreePartNet [27]. However, in this instance, a comparison with TreePartNet was not viable due to its limitation in processing point clouds, specifically restricted to no more than 16K points. This constraint did not match the dataset parameters of our current study and is less applicable to real-world data scenarios (Fig. 6).

5.1 Gaussian (additive noise)

The SLBC method consistently outperforms ST in precision AUC at all noise levels, as depicted in Table 3 and visualized in Fig. 7. On the other hand, ST typically exceeds SLBC in recall AUC, with this advantage becoming more pronounced as noise levels rise, as shown in Fig. 8. For F1 AUC, both methods show competitive results, though SLBC has a marginal advantage in most cases, as observed in Fig. 9. One reason for SLBC’s superior precision is its iterative approach (Fig. 10). This method provides a more robust and adaptable mechanism, facilitating better convergence to the medial axis, as shown in Fig. 6. While ST achieves a more consistent recall AUC as noise intensifies, SLBC struggles to recover smaller branches under increased noise conditions, and suffers from mal-contraction as shown in Fig. 6h.

Fig. 6

Examples of additive noise outputs: a Noise applied to Apple Tree@0.015, b SLBC Output, c ST Output, d Noise applied to Walnut Tree@0.025, e SLBC Output, f ST Output, g ground truth skeleton of Cherry Tree, h SLBC Output—showing a mal-contraction, i ST Output

Fig. 7

Macro precision at each radius threshold for Gaussian (additive) noise

Fig. 8

Macro recall at each radius threshold for Gaussian (additive) noise

Fig. 9

Macro F1 score at each radius threshold for Gaussian (additive) noise

Table 3 AUC results for additive noise

5.2 Perlin (subtractive noise)

As the intensity of the Perlin subtractive noise rises, indicating a greater likelihood of point dropouts, both the ST and SLBC methods experience a decline in precision, recall, and F1 score AUC values, as shown in Table 4 and observed in Figs. 11 and 13. The recall is especially impacted due to the increased challenge of recovering missing regions. Notably, the ST method displays greater resilience to this noise type, as evidenced by its slower degradation rate as observed in Fig. 12. The disparity in recall, precision, and F1 performance becomes more pronounced with increased noise, highlighting ST’s robustness, especially at higher noise levels. In Fig. 10b, e, undesirable artifacts in the SLBC method are visible: the points fail to contract to the medial axis, especially on the larger branches, and the output shows more gaps (Figs. 13, 14, 15).

Table 4 AUC results for subtractive noise

Fig. 10

Examples of subtractive noise outputs: a noise applied to Chinaberry Tree@0.2, b SLBC output, c ST output, d noise applied to London Tree@0.5, e SLBC Output, f ST Output. g Ground truth skeleton of Cherry Tree, h SLBC Output—showing a mal-contraction, i ST output

Fig. 11

Macro precision at each radius threshold for Perlin (subtractive) noise

Fig. 12

Macro recall at each radius threshold for Perlin (subtractive) noise

Fig. 13

Macro F1 score at each radius threshold for Perlin (subtractive) noise

Fig. 14

Point cloud

Fig. 15

Branch mesh

5.3 Real-world data

To demonstrate our method’s ability to work on real-world data, we test it on a tree from the Christchurch Botanic Gardens, New Zealand. As this tree has foliage, we train our network to segment away the foliage points and then run the skeletonization algorithm on the remaining points. In Fig. 16, we can see that Smart-Tree can accurately reconstruct the skeleton.

Fig. 16

Skeleton sub-graphs

6 Conclusion and future work

We proposed an enhanced method for evaluating the skeletonization of point clouds, specifically focusing on estimating the medial axis of tree point clouds. Our research demonstrates the advantages of our previously developed approach, which utilizes a learned method for medial axis approximation. This approach exhibits robustness when dealing with additive and subtractive noise in point cloud data.

In the future, our aim is to further enhance the robustness of our method by addressing gaps in the point cloud. To achieve this, we plan to develop techniques for filling in these gaps during the medial-axis estimation phase. Additionally, we intend to expand the scope of our research by training our method on a more diverse range of synthetic and real trees. To facilitate this expansion, we will enrich our dataset with a wider variety of trees, including those with foliage, and incorporate human annotations for real trees. This will lead to improved performance on a wider range of trees. Furthermore, we are actively working on refining our error metrics to better capture topology-related errors in our evaluations.