1 Introduction

The world’s forests cover 31% of the total land area (FAO 2020), which makes them one of the most important ecological systems on earth. Trees are fundamental for life on earth because of their ability to absorb carbon dioxide and produce oxygen in return. They affect regional climates and the hydrological balance. Additionally, they provide a habitat to a plethora of living creatures and serve humans as a renewable resource for construction material and energy. Thus, they are fundamental to biodiversity and human life. Another important part of this ecosystem is dead wood, which is part of the natural life cycle. When trees die, they decompose and form a new type of habitat for insects, fungi, and birds (BMEL 2015). This process returns nutrients to the soil for the growth of new trees. However, an increase in tree loss due to climate change and related extreme weather conditions, such as storms, droughts, and their respective consequences, has been observed. Thus, a good understanding of forest structures and tree stands is required for successful forest management, conservation, and planning. Knowledge of tree species, health, and location within a forest area is of particular interest. Traditionally, this information is obtained in situ using measurements from sample plots and extrapolated to larger scales (McRoberts and Tomppo 2007).

The capture of information relevant to forestry was accelerated through the emergence of remotely sensed data from space or airborne platforms, such as satellites, airplanes, helicopters, or remotely piloted aircraft. These can be equipped with multi- or hyperspectral sensors to capture high-resolution images. In recent years, data acquisition has extended to using lidar, which penetrates the tree canopy and provides information from the lower layers of the forest. Additionally, the laser points provide geometric and radiometric information (Korpela et al. 2010). In a preprocessing step, single trees are then segmented from the captured point clouds to derive further information at the individual tree (object) level.

Conventionally, features are developed and engineered to describe different traits, such as tree health and species. Usually, a statistical classification or machine learning (ML) model is trained to infer labels on unlabeled data. Fassnacht et al. (2016) showed that well-known parametric and non-parametric approaches—such as maximum likelihood, support vector machines, or random forest—are often used to classify trees, whilst neural networks are not commonly employed. However, recent developments in the field of computer vision and deep learning have attracted more attention to the development of models that learn from three-dimensional data (Guo et al. 2021). These approaches can reduce the time needed to develop hand-crafted features from the point clouds, because they extract information and build representations automatically. However, they need much more training data compared to traditional ML methods.

1.1 Deep Learning on Point Clouds

Deep learning on images has become very successful with the rise of convolutional neural networks (CNNs). These networks use the regular grid structure of images and can derive information through the neighborhood relationships of pixels using convolution operators. However, applying this approach to point clouds is challenging, because point clouds are inherently irregular and unordered. In recent years, an increasing number of publications have dealt with deep learning on point clouds. These publications provide various approaches to address problems such as 3D point cloud segmentation, point cloud registration, 3D reconstruction, or object classification. In this paper, we focus on classification algorithms for tree classes at the tree level extracted from lidar data. Griffiths and Boehm (2019) provide a very good overview of the techniques for applying deep learning to three-dimensional sensed data. Guo et al. (2021) also present an extensive overview of deep learning models applied specifically to point clouds.

In the field of object classification, there are different approaches for handling and processing data. Initial attempts, such as the multi-view convolutional neural network (MVCNN) (Su et al. 2015), projected the 3D data on two-dimensional planes prior to processing. Projecting the objects onto several planes in space generates multiple images. These projections of the point cloud are then handled like usual images and can be processed by the well-known 2D convolutions. Other approaches were developed to discretize the point cloud directly within three-dimensional space. A 3D grid structure is generated by voxelizing the space and summarizing multiple points into a spatial unit. The 3D convolution can be applied to the voxels to further extract global descriptors. A notable example of this kind is VoxNet (Maturana and Scherer 2015). However, this subdivision of space does not scale well for denser point clouds.

To avoid the possible loss of information due to the discretization of the point cloud, newer networks have been developed that act directly on the points. A notable example is PointNet (Qi et al. 2017). This network learns pointwise features through several multilayer perceptrons (MLPs). Because this network does not capture local information and the features are learned independently, Qi et al. (2017) proposed its successor PointNet++. This network applies PointNet learning layers in a hierarchical fashion to smaller groups of points to learn their neighborhood structures. PointNet and PointNet++ are widely considered as milestones in deep learning on point clouds (e.g., in Jiao and Yin 2020; Wang et al. 2020). Both networks can be categorized as point-based MLP methods.

Another category of deep learning networks that act directly on points are convolution-based approaches. These networks often consider the neighborhood of points and then use convolution to act on them. For the convolution, either continuous or discrete representations of the points are used. This can be realized through an operator to bring them into a latent or canonical space.

A relatively new network architecture is PointCNN, proposed by Li et al. (2018), which is evaluated in the present study. We compare its performance to the 3D-modified Fisher Vectors Net (3DmFV-Net), proposed by Ben-Shabat et al. (2018). 3DmFV-Net uses a discrete three-dimensional grid together with a continuous function to transform the point cloud into another representation and subsequently applies 3D convolutions. Therefore, this method has been described as a hybrid approach.

Both networks are further explained in Sect. 3. Some point cloud deep learning networks have already been successfully applied to forest related tree classifications or segmentation. Such studies are presented in the following section.

1.2 Related Work

Research on automated tree species classification from lidar is mostly conducted in the field of ML or other statistical methods. However, conventional ML often requires engineered features to be derived from the raw data, namely the point clouds, to then learn the classifier. Using deep learning, the need for specially developed features is reduced or even eliminated, because the networks automatically extract features from the learned representations of the raw input data (LeCun et al. 2015). However, studies using deep learning models on segmented individual trees from lidar data to classify their species have not yet been widely explored. Most significantly, this applies to cases in which approaches act on the lidar points themselves instead of discretizing them into two- or three-dimensional representations, namely pixels or voxels. Studies employing deep neural networks (DNNs) to classify individual trees on an object level in combination with aerial lidar data are presented in the following.

Hamraz et al. (2019) used two airborne lidar acquisitions captured in Robinson Forest in leaf-on and leaf-off conditions to classify single trees into coniferous and deciduous species. For both seasons, they generated digital surface models from the upper canopy layer and an image stack compiled by nadir and side views of the single trees. In total, 124 coniferous and 2404 deciduous trees were available for training. These were then fed into a two-dimensional CNN model. Overstory trees reached classification accuracies of 92.1% for coniferous and 87.2% for deciduous trees. In contrast, understory trees were not as strongly identified, with accuracies of 69.0% and 92.1% for coniferous and deciduous trees, respectively.

By combining lidar data with high-resolution multispectral satellite imagery, Hartling et al. (2019) classified eight different tree species—of which seven were deciduous and one coniferous—in an urban environment. They generated intensity images from the lidar data and used these in combination with the panchromatic and infrared bands from WorldView 2 and 3 images. Training a DenseNet (Huang et al. 2017) classifier on the image patches of single trees—a total of 1552 tree samples for training—yielded an overall accuracy (OA) of 82.9% and an average accuracy of 80.9% for all eight classes.

Mustafić and Schardt (2019) used airborne laser scanning (ALS) data from a forest near Burgau, Austria, with a relatively low point density (15 pts/m\(^{2}\)). They examined three tree classes: pine, spruce, and deciduous trees (consisting of beech, birch, oak, ash, and alder). Each of these classes contains exactly 500 trees. The local maxima from a Canopy Height Model (CHM) were used as positions for tree candidates. Using different radii around those points, the single trees were approximated from the point cloud, and the neural network found the most fitting size. They further reduced the single trees into two dimensions by projecting the trees onto a plane and encoding the orientation of the points as color. Additionally, a set of one-dimensional representations (histograms and percentiles) was used as input for the DNN. Their proposed method of using multiple CNNs reached a mean OA of 74%, which was higher than their comparative experiments using transfer learning with InceptionV3 (Szegedy et al. 2016) and VGG16 (Simonyan and Zisserman 2015).

The study conducted by Sun et al. (2019a) in a wetland ecosystem in south China aimed at classifying 18 different tree species. The authors delineated single trees using the CHM generated from the lidar point cloud. Image patches for each tree were extracted and used for training three CNNs. To increase the number of training samples, the image patches were randomly rotated, mirrored, and flipped. This data augmentation resulted in 5664 samples for the training process. VGG16, AlexNet (Krizhevsky et al. 2012), and ResNet50 (He et al. 2016) were adjusted to take image patches of size \(64\times 64\) as input and then trained on their dataset. The best result was obtained using VGG16, with an OA of 73.55% and an \(F_1\)-score of 100% in two classes. On the same study site, Sun et al. (2019b) classified six tree species using a similar approach. Single trees were identified in the lidar CHM and used to crop multispectral images into patches of \(64\times 64\) pixels. By training the ResNet50 model on these patches, an OA of 89.8% was achieved.

Li et al. (2020) generated aerial view intensity images from ALS data of urban settings. These were combined with high-resolution multispectral WorldView 2 images, resulting in an 11-band composite. Segmented single trees (maple, locust, pine, and spruce) were classified. From this, 1052 individually identified trees were used for training, and through data augmentation (i.e. rotation and mirroring), a total of 6312 images were generated. Three DNNs—DenseNet40, ResNet18, and a 7-layer CNN—were compared in their classification performance, resulting in a maximum OA of 88.9% using the ResNet.

Individual trees comprising three species with additional standing dead trees were successfully classified by Briechle et al. (2020). In this study, the point cloud was not discretized into a lower dimensional space, namely two-dimensional images. Instead, the point-based PointNet++ (Qi et al. 2017) was applied to the point clouds of the single-segmented trees. In the learning process, point clouds of 464 trees were used to train the classifier. An OA of 90.2% was reached, with the additional integration of features derived from the echo width of the laser sensor and features calculated from a multispectral orthophoto.

Mäyrä et al. (2021) used a three-dimensional CNN on hyperspectral images (a total of 250 bands) with trees delineated from a lidar CHM. Here, 2291 trees were used to train their proposed network from scratch. Their proposed network managed to differentiate four tree species (birch, pine, aspen, and spruce) with an OA of 87.0%. Like in most studies, the lidar point cloud was solely used for tree segmentation, whilst the classification was conducted with optical imagery.

In a recent study, Seidel et al. (2021) projected the point clouds of individual trees onto multiple planes to form a two-dimensional representation. They generated 8100 images from point clouds of 690 different trees, of which 6720 images were used in the training process. They trained a multilayered CNN on these images to learn to differentiate between seven different tree species. The OA in this experiment reached 86.0%. Interestingly, PointNet (Qi et al. 2017) was used on the raw point clouds, but could not reach competitive results, because most trees were falsely predicted.

Recently, Liu et al. (2021) published a custom DNN LayerNet designed for tree species classification. They used ALS data from a forested area in Saihanba National Forest Park (China) to classify birch and larch trees. With a combined segmentation approach, single trees were delineated from the point cloud. Their proposed LayerNet was then trained and evaluated on a set of 1200 trees. Unlike the other presented studies, no multispectral data were used in addition to the point clouds. The authors reported an OA of 88.8%, which was slightly better than the OA of 86.7% achieved using PointNet.

Briechle et al. (2021) developed a DNN model using a multi-view-based approach for classifying trees, called Silvi-Net. As a baseline method, the point-based network PointNet++ (Qi et al. 2017) was used. Both networks were applied on the point clouds of pre-segmented trees from two study sites. In the Chernobyl Exclusion Zone study area, four tree classes were differentiated (the same as in Briechle et al. 2020). With PointNet++, an OA of 84.8% was achieved. In contrast, Silvi-Net achieved an OA of 96.1% using lidar and multispectral data. For the second study area, Bavarian Forest National Park (BFNP), OAs of 89.3% and 91.5% were achieved with PointNet++ and Silvi-Net, respectively. Notably, this study also used the same dataset as described in Sect. 2.

Deep learning approaches for tree species classification from lidar point clouds show promising results. However, most studies use CNN approaches based on images produced from the point clouds. Only a few studies apply DNNs that directly exploit the three-dimensional geometry of the point clouds. Moreover, a challenge in this field is the lack of access to large labeled datasets for training.

1.3 Objectives

Until recently, lidar was mostly used for individual tree segmentation and not as input data for deep learning models. Only a few studies have employed raw point cloud data and their underlying geometry in the applied deep learning networks. However, lidar data contain more information about forest structure compared to optical imagery. The laser beam penetrates the upper canopy and captures information about the understory vegetation and smaller trees. In this paper, we therefore evaluate the classification performance of two DNNs for tree classes (i.e., species and standing dead trees) by learning on three-dimensional point clouds. Both networks use different approaches to handle the underlying geometric information provided by pre-segmented trees. Moreover, we test how additional features, such as the intensity of the laser beam and multispectral information, affect the classification accuracy. Finally, the results are compared with a study that used the same data but relied on an image-based approach.

2 Materials

2.1 Study Area

Fig. 1 Overview of the study area located in Bavarian Forest National Park. On 18 August 2016, two transects were flown with a lidar scanner attached to a helicopter, as presented in Sect. 2.2. The two transects of the lidar acquisition are marked in red

The study area for the present research is in the BFNP near the German–Czech border (see the two red transects in Fig. 1). Since its establishment in 1970, the national park has been governed by the guiding principle of ‘letting nature be nature’ (Nationalparkverwaltung Bayerischer Wald 2021), which applies to its whole area of 24,250 hectares. The tree species that mainly populate the forest are Norway spruce (Picea abies), European beech (Fagus sylvatica), silver fir (Abies alba), and larch (Larix). After severe storms and windthrow in the years 1983 and 1984, the forest struggled with a bark beetle infestation (Bibelriether 1989, as cited in Müller et al. 2008). This mass propagation—called a gradation—was not contained, as the policy was to avoid human intervention under almost any circumstances. Long periods of drought and mild climate favored a second gradation in the early 1990s, which continues to this day (Heurich et al. 2001, as cited in Müller et al. 2008). Consequently, more than 5000 ha of the spruce stand had already died by 2007 (Lausch et al. 2011). These natural stresses resulted in large amounts of standing dead wood.

The tree species classified in this study comprise deciduous and coniferous trees, as well as standing dead trees with crown, hereafter referred to as dead tree. The fourth class is called snag, which also refers to a standing dead tree but without a crown and missing most of its branches (Yao et al. 2012).

2.2 Data Acquisition

Table 1 Sensor equipment for the lidar data capture (after Amiri et al. 2019)

On 18 August 2016, during the leaf-on season, full-waveform lidar data were acquired in the study area using a Riegl LMS-Q680i scanner carried by a helicopter. Two transects were flown with a 50% overlap (see Fig. 1). A flight speed of 25 m/s at an altitude of 400 m above ground resulted in a maximum point density of 80 points/m\(^{2}\). Note that laser reflectance values were normalized w.r.t. travelling distance (Amiri et al. 2019). The summarized key data of the flight are shown in Table 1.

Within the same vegetation period, in June 2016, multispectral aerial photographs were captured using a Leica DMC III camera. True orthophotos were generated using a digital surface model calculated from the lidar data. Further details are shown in Table 2.

Table 2 Sensor equipment for the multispectral data capture (after Briechle et al. 2021)

2.3 Data Pre-processing

2.3.1 Segmentation of Lidar Point Cloud

Individual trees were delineated using Normalized Cut segmentation (Shi and Malik 2000) and the software package Treefinder (see Krzystek et al. 2020). The quality of the segmentation was assessed through a visual inspection. An example of the segmented single trees is shown in Fig. 2a. Two-dimensional enclosing polygons were calculated from the point clouds of the delineated trees. The class assignment was done manually by visual interpretation of the individual tree point clouds and the corresponding representation in the aerial photograph. Incorrect segments were not taken into account in this process. The classes dead tree and snag were differentiated by subjective perception (Briechle et al. 2021). Shape differences between the four tree classes are exemplarily shown in Fig. 2b.

Fig. 2 a Single trees segmented by normalized cut. Different colors represent identified individual trees. b Point cloud examples of the four tree classes from left to right: coniferous, deciduous, dead tree, and snag

To achieve a constant number of points per object, each tree was up-/down-sampled to 1024 points. Samples with fewer points were upsampled by insertion of random copies of points. Trees with more points were reduced to this number by randomly choosing 1024 points out of the point set. Next, the 3D points of each tree were normalized to have a mean of zero, i.e., centered around the origin, and uniformly scaled to fit into a unit sphere. This led to coordinates in the range of \([-1, 1]\) for all three dimensions.
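A minimal sketch of this resampling and normalization step is given below; the random sampling strategy and the function name are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def resample_and_normalize(points, n=1024):
    """Resample one tree point cloud to exactly n points and fit it into the unit sphere."""
    m = points.shape[0]
    if m >= n:
        idx = rng.choice(m, size=n, replace=False)         # random down-sampling
    else:
        extra = rng.choice(m, size=n - m, replace=True)    # insert random copies of points
        idx = np.concatenate([np.arange(m), extra])
    pts = points[idx]
    pts = pts - pts.mean(axis=0)                           # zero mean, centered at the origin
    pts = pts / np.linalg.norm(pts, axis=1).max()          # uniform scaling into a unit sphere
    return pts
```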

2.3.2 Generation of Multispectral Features

The Normalized Difference Vegetation Index (NDVI) (Rouse et al. 1974) (cf. Eq. 1) was calculated from the color infrared images

$$\begin{aligned} \mathrm {NDVI}=\frac{\mathrm {NIR} - \mathrm {R}}{\mathrm {NIR} + \mathrm {R}}. \end{aligned}$$
(1)

All pixels of each tree were extracted using the enclosing polygons generated from the segmented single trees. Because the trees differ in size, each enclosing polygon can overlap with a different number of pixels. It is favorable to have a constant number of feature inputs for deep learning. Thus, the pixel values were aggregated into 12 object-based descriptive statistics. In detail, the minimum (\(\min\)) and maximum (\(\max\)) values, the range (\(\max - \min\)), mean, standard deviation, mode, skewness, and kurtosis comprised eight features. Additionally, the 25th, 50th, 75th, and 90th percentiles (perc25, perc50, \(\ldots\)) were computed and normalized. The five most important features for a classification were then identified using a random forest-based feature selection, which was performed by Briechle et al. (2021). The five final features were min, mean, perc25, skewness, and range. These final features were also used to make a one-to-one comparison with the results of Briechle et al. (2021). In the following, these features will be referred to as MS.
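The aggregation of the per-tree NDVI pixels into the five selected MS features could look as follows; this is a sketch that assumes the NIR and red pixel values within one enclosing polygon are already extracted as flat arrays, and it omits the percentile normalization mentioned above:

```python
import numpy as np
from scipy.stats import skew

def ms_features(nir, red):
    """Compute NDVI per pixel (Eq. 1) and aggregate it into the five final MS features."""
    ndvi = (nir - red) / (nir + red)
    return np.array([
        ndvi.min(),                  # min
        ndvi.mean(),                 # mean
        np.percentile(ndvi, 25),     # perc25
        skew(ndvi),                  # skewness
        ndvi.max() - ndvi.min(),     # range
    ])
```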

Finally, surface normals for all points were estimated using the estimate_normals function from the open source library Open3D provided by Zhou et al. (2018) (see also Briechle et al. 2021). The default settings were used, meaning that the 30 nearest neighbors are used to perform the normals estimation.
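With a recent Open3D version this corresponds roughly to the following call; the variable tree_points is a placeholder for the (N, 3) coordinate array of one segmented tree, and the explicit KNN parameter mirrors the described default of 30 nearest neighbors:

```python
import numpy as np
import open3d as o3d

pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(tree_points)           # tree_points: (N, 3) coordinates
pcd.estimate_normals(
    search_param=o3d.geometry.KDTreeSearchParamKNN(knn=30))    # 30 nearest neighbors
normals = np.asarray(pcd.normals)                               # (N, 3) estimated surface normals
```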

2.3.3 Splitting the Dataset

The dataset containing the segmented trees was divided into three subsets (train, validation, test), comprising 51%, 22%, and 27% of the data, respectively. Note that the number of samples for each class was balanced in the train dataset to avoid bias effects (see Table 3 for details).
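One possible way to balance the training split is to downsample every class to the size of its smallest class; the sketch below assumes this strategy, whereas the paper only states that the classes were balanced (see Table 3 for the actual counts):

```python
import numpy as np

rng = np.random.default_rng(42)

def balance_train_split(train_idx, labels):
    """Downsample each class in the training split to the smallest class size."""
    train_labels = labels[train_idx]
    classes, counts = np.unique(train_labels, return_counts=True)
    n_min = counts.min()
    balanced = [rng.choice(train_idx[train_labels == c], size=n_min, replace=False)
                for c in classes]
    return np.concatenate(balanced)
```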

Table 3 Number of samples in the dataset

2.3.4 Standardization of Laser Intensities

The laser intensities, with values ranging from 0 to 3694, were transformed using a power transformation. LeCun et al. (1998) showed that normalizing the input data is advantageous for faster convergence. They suggest that all input data should have a mean close to zero and approximately the same covariance. We used the two-parameter Box–Cox transformation, as proposed by Box and Cox (1964), which is defined by a pair of transformation parameters \(\lambda =\left( \lambda _1, \lambda _2\right)\). The function (Eq. 2) holds for \(y>-\lambda _2\). Because the intensity values were greater than or equal to zero, the parameter \(\lambda _2\) was set to \(1\) and \(\lambda _1\) to \(0\)

$$\begin{aligned} {\text {fbc}}(y, \lambda ) = y^{(\lambda )} = {\left\{ \begin{array}{ll} \frac{\left( y + \lambda _{2}\right) ^{\lambda _{1}} - 1}{\lambda _{1}} &{} (\lambda _{1} \ne 0),\\ \log \left( y + \lambda _2\right) &{} (\lambda _{1} = 0). \end{array}\right. } \end{aligned}$$
(2)

After a subsequent standardization using Eq. 3, the data values were centered around zero and had a standard deviation of 1

$$\begin{aligned} z\mathrm {-score}(y)=y_{z} = \frac{y-{\bar{y}}}{s_y}; \end{aligned}$$
(3)
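With \(\lambda_1=0\) and \(\lambda_2=1\), Eq. 2 reduces to \(\log(y+1)\); a compact sketch of the resulting standardization pipeline is:

```python
import numpy as np

def standardize_intensities(intensity):
    """Box-Cox transform with lambda_1 = 0, lambda_2 = 1 (Eq. 2), then z-score (Eq. 3)."""
    y = np.log(intensity + 1.0)            # log(y + lambda_2) branch of Eq. 2
    return (y - y.mean()) / y.std(ddof=1)  # zero mean, unit standard deviation
```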

Figure 3a, b show the effect of the power transformation and the subsequent standardization.

Fig. 3 Histograms of the laser intensities for all trees. The distribution of the laser intensities is transformed to approximate a normal distribution. This is achieved by applying a power transformation to the power-law distributed intensities, thereby boosting classification performance. Note that the frequency scale is logarithmic. a Before any transformation or normalization. b After power transformation and subsequent standardization

3 Methodology

3.1 3D-Modified Fisher Vectors-Net (3DmFV-Net)

3DmFV-Net is a deep learning network proposed by Ben-Shabat et al. (2018) for object classification from three-dimensional data, namely point clouds. The main idea of the network is to represent the point cloud in another form and then convolve over it. The representation is made up of two components. First, a Gaussian Mixture Model (GMM) is applied to fill the space. The means of the single Gaussians, which have a uniform standard deviation in their dimensions, are placed on a three-dimensional regular grid. The second component is the representation of the data in the form of 3D-modified Fisher Vectors (3DmFV).

The proposed approach is denoted as a hybrid model because the discrete grid is combined with the continuous character of the Gaussians and the 3DmFVs. This combination can be viewed as ‘soft voxels’, because they have fixed locations but no discrete borders.

3.1.1 Fisher Vectors

A set of points \(X=\left\{ p_t\in {\mathbb {R}}^D,t=1,\ldots ,T\right\}\) is given to denote the point cloud of a segmented tree. Each of these points lies in three-dimensional space \(D=3, p_t=\left[ x,y,z\right] '\). The set contains a total of \(T\) samples. A GMM is given with \(K\) components, in which each is described by its mean, or expected value, \(\mu\), its covariance matrix \(\Sigma\), and mixture weight \(\alpha\). Let then \(\lambda =\left\{ \left( \alpha _k, \mu _k, \Sigma _k\right) , k=1,\ldots ,K\right\}\) be the set of parameters of the GMM.

The likelihood of an observed point \(p_t\) being generated by the whole GMM is a weighted sum of single Gaussians \(u_k\)

$$\begin{aligned} u_\lambda (p_t)=\sum _{k=1}^K w_k u_k(p_t) \end{aligned}$$
(4)

with the condition that the weights sum to one (\(\sum w_k=1\)), so that \(u_\lambda\) is a valid probability density function. As each component is described by a weight \(\alpha _k\), these are normalized using the softmax function

$$\begin{aligned} w_k=\frac{\exp \left( \alpha _k\right) }{\sum _{j=1}^K \exp \left( \alpha _j\right) };\quad k=1,\ldots ,K. \end{aligned}$$
(5)

The posterior probability that a point \(p_t\) is associated with the \(k\)th component is given by the soft assignment

$$\begin{aligned} \gamma _t(k)=\frac{w_ku_k(p_t)}{u_\lambda \left( p_t\right) }; \end{aligned}$$
(6)

Sánchez et al. (2013) described the Fisher Vector \({\mathscr {G}}_\lambda ^X\) formally as the normalized gradient of the log-likelihood. The normalization term \(L_\lambda\) is given by the Cholesky decomposition of the inverse Fisher Information Matrix (FIM)

$$\begin{aligned} {\mathscr {G}}_\lambda ^X=\sum _{t=1}^{T}L_\lambda \nabla _\lambda \log u_\lambda (p_t). \end{aligned}$$
(7)

Assuming the soft assignment is sharply peaked, the FIM is approximately diagonal and the gradient statistics can be computed from three gradients \({\mathscr {G}}_{\alpha _k}^X\), \({\mathscr {G}}_{\mu _k}^X\), and \({\mathscr {G}}_{\sigma _k}^X\).

The Fisher Vector is then defined as the concatenation of the summed gradients with respect to the parameters \(\lambda\)

$$\begin{aligned} \begin{aligned} {\mathscr {G}}_{FV_\lambda }^X =&\left( {\mathscr {G}}_{\alpha _1}^X, \ldots , {\mathscr {G}}_{\alpha _k}^X,\right. \\&\left. {{\mathscr {G}}_{\mu _1}^{X}}', \ldots ,{{\mathscr {G}}_{\mu _k}^{X}}',\right. \\&\left. {{\mathscr {G}}_{\sigma _1}^{X}}', \ldots , {{\mathscr {G}}_{\sigma _k}^{X}}'\right) . \end{aligned} \end{aligned}$$
(8)

To not depend on the number of samples in each given set of points, the Fisher Vector is normalized by the number of sampled points

$$\begin{aligned} {\mathscr {G}}_{FV_\lambda }^X \leftarrow \frac{1}{T}{\mathscr {G}}_{FV_\lambda }^X. \end{aligned}$$
(9)

Ben-Shabat et al. (2018) expanded this concept to the three-dimensional space and added different symmetric functions. They used the sum as a possible feature generation function but also the \(\min\) and \(\max\) functions with respect to the parameters \(\alpha , \mu , \sigma\). They then constructed the 3DmFV as follows:

$$\begin{aligned} {\mathscr {G}}_{3DmFV_\lambda }^X = \begin{bmatrix} \left. \sum _{t=1}^T g_\lambda ^X \right| _{\lambda =\alpha , \mu , \sigma } \\ \left. \max _t\left( g_\lambda ^X \right) \right| _{\lambda =\alpha , \mu , \sigma } \\ \left. \min _t\left( g_\lambda ^X \right) \right| _{\lambda =\mu , \sigma } \end{bmatrix}. \end{aligned}$$
(10)

This vector is calculated for each of the \(K\) components of the GMM. In the case of \(D=3\), both \(\mu\) and \(\sigma\) are size 3 vectors, whilst \(\alpha\) is a scalar. This generates a 3DmFV of size \(D(3+3)+2=20\).

Subsequently, the 3DmFV is normalized twice, in accordance with Perronnin et al. (2010). These two normalization steps are a signed power normalization \(f_p(z)={\text {sign}}(z)|z|^{0.5}\) followed by a \(\ell ^2\)-normalization.
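These two normalization steps can be written compactly; the following is a minimal sketch, not tied to a specific implementation:

```python
import numpy as np

def normalize_fv(v, eps=1e-12):
    """Signed power normalization followed by l2 normalization (Perronnin et al. 2010)."""
    v = np.sign(v) * np.sqrt(np.abs(v))   # f_p(z) = sign(z) * |z|^0.5
    return v / (np.linalg.norm(v) + eps)  # l2 normalization
```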

3.1.2 Gaussian Mixture Model

All tree segments are fitted into a sphere centered at the origin with a radius of one (see Sect. 2). The single GMM components are placed on a uniform 3D grid in the range of \(\left[ -1,1\right]\) in each dimension with a size of \(m\times m\times m\), with \(m\ge 3\) giving a total of \(K=m^3\) components. Each is assigned a mixture weight of \(\alpha _k={K}^{-1}\) and a uniform standard deviation in each direction of \(\sigma _k = {m}^{-1} \Rightarrow \Sigma _k=\sigma _k I\).
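A sketch of this grid construction is shown below; whether the Gaussian means sit on grid-cell centers or on equally spaced positions including the boundaries is an implementation detail, and equally spaced positions are assumed here:

```python
import numpy as np

def build_gmm_grid(m=8):
    """Place K = m**3 isotropic Gaussians on a uniform grid in [-1, 1]^3 (Sect. 3.1.2)."""
    axis = np.linspace(-1.0, 1.0, m)                     # assumed grid positions per dimension
    mu = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1).reshape(-1, 3)
    K = m ** 3
    weights = np.full(K, 1.0 / K)                        # alpha_k = 1 / K
    sigma = np.full(K, 1.0 / m)                          # Sigma_k = sigma_k * I with sigma_k = 1 / m
    return weights, mu, sigma
```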

3.1.3 Inception Module

The main building block of the 3DmFV-Net is an adapted version of the Inception module. Szegedy et al. (2015) introduced the Inception module as a ‘network within a network’. The used structure is shown in Fig. 4.

Fig. 4 Internal architecture of the Inception module as used in the 3DmFV-Net (per Ben-Shabat et al. 2018)

The sub-network consists of four 3D-convolution layers, each followed by a batch normalization (Ioffe and Szegedy 2015) and the ReLU activation function \(\sigma _\mathrm {ReLU}(x)=\max (0,x)\). There are three parameters for constructing the module: two kernel sizes, \(c_1\) and \(c_2\), for two of the convolutions, and a value \(N\) that corresponds to one-third of the output channel size. The inputs of the two convolutions with kernel sizes \(c_1\) and \(c_2\) are zero padded so that the outputs retain the same spatial size. With even kernel sizes, the input is padded on only one side of the dimension. The output channel size \(N\) has to be an even number. After all four convolutions, the outputs are concatenated. Due to the initial padding, there is no reduction of the spatial dimensions. To reduce the dimension size, max-pooling layers are employed (see Sect. 3.1.4).

3.1.4 Model Architecture

The model architecture depends on the size of the chosen GMM. Each model comprises at least five Inception layers. For \(m\ge 5\), a max-pooling layer is employed after the third Inception module. Only the model for \(m=16\) uses two more Inception modules compared to the smaller GMMs. After the convolutional layers, the features are passed into a Fully Connected Network (FCN) to reduce the representation of the point cloud to the size of the four classes. Table 4 shows the parameters of each layer for all the used GMM sizes.

Table 4 Architecture of the 3DmFV-Net for different GMM sizes

Three more sizes for the GMM are then tested. They are chosen to be more granular in the \(Z\)-dimension, i.e., more Gaussians are placed along this dimension. This is because trees, especially the two main species in the dataset (Norway spruce, European beech), usually grow more in height than in diameter in densely covered forests. For this approach, the sizes \((3\times 3\times 9)\), \((5\times 5\times 10)\), and \((8\times 8\times 12)\) are chosen. With these values for the \(Z\)-dimension, the model structure, especially with regard to kernel sizes, does not have to be adjusted.

3.2 PointCNN

PointCNN was introduced by Li et al. (2018) and is designed, among other things, for object classification of point clouds. It is a convolution-based approach for learning on irregular data. PointCNN transforms the geometry of point clouds into a latent feature space through its \(\mathcal {X}\)-Conv transformation.

A point cloud is formally described as a set of points lying within a metric space. Each point can contain none, one, or multiple features. In the case of single trees, they include the laser intensities, normals, and multispectral features as described in Sect. 2.

The input to PointCNN is a set of \(N\) points in a \(d\)-dimensional space

$$\begin{aligned} \left\{ p_{i}:p_i\in {\mathbb {R}}^d;~i=1,\dots ,N\right\} . \end{aligned}$$
(11)

Each of these points has features associated with it

$$\begin{aligned} \left\{ f_i:f\in {\mathbb {R}}^{C};~i=1,\dots ,N\right\} \end{aligned}$$
(12)

i.e., a position in a \(C\)-dimensional feature space or, equivalently, a total of \(C\) features. It is also possible that no features are associated with the points, i.e., only the geometry of the point cloud is given. This would mean that \(C=0\), resulting in an empty feature set \(\emptyset\). However, this has no influence on the learning procedure of the \(\mathcal {X}\)-Conv, as described further in Sect. 3.2.1. Together, points and features form the input point cloud \({\mathbb {F}}_1\)

$$\begin{aligned} \begin{aligned} {\mathbb {F}}_{1} = \left\{ \left( p_{1,i}, f_{1,i}\right) : i = 1, 2, \ldots , N_{1};\right. \\ \left. p_{1,i} \in {\mathbb {R}}^{d}; f_{1,i} \in {\mathbb {R}}^{C_1} \right\} . \end{aligned} \end{aligned}$$
(13)

As this point cloud passes through the \(\mathcal {X}\)-Conv  module, the number of channels (i.e., the dimensions of the feature space) may increase. The increased number of feature dimensions no longer represents tangible features, such as laser intensity, but a more abstract representation of small neighborhood patches and their interactions. These representations are learned by the PointCNN. The resulting point cloud is described for \(C_2\ge C_1\) as

$$\begin{aligned} \begin{aligned} {\mathbb {F}}_{2} = \left\{ \left( p_{1,i}, f_{2,i}\right) : i = 1, 2, \ldots , N_{1};\right. \\ \left. p_{1,i} \in {\mathbb {R}}^{d}; f_{2,i} \in {\mathbb {R}}^{C_2} \right\} . \end{aligned} \end{aligned}$$
(14)

A step usually taken when conducting deep learning on spatial data, such as images and point clouds, is the reduction of the spatial domain, thereby enriching the feature space. In the traditional convolutional layers, this happens automatically by choosing the right kernel sizes. Also, almost all convolutional neural networks employ a form of pooling (Goodfellow et al. 2016), which enables the reduction of the spatial size. In this case, the \(\mathcal {X}\)-Conv outputs a point cloud with the same spatial extent and, hence, needs to be down-sampled (see Sect. 3.2.1).

3.2.1 \(\mathcal {X}\)-Conv

Fig. 5 Internal architecture of the \(\mathcal {X}\)-Conv module used in PointCNN

The input to the \(\mathcal {X}\)-Conv operator is an unordered set of points. Moreover, there are four parameters in three groups for the module. The first two parameters are channel sizes for the hidden \(C_H\) and output \(C_O\) layers. The third parameter is the size of the convolution kernel \(k\), which is also the number of neighbors around each point in the \(k\)-Nearest Neighbor (\(k\)-NN) search. The final parameter, a dilation multiplier \(d\), is introduced to broaden the neighborhood of the points.

Starting with a specific point \(p\), the \(k \cdot d\) nearest neighbors are identified by a \(k\)-NN search, e.g., 16 neighbors for \(k = 8\) and \(d = 2\). From these neighbors, \(k\) are randomly chosen, whilst the remaining ones are omitted. If the dilation parameter is one, all points are chosen. This set of points is denoted as \(P_k\), with the corresponding features of these points in \(F_k\).

In the next step, the neighborhood points \(P_k\) are translated into a local coordinate system with the central point \(p\) at the origin \(\left( P_k - p \right)\). This eliminates dependence on absolute coordinates, because only the relative position of the points to each other is relevant.

Subsequently, the shifted points are passed through two different MLPs. The first, \({\text {MLP}}_\delta (\cdot )\), is applied to each point individually to lift them into a latent—and potentially canonical—feature space, which is a more abstract representation, similar to PointNet (Qi et al. 2017). The abstract neighborhood representation \(F_\delta\) has its dimensionality given by the hidden channel parameter \(C_H\). These features are then concatenated with the given features \(F_k\), resulting in \(F_*\), which later will be transformed. The second MLP, labeled as \({\text {MLP}}[k\times k]\) in Fig. 5, learns a transformation matrix of size \(k \times k\) across the whole neighborhood. This transformation matrix \(\mathcal {X}\) permutes and weights the features \(F_*\) by matrix multiplication.

Finally, these transformed features are convolved by a standard one-dimensional convolution. The kernel \(K\) has size \(k\) and is learned during the training of the neural network. The resulting features with output size \(C_O\) replace the previous features of the point cloud, thereby retaining the points’ spatial positions. Mathematically, this result is described in Eq. 14. This can then be passed through further point sampling, pooling, or \(\mathcal {X}\)-Conv layers in the network.
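The following sketch traces one \(\mathcal {X}\)-Conv step for a single representative point in plain NumPy with random, untrained weights; in PointCNN the weights of \({\text {MLP}}_\delta\), of the \(k\times k\) transformation MLP, and of the kernel \(K\) are all learned end-to-end, so the snippet only illustrates the data flow, not the trained operator:

```python
import numpy as np

rng = np.random.default_rng(0)

def x_conv(p, points, feats, k=8, d=2, C_H=16, C_O=32):
    """One X-Conv step around a representative point p (data-flow illustration only)."""
    C_in = feats.shape[1]
    # 1) take the k*d nearest neighbors of p, then keep a random subset of k (dilation d)
    dist = np.linalg.norm(points - p, axis=1)
    candidates = np.argsort(dist)[: k * d]
    idx = rng.choice(candidates, size=k, replace=False)
    P_k, F_k = points[idx], feats[idx]
    # 2) translate the neighborhood into the local coordinate system of p
    P_local = P_k - p
    # 3) MLP_delta lifts each local coordinate into a C_H-dimensional latent space
    W_delta = rng.standard_normal((3, C_H))
    F_delta = np.maximum(P_local @ W_delta, 0.0)                 # (k, C_H), ReLU
    F_star = np.concatenate([F_delta, F_k], axis=1)              # (k, C_H + C_in)
    # 4) a second MLP predicts the k x k transformation matrix X from the whole neighborhood
    W_trans = rng.standard_normal((3 * k, k * k))
    X = (P_local.reshape(1, -1) @ W_trans).reshape(k, k)
    F_transformed = X @ F_star                                   # permuted and weighted features
    # 5) a standard convolution kernel K of size k collapses the neighborhood axis
    K = rng.standard_normal((k, C_H + C_in, C_O))
    return np.einsum("kc,kco->o", F_transformed, K)              # output features of size C_O
```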

All components of the \(\mathcal {X}\)-Conv, such as both MLPs, the convolution \(\mathrm {Conv}(\cdot )\), and the matrix multiplication, are differentiable. Therefore, the whole \(\mathcal {X}\)-Conv is differentiable and suitable for training a DNN with backpropagation.

Finally, iterative farthest point sampling as described by Qi et al. (2017) is applied to reduce the input \(x_1,x_2,\ldots ,x_n\) to a smaller subset. Together with the \(k\)-NN search, this ensures that the receptive field of the following layers can grow.
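Iterative farthest point sampling itself takes only a few lines; in the sketch below, the choice of the first point as the initial seed is an assumption:

```python
import numpy as np

def farthest_point_sampling(points, n_samples):
    """Greedily pick the point that is farthest from all points selected so far."""
    selected = [0]                                  # assumed: start from the first point
    dist = np.full(points.shape[0], np.inf)
    for _ in range(n_samples - 1):
        dist = np.minimum(dist, np.linalg.norm(points - points[selected[-1]], axis=1))
        selected.append(int(np.argmax(dist)))
    return points[selected]
```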

3.2.2 Model Architecture

Li et al. (2018) provided a model zoo for different 2D and 3D benchmark datasets. The architectural design of the used model is based on the one used for the ModelNet40 (Wu et al. 2015) benchmark.

Fig. 6 a PointCNN model architecture with MS as a point feature. b PointCNN model architecture with MS as an object feature

Figure 6a shows the general structure of PointCNN. The input to the model is a point cloud, which in this case is a single-segmented tree, comprising 1024 points in three-dimensional space. Further features can be assigned to each point. These include the laser intensity, normals, and multispectral features, as described in Sect. 2. Any combination of these, or none at all, can be used for the input. If none are chosen, the input represents the geometry of the point cloud. The first two \(\mathcal {X}\)-Conv layers are each followed by a down-sampling layer to reduce the spatial dimensionality. Both subsequent convolution layers maintain the spatial size but increase the number of learned features. After four \(\mathcal {X}\)-Convs, the learned representation of a tree is further reduced via global mean pooling, which takes the mean of each feature across the whole object. Simply put, the whole point cloud is compressed into a single point with no location, which has 384 associated features. This embedding is then passed through three fully connected layers to produce an assignment probability for each of the four classes.

As the multispectral features (MS) are derived per tree, they describe the complete object rather than each point individually. Instead of assigning each point the same features, they are concatenated to the pooled features of the last \(\mathcal {X}\)-Conv layer. These learned features can be seen as descriptors for the whole object and are therefore in line with the MS features. Each point in the input can still have a combination of features attached to it. Here, the available choices are still the laser intensity and the normals. The adapted model structure is depicted in Fig. 6b. The last \(\mathcal {X}\)-Conv outputs five fewer features to compensate for the MS features. Moreover, the first fully connected layer keeps the same input size compared to the aforementioned model.

4 Experimental Setup

4.1 3DmFV-Net

The values of the 3DmFV are dependent on the absolute positions of the points in space rather than their relative positions to each other. Therefore, the point clouds of each single tree are centered around their center of gravity, i.e., the mean of their points. Furthermore, they are scaled to fit into a sphere with radius 1, as described in Sect. 2.3.1. Thus, the trees do not necessarily span the whole range of \([-1, 1]\) in their \(Z\)-dimension. The height of the trees in terms of normalized coordinates is not dependent on the actual height of the tree, but rather where the majority of the returning laser pulses were registered. Therefore, the point cloud of each individual tree was shifted and scaled, thereby spanning from -1 to 1 in the \(Z\)-dimension. This may cause the values in the other two dimensions to not lie within the range \([-1,1]\). The points that exceed this range are still captured in the 3DmFVs, because the GMM extends beyond it (see Sect. 3.1.2).

Moreover, the training set was augmented by inserting three more copies of each tree, rotated by \(90^\circ\), \(180^\circ\), and \(270^\circ\), respectively. This augmentation is necessary because the 3DmFVs depend on the absolute positions of the points: the network is not rotation invariant unless the expected rotations of the objects are also learned.
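A sketch of this augmentation, assuming the rotations are carried out around the vertical (Z) axis and that train_trees is a placeholder list of (1024, 3) arrays:

```python
import numpy as np

def rotate_z(points, angle_deg):
    """Rotate a point cloud around the vertical (Z) axis by angle_deg degrees."""
    a = np.radians(angle_deg)
    R = np.array([[np.cos(a), -np.sin(a), 0.0],
                  [np.sin(a),  np.cos(a), 0.0],
                  [0.0,        0.0,       1.0]])
    return points @ R.T

# three rotated copies per training tree are added to the original samples
augmented = [rotate_z(tree, a) for tree in train_trees for a in (90, 180, 270)]
```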

The experiments involving the 3DmFV-Net differ mainly in the chosen grid size of the GMM. Ben-Shabat et al. (2018) suggest four different grid sizes and their corresponding model architectures, as presented in Sect. 3.1.4. In this paper, these proposed sizes and their corresponding model architectures were employed. We evaluated both uniform (denoted as \(m\times m\times m\)) and non-uniform grids (denoted as \(m\times m\times n\)). Parameter \(m\) is the base size of the grid and \(n\) is the modified grid size in the \(Z\)-dimension, with the condition \(m<n\). In this way, trees are sampled more finely along their height and, thus, their characteristics in this dimension are captured better. For the base sizes of 3, 5, and 8, the number of sections in the third dimension is set to 9, 10, and 12, respectively.

4.2 PointCNN

In contrast to the 3DmFV-Net, PointCNN does not depend on the absolute positions of the point clouds. By design, this network is also capable of integrating more features than only the geometry (see also Sect. 5.2). The experiments for PointCNN used two slightly different architectural designs. The first is a model that uses the geometry of the point clouds with none or a combination of the features (see Fig. 6a). In this approach, the MS features can be assigned to every point in each point cloud, although these features describe the whole object. The second architecture is, for the most part, equivalent to the first one. Instead of using the MS features as point features, they are treated as object features and thus concatenated to the learned representation right before the FCN (see Fig. 6b). For both designs, different combinations of inputs were tested. All configurations use the geometry GEOM of the point cloud together with none or up to three additional features. In total, 16 model configurations were tested.

4.3 Training of the Networks and Hyperparameters

Both networks were trained on the same dataset with exactly the same train/val/test split. The training was carried out on a workstation with 32 GB RAM and an NVIDIA Quadro RTX 4000 with 8 GB GPU memory. The number of samples per batch for PointCNN and 3DmFV-Net was 16 and 64, respectively. In each epoch, the samples for the mini-batches were shuffled. To prevent the models from overfitting, early stopping was applied. If there was no improvement in the validation loss within 20 steps, the learning process was stopped. The model parameters at the epoch with the lowest validation error were considered the best for a generalized model.

The hyperparameters used to train both networks are shown in Table 5.

Table 5 Hyperparameters used in the learning process for PointCNN and 3DmFV

4.4 Accuracy Metrics

To quantify the performance of our models, we used the manually assigned class labels (i.e., true values) and the classified labels to calculate confusion matrices along with values for OA and the \(F_1\)-score using recall and precision.
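These metrics can be reproduced, e.g., with scikit-learn; y_true and y_pred are placeholders for the reference and predicted class labels:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

classes = ["coniferous", "deciduous", "dead tree", "snag"]
cm = confusion_matrix(y_true, y_pred, labels=classes)        # rows: reference, columns: prediction
oa = accuracy_score(y_true, y_pred)                          # overall accuracy (OA)
f1 = f1_score(y_true, y_pred, labels=classes, average=None)  # per-class F1 from precision and recall
```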

5 Experimental Results

5.1 Main Outcomes

In our experiments, we demonstrate the performance of two DNNs, 3DmFV-Net and PointCNN, to classify tree species and dead trees using lidar point clouds captured in BFNP. In terms of OA, PointCNN performs better, with an OA of 87.0%, in comparison to an OA of 75.1% for 3DmFV-Net (see Tables 6 and 7). Interestingly, if PointCNN is applied, additional features calculated from the lidar intensity and the normals do not significantly contribute to the classification results. Instead, the multispectral feature NDVI improves the results most notably. For both networks, we found apparent confusions between spruces and dead trees with crown.

5.2 3DmFV-Net

Table 6 Results of 3DmFV-Net with uniform and non-uniform grids
Fig. 7 Confusion matrices for experiments with 3DmFV-Net using uniform and non-uniform grids. Subfigures a–f refer to uniform and non-uniform grids

Table 6 summarizes the best results of our experiments using 3DmFV-Net. Relevant confusion matrices are provided in Fig. 7. All in all, the \(m=8\) grid performs best with an OA of 73.2%. This grid size also reaches the highest \(F_1\)-scores among all tested 3DmFV-Net configurations for the two ‘living’ classes, coniferous and deciduous. Although these scores are relatively high, the score for the class dead tree is the lowest overall at only 35.2%. For this class, the error matrix in Fig. 7c shows that only 51 of the 139 objects are detected correctly. False negatives mainly occur in coniferous and snags at 38 and 53, respectively. Only 8 snags and 60 coniferous are erroneously classified as dead tree.

5.3 PointCNN

Table 7 Results of the PointCNN
Fig. 8 Confusion matrices of selected PointCNN configurations

The first experiment consists of learning a PointCNN model using only the geometry of the point clouds. With the model structure shown in Fig. 6a, an OA of 69.7% is reached. This is in the same order of magnitude as 3DmFV-Net. Furthermore, this configuration reports the highest \(F_{1}\)-score for the class snag achieved by PointCNN in this study. The confusion matrix (Fig. 8a) shows confusions mainly between dead tree and coniferous. When adding the unnormalized intensities as a feature to each point, the OA decreases significantly to 61.5%. There is an increase of falsely predicted dead trees across all classes compared to the network that uses only the geometry, as shown in Fig. 8b (GEOM_I). This leads to coniferous and dead tree having the lowest recall and precision, respectively. Standardizing the intensities before the learning process results in an OA of 69.5%. This is an increase compared to the unnormalized intensities but not significantly different from the configuration that uses only the geometry. Using the power-transformed intensities \(\mathrm{I}_{s}\), an increase in accuracy can be observed. This input reaches an OA of 71.5%. Compared to the GEOM configuration, the confusion matrix (Fig. 8d) shows an improvement in true positives for dead tree and coniferous, as well as less misclassification between them.

Employing the estimated normals N of the point cloud in the model GEOM_N as the only feature besides the geometry GEOM shows no increase in classification performance. The OA is lower than that using only the GEOM. Significantly more coniferous trees are predicted as dead tree, which also leads to a smaller \(F_{1}\)-score in the coniferous class (see Fig. 8e).

The biggest change in performance can be observed when using the multispectral features MS. When using a copy of these object features as an additional feature to each point (\(\mathrm{MS_{feat}}\)), an OA of 86.0% is reached. The recurring misclassifications between coniferous and dead tree have almost been eliminated (see Fig. 8f). Deciduous trees are detected almost perfectly with an \(F_{1}\)-score of 96.5%, the highest score in this study. As a result of these fewer misclassifications, coniferous and dead trees also show high \(F_{1}\)-scores of 91.0% and 72.3%, respectively.

The second possibility for using MS features is to provide them on an object basis. Instead of using a copy of the MS as a feature, these features are concatenated to those resulting from the \(\mathcal {X}\)-Conv layers. These are then passed through the FCN. This is the case for the second proposed model structure (as shown in Fig. 6b). Similar to the use of MS as point features, the combination of GEOM and \(\mathrm{MS_{FCN}}\) reaches an OA of 85.6%. The \(F_{1}\)-scores are all in the same order of magnitude as in the \(\mathrm{MS_{feat}}\) configuration. The comparison of their confusion matrices (see Fig. 8f, g) shows minor differences. Compared to \(\mathrm{GEOM\_MS_{feat}}\), \(\mathrm{GEOM\_MS_{FCN}}\) shows fewer true positives for coniferous but more in all other classes. There are 17 coniferous trees classified as deciduous, producing lower \(F_{1}\)-scores compared to \(\mathrm{MS_{feat}}\). Although the OA is lower than for the \(\mathrm{GEOM\_MS_{feat}}\) configuration, all tests confirm that using the MS as an object feature in combination with other features yields better classification results.

Further tests include input data with two types of features. Combining the laser intensities with the MS results in higher accuracy than using only the intensities as a single feature. The combination of power-transformed intensities \(\mathrm{I}_{s}\) and \(\mathrm{MS_{FCN}}\) produces the highest OA among all configurations at 87.0%. This input configuration also results in the highest \(F_{1}\)-score for coniferous (91.5%) and dead tree (76.5%) among the PointCNN experiments. The confusion matrix (see Fig. 8i) shows 236 correctly identified coniferous trees and little confusion in the other classes. This is especially true for dead tree. Moreover, the confusion between coniferous and dead tree, which is the most prominent confusion in the experiments, is relatively small (13 and 12 misclassifications, respectively). With a total of 44 misclassifications, the most falsely predicted trees result from the snag class and are assigned to dead tree. Compared to the best configuration, the combination of \(\mathrm{I}_{z}\) and \(\mathrm{MS_{FCN}}\) performs slightly worse in terms of OA (= 85.4%) and relevant \(F_{1}\)-scores (see also confusion matrices in Fig. 8h, i).

The remaining results, shown in Table 7 (\(\mathrm{GEOM\_N\_I}_{s}\), \(\mathrm{GEOM\_N\_I}_{s}\_\mathrm{MS_{feat}}\), \(\mathrm{GEOM\_N\_I}_{s}\_\mathrm{MS_{FCN}}\)), demonstrate that normalizing the intensities generally shows an increase in accuracy, whilst the addition of the normals impairs the results. The biggest change in classification performance results from adding MS features; this produces a significant increase in OA and distinction between classes.

6 Discussion

6.1 3DmFV-Net

As shown in Sect. 3.1, the 3DmFV-Net uses a hybrid approach. In the experiments, the main difference is the choice of grid size for the GMM, as the design of the network only allows the use of the point clouds’ geometry.

The smallest grid has a base length of three and performs well in detecting deciduous trees, but struggles to keep the other three classes apart. When the base size increases to five, the OA decreases slightly. When the base size increases to eight, the OA and all \(F_1\)-scores reach the highest values in this set of experiments (see Table 6). This could suggest that a finer grid—thus a larger base size—will increase accuracy. However, the non-uniform grids with \(Z\)-dimension sizes of 10 and 12 show that this is not the case. Rather, a grid size of eight performs best in terms of OA and \(F_1\)-scores.

In conclusion, a grid of \(8\times 8\times 8\) with data augmentation in the training set is optimal. However, this network could still be improved by more extensive hyperparameter tuning, the introduction of other symmetric functions of the 3DmFV, or more sophisticated data augmentation. The network could also be tuned by integrating other features in the network design, such as multispectral information, which yielded a major improvement in classification accuracy for PointCNN.

6.2 PointCNN

When using the different inputs (e.g., laser intensities, normals, and multispectral features), some different effects on classification can be observed. The addition of the intensities as features diminishes classification performance significantly (OA of 61.5% compared to 69.7% with only the geometry; see Table 7). This is influenced by the power law distribution of the values. As expected, after power-transforming and standardizing, the intensities no longer reduce classification accuracy. Interestingly, the OA also does not increase in a significant way. This indicates that the network cannot derive any meaningful information from the laser intensities. The sole standardization of the intensities seems to be sufficient.

The normals of the points are another feature that can be used as an input in combination with the geometry of the point cloud. The results of the experiments show that the normals usually do not improve the classification significantly. This is likely due to the rough and irregular surfaces of each tree’s point cloud. The best performing feature combination \(\mathrm{GEOM\_I}_{s}\_\mathrm{MS_{FCN}}\) does not include the normals. Future experiments can therefore leave these out and reduce computational effort.

The third feature is a composite of the descriptive statistics derived from the multispectral orthophotos. Adding these to the network input, either as point-based features or object-based features, improves the classification results significantly. In combination with the geometry, the OA reaches 86.0% with a nearly perfect \(F_1\)-score of 96.5% for deciduous trees (see \(\mathrm{GEOM\_MS_{feat}}\) in Table 7). These results are achieved by attaching a copy of the object features to the feature ensemble of each point, which carries redundant information. However, the results of the other experiments show that using the MS features in the FCN generally performs slightly better than the point-feature approach. The best-performing input configuration \(\mathrm{GEOM\_I}_{s}\mathrm{\_MS_{FCN}}\) also uses this model architecture. The sole use of the MS in combination with the geometry already performs nearly as well as the best model. Unexpectedly, using intensities only marginally increases classification accuracy, whilst the use of normals results in a small decrease.

6.3 Misclassifications

Both networks, PointCNN and 3DmFV-Net, show similar patterns in their classification results—especially visible in the confusion matrices—when not using the MS features. Generally, the deciduous class was classified very consistently with high \(F_1\)-scores \(\left( > 80\%\right)\), and thus was well distinguished from the other classes. Most misclassifications occur through falsely assigning coniferous trees to the dead tree class, and similarly vice versa, independent of the used network (cf. Figs. 7 and 8). These false negatives were most apparent in the 3DmFV-Net configurations but also in the PointCNN when the geometry was combined with the unnormalized laser intensities (see GEOM_I in Fig. 8b).

The BFNP is infested with bark beetles, which mainly kill spruces, the main species in the coniferous class. These dying spruces constituted the majority—if not all—of the samples in the dead tree class. Because they still have their crown, they are geometrically very similar to their living conspecifics. This often leads to misclassification between these two classes. The MS features, derived from the NDVI, mostly eliminate these erroneous classifications, however. Hence, the difference between living and dead trees is best described through these features. In other words, multispectral information captured by an optical sensor is necessary to gain better discrimination between dead and living spruces.

Another pattern in the aforementioned classifications and their error matrices was confusion between dead tree and snag. When a tree is dying, e.g., because of the bark beetle infestation, it gradually loses its crown and branches. It is considered a snag once it has lost a substantial amount of its branches. The distinction between dead tree and snag in the present study was based on the subjective perceptions of three research assistants (Briechle et al. 2021). This may explain the observed difficulties in distinguishing the two classes in the two deep learning networks; false positives were assigned to both classes. Furthermore, as the confusion matrices show, this effect could not be mitigated completely by the strategic use of the input features or grid sizes. However, some model configurations show better distinction in one direction. The 3DmFV-Net \(8\times 8\times 8\) configuration (see Table 6) exhibited significantly less mislabeling from dead tree to snag compared to the other configurations of this network. Similarly, the PointCNN configuration \(\mathrm{GEOM\_I}_{z}\) showed fewer misclassifications from snag to dead tree. For future studies, either further features should be crafted to provide a measure for this distinction, or a more objective way of labeling the training data should be developed. For future applications, it is also possible to apply a threshold to the classification certainties: tree samples with low certainty values for every class could then be flagged for manual assessment.

6.4 Comparison to Related Work

We compare our results with the study of Briechle et al. (2021), which was not only conducted in the same study area but also used the exact same dataset; therefore, a one-to-one comparison can be made here. PointNet++ was used as a baseline method for comparison with their proposed network. Similar to PointCNN, this DNN is also point-based. Considering only the geometric information from the point clouds of the single trees, PointNet++ reached an OA of 85.7% with \(F_1\)-scores of 72% or higher. PointCNN reached an OA of 69.7% with low \(F_1\)-scores in the coniferous and dead tree classes. 3DmFV-Net also did not perform as well, with a maximum OA of 73.2% using solely geometric information. When further features were taken into account, i.e., laser intensity and multispectral features, PointCNN gained a rather large performance increase, reaching 87.0% (+17.3% points) in classification accuracy. Such an increase in performance does not occur with PointNet++, which gains 3.6% points for an OA of 89.3%. PointNet++ additionally shows higher \(F_1\)-scores throughout. However, both networks perform in the same range. This suggests that PointNet++ performs better on pure geometry data than PointCNN in this kind of real-world application. Furthermore, PointCNN seems to rely on additional input features to reach results on par with PointNet++.

In their work, Briechle et al. (2021) also proposed their own DNN. Silvi-Net uses a multi-view-based approach projecting the point clouds of individual trees onto 12 image planes. Moreover, image patches from the aerial orthophoto are created using the enclosing polygons of the trees. Employing only geometric features from the point clouds, this network was able to reach an OA of 84.7%, which is significantly higher than the 69.7% achieved by PointCNN and the 73.2% of 3DmFV-Net. Similar to PointCNN, Silvi-Net experiences a decrease (\(-\)1.2% points) in classification accuracy when the additional—not power-transformed—laser intensities are used in the generated images. Interestingly, the same misclassification patterns as discussed in Sect. 6.3 can also be observed with Silvi-Net. However, the network configuration using geometry, laser intensities, and multispectral images attained the highest classification performance. The OA was 91.5%, which is higher than both PointCNN (87.0%) and the baseline PointNet++ (89.3%). This suggests that multi-view approaches could perform better than point-based DNNs in tree species classification.

7 Conclusions

In this paper, the two DNNs PointCNN and 3DmFV-Net were investigated for tree species classification using airborne lidar data. We demonstrated that both architectures are capable of classifying living and dead trees simultaneously. 3DmFV-Net is limited to the geometry of the single trees, whereas PointCNN offers the integration of further features by design. The geometric information from the point clouds is sufficient for distinguishing between coniferous and deciduous trees. The synergetic use of power-transformed intensities and multispectral information improved classification accuracy significantly. This effect was not observed with unnormalized or merely standardized intensities (cf. Table 7). Including the point normals as additional features did not lead to improved results. However, the features derived from a multispectral orthophoto proved to be crucial for differentiating between dead trees and their living counterparts. PointCNN even reached an OA of 87.0%, which can be considered operational for practical applications. Both networks are possibly sensitive to the point count of each tree and to the point density of the lidar acquisition. In further studies, the impact of the point density on the accuracy performance could be evaluated. To this end, a second dataset would also be valuable to show the transferability of this study to other forest areas and tree classes.