1 Introduction

Reliable object manipulation procedures are a fundamental prerequisite for the robotic handling of parts in disassembly and remanufacturing. The literature on grasping and manipulation includes methods based on properties of the objects like their appearance and geometry [1, 2], or their dynamics [3]. Regardless of the method used, the shape of the target object usually needs to be estimated.

Object shape can be estimated from 2D camera images or 3D point clouds. Due to the variable appearance and state of used parts, and the loss of structural information, methods based on 2D images often fail to obtain acceptable results in remanufacturing applications [4, 5]. For this reason, 3D models are often preferred.

Thanks to the increasing availability of reliable and affordable sensors, 3D scans have nowadays become easily obtainable from the field. However, point clouds pose their own challenges due to their lack of topological structure, and the large amount of information they carry (usually millions of data points).

This study aims to investigate the ability of the PointNet deep neural network (DNN) [6] to recognise primitive shapes in point cloud models of everyday objects, after being trained on computer-generated geometric primitives. PointNet directly takes the elements of the point cloud as input, and is able to recognise objects irrespective of their position and orientation. PointNet can also be trained to segment parts and sub-assemblies from the point cloud scene. These features make it an ideal candidate for object recognition in a highly unstructured domain such as the disassembly and manipulation of end-of-life products. They also distinguish PointNet from standard DNNs, particularly those based on convolutional layers, since the latter require structured input data, and their internal representation of the input is generally not rotation invariant [6].

Zheng et al. [7] used CAD-generated models to train PointNet to identify complex mechanical parts for disassembly applications. Experimental evidence indicated the viability of the proposed approach, although the accuracy of the trained DNN was validated on artificial scenes created via a depth-camera simulator. This study aims to evaluate the ability of PointNet to recognise object shapes from real scans of objects, after being trained on artificial geometric models. The focus of this study is also on the abstraction of primitive shape information, rather than the recognition of detail-rich objects like car turbocharger components [7].

The research has direct application to many engineering problems beyond the disassembly and remanufacturing domain, since mechanical objects often have fairly regular shapes, which can be approximated with geometric primitives such as spheres, boxes, and cylinders (e.g. the cylindrical head of a piston, the spheres of a rolling bearing) [8, 9].

This study also aims to compare the performance of PointNet to that of simpler classifiers like shallow neural networks (SNNs). Provided a simple method is available to extract a structured and meaningful representation of the point cloud scenes, usually a set of features, SNNs are attractive for their ease of training and low computational overheads. In this study, the performance of PointNet was compared to the performance of two popular shallow neural networks: a multi-layer perceptron (MLP) [10] and a radial basis function network (RBFN) [11].

The main difficulty in the recognition task comes from the fact that the shape of the scans is often not perfectly regular. Sensor imprecision and occlusion (partial view) further complicate the problem. In this study, the performance of PointNet was tested on real scans of common objects from the Yale-CMU-Berkeley (YCB) benchmark set [12], a popular robotics benchmark.

The use of real scans of physical objects constitutes a more realistic setting than the CAD-generated images used in the tests performed by the creators of PointNet [6], or the simulated scans used by Zheng et al. [7]. The fact that PointNet had not been evaluated on real-life point cloud model sets was first pointed out by Garcia-Garcia et al. [13], and later acknowledged by Uy et al. [14] who manually built the ScanObjectNN set. ScanObjectNN contains camera scans of physical objects grouped in categories modelled on the popular ModelNet40 benchmark set of CAD models [15]. In their study, Uy et al. [14] reported very poor classification accuracy (32.2%) when PointNet was trained using ModelNet40 and tested on the ScanObjectNN set.

The objects used in this study have a more regular shape than those featured in the ModelNet40 and ScanObjectNN sets. PointNet will be trained using a set of geometric primitive shapes, and then used to recognise similar shapes from real-life scenes. The obvious advantage of this arrangement is the possibility of generating an arbitrarily large model set for training the DNN, removing the need of acquiring a database of object scans.

This paper is organised as follows. Related work is discussed in Sect. 2, whilst the PointNet deep architecture is described in Sect. 3. Section 4 describes the SNN architectures, and presents the extraction scheme generating the features they use. The model sets used in the experiments are described in Sect. 5. The experimental setup and results are reported in Sect. 6, whilst the outcomes of the tests are discussed in Sect. 7. Section 8 concludes the paper.

2 Related work

Deep neural networks have gained wide popularity for 2D machine vision applications, thanks to their high accuracy and feature extraction ability [16]. In recent years, DNN-based vision technology found increasing application in the fields of manufacturing [17,18,19] and remanufacturing [20,21,22,23,24,25].

In detail, Yildiz and Wörgötter [24, 25] developed a screw detection and classification system based on a deep convolutional neural network, and demonstrated its accuracy in a hard disk drive disassembly case study. The creation of the training set of examples for the DNN entailed a large effort, where 20,000 sample images of 500 screw elements were collected from 50 hard disk drives. Foo et al. [21] used deep learning for screw detection in an LCD monitor disassembly application. The system used an image preprocessing procedure, an ontology reasoning module, and a Fast-RCNN network [26]. The training dataset was built combining numerous images of screws acquired via an extensive Google search, plus 356 manually acquired images. A total of 1496 bounding boxes around the screw samples had to be manually labelled in the images. Li et al. [22] used a fast region-convolution neural network to detect screws on motherboards of mobile phones for disassembly. The training procedure needed the manual acquisition of 488 images.

Brogan et al. [20] proposed a vision system based on the Tiny YOLO v2 (Tiny-You Only Look Once v2) pretrained DNN architecture, to identify screws on electrical waste for disassembly. The system achieved over 92% recognition accuracy using 900 manually collected training images. A YOLO (v4) architecture was used also by Rehnholm [23] to build the vision system for a battery package disassembly application. The training procedure required the creation of nearly 25,000 images in total for training and validation.

In summary, although DNNs generally achieve good recognition accuracies, they require a large dataset of individually labelled images. Moreover, commonly used convolutional neural network (CNN) architectures can only process structured data such as 2D images.

With the development of 3D sensors like RADAR (radio detection and ranging), LiDAR (light detection and ranging), and RGB-D (red, green, blue, and depth channels) cameras, 3D data can now be easily obtained in the field. Typically, the raw data takes the form of a point cloud, an unordered set of data points (XYZ coordinates) delineating the surfaces of the scanned object. Although depth information adds valuable context for the identification task, the lack of structure and the uneven distribution of the data points constitute a challenge for the recognition algorithm.

In particular, the unstructured nature of point cloud models cannot be handled by convolutional architectures. For this reason, four main DNN approaches can be identified in the literature. Three of these methods use standard DNN architectures, often including convolutional layers, and feed these architectures with point cloud representations where the information is structured via volumetric [15, 27], multi-view [28], or graph-/tree-based methods [29,30,31]. The fourth method directly processes the raw point cloud via purpose-designed DNNs like the PointNet architecture [6] used in this work. For a more detailed discussion of deep learning methods for point cloud understanding, the reader is referred to a recently published survey by Guo et al. [32].

PointNet [6] was the first DNN architecture able to process point cloud scenes directly. PointNet can be used to perform shape identification or segmentation, and is able to recognise objects regardless of their rotation and translation. Because it processes the raw point cloud directly, PointNet does not require computationally intensive pre-processing steps, which may also cause information loss. These features immediately made PointNet very popular for the recognition of real-life scenes, and spawned several similar architectures [33,34,35].

Zheng et al. [7] used PointNet to recognise components of two different types of turbochargers for disassembly purposes. The DNN was trained using CAD models of the automotive parts, and tested on point clouds generated using a depth camera simulator. The simulator allowed replicating various degrees of sensor imprecision and partial occlusion of the objects. PointNet achieved classification accuracies above 90%, although its performance degraded with the addition of simulated sensor imprecision to the model test set. Zheng et al. [7] showed that the effect of sensor imprecision could be counteracted by adding comparable noise to the training data. The method has not yet been tested on real-life scans, where the level and distribution of the sensor error are not known.

3 The PointNet deep neural network

PointNet was proposed by Qi et al. [6] for the classification and segmentation of point cloud models. It is a DNN composed of multiple neural layers, as shown in Fig. 1, and can be divided into three key modules.

Fig. 1
figure 1

Structure of the classification part of PointNet

The first module is designed to map the input space to a higher-dimensional representation (embedding space), and makes the procedure invariant to rigid transformations of the object’s pose. Differently from the Spatial Transformer proposed by Jaderberg et al. [36], a mini-network (T-Net) is used in PointNet [6]. The T-Net takes all the points from the point cloud as input, and predicts the affine transformation matrix that aligns the object to a canonical space before feature extraction. Additionally, another T-Net (‘feature transform’ in Fig. 1) is used to further align the embedding space.

The second module is the feature extraction module: it is composed of a set of MLPs and a max pooling function [6]. The MLPs are used as feature detectors that are applied to the higher-dimensional embedding space, whilst the max pooling layer is used to aggregate the feature detection result. The overall action of the first two modules is to transform the input information into a feature set. That is, together they implement a symmetric function that maps the spatial information in the point cloud to the feature space, irrespective of the object pose.

The third module of PointNet is a fully connected layer that takes the feature information and generates the identification result.

In summary, when a point cloud consisting of \(n\) points is fed to PointNet, the coordinates of all its points are mapped into the feature space through the first and second modules of the network. The third module of the network is a standard classifier that takes the features extracted in the previous layers, and outputs the classification score for the input scene.
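For illustration only, the sketch below reproduces this three-module structure in PyTorch with the three shape classes used in this study; it is not the original TensorFlow implementation used in the experiments (Sect. 6), and the T-Nets and the regularisation layers of the classifier are omitted for brevity.

```python
# Minimal PyTorch sketch of the PointNet classification pipeline described above.
# Illustrative only: T-Nets and classifier regularisation (dropout) are omitted,
# and the experiments in this paper use the original TensorFlow code by Qi et al.
import torch
import torch.nn as nn

class PointNetSketch(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        # Module 2: shared per-point MLPs implemented as 1x1 1D convolutions
        self.point_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU(),
        )
        # Module 3: fully connected classifier acting on the global feature vector
        self.classifier = nn.Sequential(
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, points):            # points: (batch, n_points, 3)
        x = points.transpose(1, 2)        # -> (batch, 3, n_points)
        x = self.point_mlp(x)             # per-point features: (batch, 1024, n_points)
        x = torch.max(x, dim=2).values    # symmetric max pooling -> (batch, 1024)
        return self.classifier(x)         # classification scores: (batch, num_classes)
```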

4 Shallow neural network architectures

Shallow neural networks (SNNs) have a longer history than DNNs. Compared to DNNs, SNNs are known to be faster to train, and are less likely to overfit the training data because they use a much smaller number of parameters (weights). Their main limitation is that they need a pre-processing step to extract the vector of input features (variables). In DNNs, feature extraction is performed by the first layers of the architecture, and is optimised by the learning procedure together with the classifier proper (the last layers of the architecture). Nonetheless, when fed with a descriptive set of features, SNNs are known to reach accuracies comparable to those obtained by DNNs [37] in point cloud classification problems.

In this study, the performance of two classical SNN models will be compared to the performance of PointNet. The first SNN is the widely used multi-layer perceptron (MLP). The MLP is a popular and versatile feed-forward neural network used for classification and modelling problems [38]. It is usually trained using the back-propagation learning algorithm, first proposed by Rumelhart et al. [39]. The versatility of the MLP stems from its ability to approximate any function to any desired degree of accuracy [40].

The second is the radial basis function network (RBFN), first proposed by Broomhead and Lowe [11]. Like the MLP, the RBFN is a popular feed-forward neural network used for modelling and classification problems [38]. The RBFN has a strictly defined architecture, featuring one input layer, one hidden layer, and one output layer. The activation function of the hidden layer is the radial basis function. The input layer of an RBFN acts as a buffer, and broadcasts the input vector to each neuron in the hidden layer. The hidden neurons process the input vector via the activation function (radial basis function), whilst the output neurons perform a linear summation of the weighted outputs of the hidden neurons.
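As an illustration of this architecture, the following numpy sketch computes the forward pass of an RBFN with Gaussian basis functions; all names and shapes are illustrative and do not correspond to the networks trained in this study.

```python
# Illustrative numpy sketch of an RBFN forward pass with Gaussian basis functions.
import numpy as np

def rbfn_forward(x, centres, widths, weights, biases):
    """x: (n_features,) input vector; centres: (n_hidden, n_features);
    widths: (n_hidden,); weights: (n_classes, n_hidden); biases: (n_classes,)."""
    # Hidden layer: Gaussian radial basis function of the distance to each centre
    dists = np.linalg.norm(centres - x, axis=1)
    hidden = np.exp(-(dists ** 2) / (2.0 * widths ** 2))
    # Output layer: linear summation of the weighted hidden activations
    return weights @ hidden + biases
```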

4.1 Numerical feature generation

A point cloud is a data structure containing an unordered list of x-y-z coordinates. Differently from PointNet, the MLP and RBFN treat the input as a vector, and are thus sensitive to the ordering of its elements. Therefore, point clouds cannot be fed directly to the MLP or RBFN. In this section, a feature generation scheme to extract numerical features from the point clouds is described.

In the tests, it is assumed that the objects are in an unknown orientation. Thus, the extraction process starts with aligning the shapes with the coordinate axes. For this purpose, principal component analysis [41, 42] was used to extract the eigenvectors of the point cloud. The point cloud is then placed with its centroid at the origin, and its eigenvectors are aligned with the axes of the Cartesian coordinate frame. Since the eigenvectors broadly correspond to the main axes of the shapes, this method was found to align the point cloud with reasonable accuracy.

In real life, a human eye can recognise a primitive shape from its orthogonal projections onto the three planes \(x=0\), \(y=0\), and \(z=0\). Namely, a cube with its sides aligned with the Cartesian axes will create three rectangular shapes (one per plane), a sphere will generate three disks, and a cylinder will create two rectangular shapes and one disk. In summary, recognising the three 3D primitive shapes boils down to recognising two 2D shapes (rectangle and disk) in their projections. This idea is exploited as follows.

After principal component analysis alignment, the points in the cloud are projected onto the three planes \(x=0\), \(y=0\), and \(z=0\). For example, the projection of a point of coordinates \(p=(x,y,z)\) onto the \(z=0\) plane is \(p'=(x,y,0)\). For each 2D projection on a plane:

  1. 1.

    The coordinates of all the points are transformed into 2D polar-coordinates: \((r, \theta )\).

  2. 2.

    The plane is divided into 64 sectors (intervals of \(\theta\)).

  3. 3.

    For each of the 64 sectors, the radius r of the most distant point from the origin is taken as the representative of the interval. Namely, the representative of sector \(j\) (\(1 \le j \le 64\)) is \(m_j=\max _{i \in S_j}(r_i)\), where \(S_j \subseteq \{1, \ldots , N\}\) is the set of indices of the points whose angle \(\theta\) falls within sector \(j\), and N is the number of points in the cloud.

  4. 4.

    The arithmetic mean and standard deviation of the 64 representatives \(m_j\) are calculated for each plane.

The features extracted from one point cloud form a 6-dimensional vector:

$${(m_x, \, \delta _x, \, m_y, \, \delta _y, \, m_z, \, \delta _z)}$$

where \({m_k}\) and \({\delta _k}\) (\(k=x,y,z\)) are respectively the mean and standard deviation of the 64 representatives on the planes \(x=0\), \(y=0\), and \(z=0\). In a perfect point cloud without error, all the representatives of a disk lie at the same distance from the origin; hence, for a sphere, \(m_x=m_y=m_z\) and \(\delta _x=\delta _y=\delta _z=0\). Conversely, the representatives of a rectangle do not lie at the same distance from the origin, so \(\delta _{x,y,z} \ne 0\) and, in general, \(m_x \ne m_y \ne m_z\). This holds as long as the error level is reasonable, that is, as long as it does not completely blur the shapes of the projections, or when the model has been cleaned of sensor error.
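As an illustration, the numpy sketch below implements the feature generation scheme described above; the function name and the handling of empty sectors are illustrative assumptions, not details of the code used in the experiments.

```python
# Sketch of the feature generation scheme of Sect. 4.1 (numpy only).
import numpy as np

def extract_features(points, n_sectors=64):
    """points: (N, 3) array of x-y-z coordinates -> 6-dimensional feature vector."""
    # Align the cloud: centre on the centroid, rotate the principal axes onto the coordinate axes
    centred = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    aligned = centred @ vt.T

    features = []
    for drop_axis in (0, 1, 2):                      # project onto the planes x=0, y=0, z=0
        uv = np.delete(aligned, drop_axis, axis=1)   # 2D projection, shape (N, 2)
        r = np.linalg.norm(uv, axis=1)               # polar radius
        theta = np.arctan2(uv[:, 1], uv[:, 0])       # polar angle in (-pi, pi]
        sector = ((theta + np.pi) / (2 * np.pi) * n_sectors).astype(int) % n_sectors
        # Representative of each sector: radius of its most distant point (empty sectors skipped)
        reps = [r[sector == j].max() for j in range(n_sectors) if np.any(sector == j)]
        features += [np.mean(reps), np.std(reps)]
    return np.array(features)                        # (m_x, d_x, m_y, d_y, m_z, d_z)
```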

5 Model sets

The goal of this study was to evaluate the ability of PointNet to recognise primitive shapes in point clouds generated from scans of real objects, after being trained on sample point clouds of geometric shapes. All the point clouds were normalised before being fed into the PointNet: the normalisation procedure shrank or enlarged the shape without deformation to fit it into a box of side 1. The procedure described in Sect. 4.1 was run to extract numerical features from the scenes for the SNNs.

At present, there is no specific model set in the literature for benchmarking the accuracy of primitive shape classifiers on real-life scenes. For this study, a popular benchmark of 3D models of real-life objects was used: the YCB model set [12]. This model set was originally created and mainly used for robotic manipulation, rather than for classification purposes. A subset of twenty-eight models from the YCB set was retained, containing objects of three basic primitive shapes: boxes, cylinders, and spheres. This subset was used to generate a test set of examples that will be henceforth called the YCB-28 model set. In all the experiments, the YCB-28 model set was employed to validate the learning accuracy of the trained neural networks. The methodology followed to create the YCB-28 model set is detailed in Sect. 5.1. To train the classifiers, two artificial model sets were used. They are presented in Sects. 5.2 and 5.3. The normalisation procedure applied to the point clouds before they are fed into PointNet is described in Sect. 5.4.

5.1 The Yale-CMU-Berkeley (YCB) object and model set

The Yale-CMU-Berkeley object and model set was created by Calli et al. [12] for research in robotic manipulation. Calli et al. [12] used two series of depth cameras (BigBIRD Object Scanning Rig and Google Scanners) to capture point clouds of several real-life objects from multiple viewing angles. The point clouds captured from each object were merged and de-noised to create mesh models, using the truncated signed distance function method [43] and Poisson reconstruction [44]. Only one mesh model was created for each object [12]. The mesh models were used for the experiments presented in this paper.

Differently from large classification sets like ModelNet40 [15], which contains 12,311 items from 40 different categories, the YCB set contains point clouds and mesh models from only 77 daily-life objects. These objects were broadly classified by Calli et al. [12] into five main categories: food items, kitchen items, tool items, shape items, and task items. In this study, the objects were grouped by their shape, and samples of appearance reasonably close to boxes, cylinders, and spheres were picked. The twenty-eight selected samples included nine models of box-shaped objects, eight models of cylinder-shaped objects, and eleven models of sphere-shaped objects.

The names and IDs (progressive identification number in the YCB set) of the twenty-eight selected models are reported in Table 1, and their pictures and mesh models are shown in Figs. 2, 3, and 4. Despite the de-noising and reconstruction, the mesh models still contain a certain level of sensor error, which is visible as irregularities such as bumps and hollows in the figures.

Table 1 IDs and names of the selected twenty-eight objects from the YCB set
Fig. 2
figure 2

The nine selected box-like objects (images and meshes) from YCB set

Fig. 3
figure 3

The eight selected cylinder-like objects (images and meshes) from YCB set

Fig. 4
figure 4

The eleven selected sphere-like objects (images and meshes) from YCB set

The following three-step procedure was used to generate the YCB-28 model set out of the twenty-eight selected object scans. The first step was to randomly sample with uniform probability 1,000,000 points from the mesh model of each object. This initial large point set was called the point pool. The second step was to create 20 point clouds by randomly sampling 1000 out of the 1,000,000 points from the point pool. Each of the 20 point clouds created from one object model contained a different sample of points. Given that the sampling rate was 1/1000, it is reasonable to expect that any two of the 20 point clouds had very few sampled points in common. Finally, in the third and last step, each point cloud was centred on the origin and the shape randomly rotated (roll-pitch-yaw rotation). In detail, the YCB-28 model set contained the following point clouds:

  • Box: 9 objects × 20 point clouds = 180 point clouds

  • Cylinder: 8 objects × 20 point clouds = 160 point clouds

  • Sphere: 11 objects × 20 point clouds = 220 point clouds

  • Total: 560 point clouds

In summary, the YCB-28 model set contains 560 point clouds sampled from twenty-eight mesh models generated from real scenes, and was created for final performance testing.
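The three-step procedure can be sketched as follows using the Open3D library and numpy; the mesh file path and the random seed are placeholders.

```python
# Sketch of the three-step YCB-28 generation procedure (Open3D + numpy).
import numpy as np
import open3d as o3d

rng = np.random.default_rng(0)
mesh = o3d.io.read_triangle_mesh("ycb/003_cracker_box/textured.obj")  # placeholder path

# Step 1: point pool of 1,000,000 points sampled uniformly from the mesh surface
pool = np.asarray(mesh.sample_points_uniformly(1_000_000).points)

clouds = []
for _ in range(20):
    # Step 2: sample 1000 points (without replacement) from the pool
    pc = pool[rng.choice(len(pool), size=1000, replace=False)]
    # Step 3: centre on the origin and apply a random roll-pitch-yaw rotation
    pc = pc - pc.mean(axis=0)
    roll, pitch, yaw = rng.uniform(0.0, 2 * np.pi, size=3)
    R = o3d.geometry.get_rotation_matrix_from_xyz((roll, pitch, yaw))
    clouds.append(pc @ R.T)
```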

5.2 The artificial primitive shapes (APS) sets

The APS set was used to train the neural network classifiers. This model set was originally created by Baronti et al. [45] for research on primitive shape fitting. The shape generation software can be downloaded from Baronti’s GitHub repository (Footnote 1).

The APS set contains point cloud models of the following three geometric shapes: box, cylinder, and sphere. Baronti et al. [45] created 591 different artificial shapes by changing the height (H), width (W), breadth (B), and diameter (D) of the three geometric shapes. The APS set was created from a full-factorial combination of the parameters defining each shape. The height, width, and breadth of the shapes were incremented from 1 to 10 in steps of 1 unit. The diameter of the base of the cylinders was incremented from 0.5 to 5 in steps of 0.25 units. The diameter of the spheres was incremented from 1 to 10 in steps of 0.05 units. In summary (a short sketch reproducing this parameter grid is given after the list):

  • Box: 220 point clouds with \(H, W, B \in \{1, 2, \ldots , 10\}\) (\(H \ge W \ge B\))

  • Cylinder: 190 point clouds with \(D \in \{0.5, 0.75, \ldots , 5\}\) and \(H \in \{1, 2, \ldots , 10\}\)

  • Sphere: 181 point clouds with \(D \in \{1, 1.05, 1.10, \ldots , 10\}\)

  • Total: 591 point clouds
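The sketch below reproduces the full-factorial parameter grid described above; it only enumerates the (H, W, B, D) combinations, and does not replace Baronti’s point-generation software.

```python
# Sketch reproducing the full-factorial (H, W, B, D) grid of the APS set.
import itertools
import numpy as np

boxes = [(h, w, b) for h, w, b in itertools.product(range(1, 11), repeat=3) if h >= w >= b]
cylinders = list(itertools.product(np.arange(0.5, 5.01, 0.25), range(1, 11)))   # (D, H)
spheres = list(np.arange(1.0, 10.001, 0.05))                                    # D

print(len(boxes), len(cylinders), len(spheres))   # 220, 190, 181 shape specifications
```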

Given that the point clouds represent artificial objects, and are normalised before being fed to PointNet, the unit of measurement of the H, W, B, and D parameters was not specified. It should also be noted that size information is not relevant to determining the shape of an object. All the shapes were placed with their centres at the origin of a Cartesian coordinate frame, and randomly oriented. The point clouds of the APS model set represent perfect shapes, since they are not corrupted by any sensor error. They are also complete, unlike real-life scans such as those of the YCB-28 set, where at least the base of the object is missing because it cannot be reached by the scanners. This set of perfect shapes will be henceforth called APS-clean.

Baronti’s software also allows injecting error (local imprecision simulating sensor inaccuracy) into the point clouds, as shown in Fig. 5. For details of the procedure used to inject error into the scenes, the reader is referred to [45]. A new model set was created by duplicating the elements of the APS-clean set, normalising them within a bounding box of side 1, and perturbing the position of the points of each element by an amount randomly drawn with uniform probability from the interval \([-0.025,+0.025]\). This new model set will be henceforth called APS-error.
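A minimal sketch of the error-injection step is given below; the normalise function stands for the procedure of Sect. 5.4 and is assumed to be available.

```python
# Sketch of the error-injection step used to build APS-error from APS-clean.
import numpy as np

def inject_error(points, rng, amplitude=0.025):
    """points: (N, 3) normalised point cloud; returns a locally perturbed copy."""
    noise = rng.uniform(-amplitude, +amplitude, size=points.shape)
    return points + noise

rng = np.random.default_rng(0)
# aps_error = [inject_error(normalise(pc), rng) for pc in aps_clean]
```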

Fig. 5
figure 5

Example of a generated box-shaped point cloud; from left to right: point cloud without any error, and point cloud with error [45]

Finally, a validation set containing 200 point clouds for each primitive shape was created. These shapes had random dimensions (H, W, B, D) and contained no sensor error. Henceforth, this set will be called APS-clean-val.

In summary, three model sets were created for the experiments from the original APS set: APS-clean and APS-error for training the classifiers, and APS-clean-val for optimisation of the SNNs and DNNs.

5.3 YCB-similar artificial primitive shapes

For each primitive shape, the APS set contained a wide range of shape variations. To make the training of the classifiers more focused on the recognition of the shapes of the YCB-28 objects, one more model set of artificial primitive shapes was created. This model set contained artificial primitive shapes with features (H, W, B, D) closer to those of the YCB-28 objects. This new set allowed simulating the case where some knowledge about the expected shape of the objects is available.

Specifically, the Open3D open-source library [46] was used to measure the shape features of the mesh models of the twenty-eight objects selected from the YCB set. Each mesh model was first visualised using Open3D. The shape features (H, W, B, D) of the objects were measured from the coordinates of manually picked key points in the visualised model. Three examples of the manually measured shape features are shown in Table 2, and the three objects and their mesh models are shown in Fig. 6, whilst the complete list of all the measured shape features of the twenty-eight selected YCB mesh models is detailed in Appendix 1. It should be noted that all point clouds are normalised in size before being fed to the classifiers. Therefore, the important information in the figures in Appendix 1 is not the size but the proportions of the features.

Fig. 6
figure 6

Original YCB objects, their mesh models, and point cloud models of similar shape

Table 2 Manually measured shape features of three sample mesh models from YCB-28 set

Afterwards, twenty point clouds were generated from each of the twenty-eight selected objects based on their measured shape features. To generate a new point cloud, each shape feature (H, W, B, D) was independently modified by a random amount. That is, each feature \({K\in \{H, W, B, D\}}\) of the primitive shape was randomly changed by an amount within \([-5\%,+5\%]\) of its size:

$$\begin{aligned} K' = (1 + x) \times K \end{aligned}$$
(1)

where \(x \sim U(-0.05,+0.05)\) is a number randomly sampled from the uniform distribution \(U(-0.05,+0.05)\). Consequently, each element of the new set of generated point clouds was similar, although not identical, in shape to the twenty-eight selected objects in YCB-28. Figure 6 shows examples of artificial point clouds next to the real-life objects from which they were generated.
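A minimal sketch of this feature jittering is given below; the measured values and the generate_primitive helper are illustrative placeholders for Baronti’s shape-generation code.

```python
# Sketch of the feature jittering of Eq. (1) used to build the YCB-similar set.
import numpy as np

rng = np.random.default_rng(0)

def jitter_features(features, rng, amplitude=0.05):
    """features: dict of measured (H, W, B, D) values for one YCB-28 object."""
    return {k: (1.0 + rng.uniform(-amplitude, +amplitude)) * v
            for k, v in features.items()}

# Example: twenty jittered variants of one measured cylinder (values illustrative)
measured = {"H": 0.14, "D": 0.075}
variants = [jitter_features(measured, rng) for _ in range(20)]
# point_clouds = [generate_primitive("cylinder", **v) for v in variants]  # hypothetical helper
```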

Henceforth, this new model set will be named YCB-similar. Its structure is as follows:

  • Box: 9 objects × 20 point clouds = 180 point clouds

  • Cylinder: 8 objects × 20 point clouds = 160 point clouds

  • Sphere: 11 objects × 20 point clouds = 220 point clouds

  • Total: 560 point clouds

In summary, the point cloud models of the YCB-similar set simulate the shape of objects in the YCB-28 set. They contain no sensor error, and will be used for training the classifiers.

5.4 Point cloud normalisation

The point clouds were rescaled to fit within a bounding box of side 1 before being fed to PointNet. The procedure consists of the following steps. Given a point cloud \({\mathcal {PC}}\) containing N points, each point \(\mathbf{{p_i}}\) is represented by coordinates \({(x_i, y_i, z_i)}\) in a Cartesian 3D frame F:

$$\begin{aligned} \mathcal {PC} = \lbrace \mathbf{{p_1, p_2, ...\, p_i, ...\, p_N }}\rbrace \end{aligned}$$
(2)
$$\begin{aligned} \mathbf{{p_i}} = (x_i, y_i, z_i),\ i \in [1,N] \end{aligned}$$
(3)

An initial translation is applied so that all points \(\mathbf{{p_i}}\) are described by non-negative \((x_i, y_i, z_i)\) coordinate values:

$$\begin{aligned} { \mathcal {PC}^{\prime } = \lbrace {\mathbf{{p^{\prime }_i}}}\rbrace = \lbrace {\mathbf{{p_i}} - \mathbf{{p_{min}}}} \rbrace , \ i \in [1,N] } \end{aligned}$$
(4)

where \(\mathbf{p_{min}} = (x_{min}, y_{min}, z_{min})\) and

$$\begin{aligned} {x_{min}=\min _{\forall i \in [1,N]} x_i}, \ {y_{min}=\min _{\forall i \in [1,N]} y_i}, \ {z_{min}=\min _{\forall i \in [1,N]} z_i} \ \end{aligned}$$
(5)

The point cloud is then scaled based on the diagonal of its bounding box, and limited within the interval \({x\in [0,1],\ y\in [0,1],\ z\in [0,1]}\):

$$\begin{aligned} \mathcal {PC}^{\prime \prime } = \lbrace {{\textbf{p}^{\prime \prime }_\textbf{i}}}\rbrace = \Big \{ {\frac{\textbf{p}^{\prime }_\textbf{i}}{s}} \Big \}, \ i \in [1,N] \end{aligned}$$
(6)

where

$$\begin{aligned} s = \sqrt{(x^{\prime }_{max})^2 + (y^{\prime }_{max})^2 + (z^{\prime }_{max})^2 } \end{aligned}$$
(7)

and

$$\begin{aligned} x^{\prime }_{max}=\max _{\forall i \in [1,N]} (x^{\prime }_i), \ y^{\prime }_{max}=\max _{\forall i \in [1,N]} (y^{\prime }_i), \ z^{\prime }_{max}=\max _{\forall i \in [1,N]} (z^{\prime }_i) \end{aligned}$$
(8)

Finally, the point cloud is translated again so that its centroid \(\mathbf{{C}}\) coincides with the origin of the Cartesian frame F:

$$\begin{aligned} \mathbf{{C}} = (x_c,y_c,z_c) = \Big ( \frac{1}{N}\sum _{i=1}^N x^{\prime \prime }_i, \ \frac{1}{N}\sum _{i=1}^N y^{\prime \prime }_i, \ \frac{1}{N}\sum _{i=1}^N z^{\prime \prime }_i \Big ) \end{aligned}$$
(9)
$$\begin{aligned} {\bar{\mathcal{PC}}} = \lbrace {\mathbf{p^{\prime \prime }_i} - \mathbf{{C}}}\rbrace , \ i \in [1,N] \end{aligned}$$
(10)

This procedure was introduced to resize the point clouds to a bounded space. It is important to note that the normalisation does not rescale the objects to a standard size, since the size after normalisation depends not only on the view but also on the orientation of the objects. For example, the size of a cuboid will be largest when all its sides are aligned with the coordinate axes, and smallest when one of its diagonals is aligned with a coordinate axis.
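For reference, the whole procedure of Eqs. (2) to (10) can be condensed into a few lines of numpy; the function name is illustrative.

```python
# Numpy sketch of the normalisation procedure of Eqs. (2)-(10).
import numpy as np

def normalise_point_cloud(points):
    """points: (N, 3) array -> rescaled cloud centred on the origin."""
    # Eqs. (4)-(5): translate so that all coordinates are non-negative
    shifted = points - points.min(axis=0)
    # Eqs. (6)-(8): scale by the diagonal of the axis-aligned bounding box
    s = np.linalg.norm(shifted.max(axis=0))
    scaled = shifted / s
    # Eqs. (9)-(10): translate the centroid to the origin
    return scaled - scaled.mean(axis=0)
```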

6 Experiments and results

This section describes the experimental setup and the results obtained by PointNet and the two SNN classifiers. Three artificial model sets (APS-clean, APS-error, and YCB-similar) were used for training the neural networks. The final performance of the trained classifiers was evaluated on the YCB-28 model set. In all experiments, 10 independent learning trials were performed, and the results were statistically analysed.

The PointNet architecture used in the experiments was obtained from the open-source code made available by Qi et al. [6] in their GitHub repository (Footnote 2). Following the methodology of Qi et al. [6], PointNet was trained using the Adam optimiser [47]. Adam updates the network weights based on gradients calculated from randomly picked mini-batches of point clouds of predefined size. The batch size is an important hyper-parameter of the algorithm.

Based on preliminary tests, the structure and most of the hyper-parameters of PointNet were kept as originally designed by Qi et al. [6], and the remaining hyper-parameters were manually optimised. They are shown in Table 3. The MLP was trained using the standard BP algorithm with momentum term [39], whilst the RBFN was trained using first a KNN-based algorithm for a broad-brush optimisation of the radial basis function parameters, and then the BP algorithm to fine-tune the whole network parameters. The hyper-parameters of the SNNs and those of their learning procedures were optimised by trial and error. They are detailed in Table 4.

Table 3 Hyper-parameters of PointNet used in the experiments
Table 4 Hyper-parameters of MLP and RBFN used in the experiments

Two hyper-parameters, the batch size used by the Adam training procedure and the number of training epochs, have the largest effect on the learning accuracy of PointNet. Section 6.1 describes the procedure followed for their experimental optimisation. This procedure was carried out using the artificial APS-clean model set. After optimisation, since PointNet was always tested on the same benchmark problem (YCB-28), the hyper-parameters were kept fixed for all the experiments. The results of the experiments are reported in Sect. 6.2.

6.1 Optimisation of the PointNet training procedure

PointNet was trained using the APS-clean model set, and the results were validated using the APS-clean-val set. That is, the DNN was trained and optimised using only knowledge from perfect artificial shapes.

6.1.1 Batch size optimisation

To optimise the batch size used by the Adam optimiser, tests were performed varying the hyper-parameter from 10 to 100 in steps of 10, fixing the number of training epochs to 200. Ten learning trials were performed for each batch size setting.

Fig. 7
figure 7

Performance of PointNet as the batch size setting is varied. The DNN was trained using the APS-clean set and the learning accuracy tested on the APS-clean-val model set

The experimental results are shown using box plots in Fig. 7. The red line within each box indicates the median result of the 10 independent learning trials. In terms of median accuracy, the performance improves noticeably when the batch size is increased from 10 (93.34% median accuracy) to 20 (97.67%), and from 40 (98.0%) to 50 (99.33%). Both improvements are statistically significant: the p-value is 0.0343 (significant at the \(\alpha = 0.05\) level) for the difference between the results obtained using batch sizes of 10 and 20, and \(p=0.0006\) (significant at the \(\alpha = 0.01\) level) for the difference between the results obtained using batch sizes of 40 and 50.
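Pairwise comparisons of this kind can be reproduced with the Mann-Whitney U test available in scipy, as sketched below; the accuracy values shown are placeholders, not the experimental results.

```python
# Sketch of a pairwise significance test using scipy's Mann-Whitney U test.
from scipy.stats import mannwhitneyu

# Placeholder accuracies of 10 trials per setting (not the values of Fig. 7)
acc_batch_40 = [0.975, 0.980, 0.978, 0.982, 0.979, 0.981, 0.977, 0.983, 0.980, 0.976]
acc_batch_50 = [0.992, 0.993, 0.994, 0.991, 0.995, 0.993, 0.992, 0.994, 0.993, 0.995]

stat, p_value = mannwhitneyu(acc_batch_40, acc_batch_50, alternative="two-sided")
print(f"U = {stat}, p = {p_value:.4f}")
```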

Beyond a size of 50, the statistical analysis suggests there are no benefits in any further increase of the batch size. However, as the batch size increases, the training procedure appears to become more consistent (smaller width of the box plots), and for this reason a batch size of 100 was chosen. Although this choice is the most computationally intensive, the training process can be sped up using GPU acceleration.

Since the number of point clouds (PCs) constituting the model sets used in the experiments is not a multiple of 100, the number of training examples in the last batch fed to the Adam optimiser had to be brought to 100. This was achieved by duplicating randomly picked PCs from the whole model set. For example, the APS-error set contained 591 PCs, which were fed in 6 batches of 100 PCs each. The last batch was formed by the remaining unused 91 examples, plus 9 randomly picked duplicates.
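A possible implementation of this padding step is sketched below; the function name is illustrative.

```python
# Sketch of the batch-padding step: the incomplete final batch is topped up
# with randomly picked duplicates from the whole model set.
import numpy as np

def pad_to_batch_size(model_set, batch_size, rng):
    """model_set: list of point clouds; returns a list whose length is a multiple of batch_size."""
    shortfall = (-len(model_set)) % batch_size   # e.g. 591 PCs -> 9 duplicates for batch size 100
    duplicates = [model_set[i] for i in rng.choice(len(model_set), size=shortfall)]
    return list(model_set) + duplicates
```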

6.1.2 Training epoch optimisation

After the batch size had been fixed to 100, the number of training epochs was optimised by trial and error. Experiments were performed increasing the number of epochs from 20 to 200 in steps of 20. The results are shown in Fig. 8.

Fig. 8
figure 8

Performance of PointNet as the number of training epochs is varied. The DNN was trained using the APS-clean set and the learning accuracy tested on the APS-clean-val model set

Figure 8 shows a progressive improvement in the performance of PointNet as the number of training epochs is increased up to 160. Pairwise Mann-Whitney statistical tests indicated that the performance of PointNet trained using 160 epochs is significantly superior to the performance obtained using any smaller number of training epochs (from 20 to 140 epochs). Further increases of the number of training epochs beyond 160 did not yield any significant improvement in performance. Consequently, this hyper-parameter was fixed to 160.

6.2 Experimental results — artificial model sets

As mentioned in Sect. 6, three instances of PointNet were trained using three artificial model sets: APS-clean, APS-error, and YCB-similar, using the hyper-parameters shown in Table 3 and discussed in Sect. 6.1. The performance of the trained PointNets was evaluated on their accuracy on the YCB-28 model set, and compared to that of two SNNs: an MLP and an RBFN. The results of the experiments are visualised using box plots in Fig. 9, and fully detailed in Table 5. Each box in the figure visualises the five-number summary of the accuracy results attained in the 10 independent learning trials. The significance of the differences in the results obtained in the various sets of learning trials was evaluated via pairwise Mann-Whitney tests. Table 6 fully details the results of the significance tests.

Fig. 9
figure 9

Accuracy results obtained on the YCB-28 model set by the three classifiers when trained using different sets: APS-clean (Clean), APS-error (Error) and YCB-similar (Similar)

Table 5 Shape identification accuracies obtained on the YCB-28 model set by the three neural network architectures. Each column reports the results of 10 independent learning trials performed using a different training set: APS-clean (Clean), APS-error (Error), and YCB-similar (Similar)
Table 6 Mann-Whitney test results for each pair of experiments. Results equal to or below the \(\alpha = 0.01\) confidence level are reported in bold

Figure 9 shows that PointNet achieved an average (median) accuracy of circa 86% when trained on the APS-error model set. If a stringent \(\alpha =0.01\) significance level is sought, Table 6 indicates that when trained on the APS-error model set, PointNet outperformed any other combination of classifier and training set, except for the MLP trained on the clean set. If the significance level is relaxed to the commonly used \(\alpha =0.05\), it can be said that PointNet trained on the APS-error set was the clear winner of the comparison.

PointNet did not perform equally well when trained using the APS-clean and YCB-similar model sets, although the performance on the latter (circa 81% average accuracy) was still adequate.

Despite the unsophisticated feature extraction method used, the MLP performed remarkably well. Trained using the APS-clean model set, it achieved nearly 83% classification accuracy, and slightly less (82%) when trained using the APS-error set. Compared to PointNet, the MLP obtained more consistent learning results, as shown by the width of the box plots. The RBFN was the clear underperformer of the three tested classifiers. Finally, training the classifiers on the YCB-similar set did not provide any visible benefit, particularly for the two SNNs.

7 Discussion

In this study, the hyper-parameters of PointNet were tuned using artificial models. The exercise can be seen as an attempt to evaluate whether knowledge gained on artificial models could be transferred to real-life scenes. The study aimed also at testing whether simple SNNs were able to obtain results comparable to those obtained by the much more complex PointNet.

The experimental tests showed that, in terms of accuracy, PointNet had indeed an edge, albeit small, over a standard shallow MLP classifier. However, the MLP showed more consistent training results. The tests also indicated that PointNet performs best (85.98%) when trained on scenes that were perturbed with some level of random error. Training PointNet using shapes with features similar to those of the real objects, instead of training it with more general artificial shapes, did not significantly improve the accuracy of the classifier.

The above results suggest that the generalisation accuracy of PointNet is likely to be sensitive to sensor error. As the mesh models in Figs. 2, 3, and 4 show, this error had not been completely removed by the pre-processing procedures. Trained for 160 epochs on perfect artificial shapes, PointNet was able to obtain nearly perfect recognition accuracy (99.59%) on previously unseen artificial shapes (Fig. 8). However, the average accuracy of PointNet dropped to 78.84% when the trained network was tasked with recognising the shape of real-life objects. The learning results were also not consistent, and varied widely in quality across the ten independent learning trials performed (Table 5).

Training PointNet on models whose features are closer to those of the real-life objects (YCB-similar) produced only a marginal improvement in accuracy and consistency, because the YCB-similar models still consist of clean geometric shapes. However, training PointNet using noisy scenes (APS-error) markedly improved its learning accuracy and consistency. That is, training the DNN to recognise ‘imperfect’ shapes increased its ability to correctly classify shapes from imperfect real-life scans.

The difficulties encountered by PointNet to generalise the knowledge learned from artificial models to models of real-life objects have been already reported by other authors [14]. These difficulties are common to the general DNN field, where large over-parameterised structures are often able to perfectly fit the training data, and have issues of poor generalisation or overfitting [48, 49].

A contribution from this study is the confirmation of the validity of the idea of injecting local imprecision into the training shapes, so as to ‘blur’ their boundaries and prevent PointNet from learning the perfect examples. This approach bears similarities with data augmentation techniques where slightly modified copies of already existing data are added to the training set, in order to regularise the neural network models [50]. Other similar regularisation procedures contaminate the input patterns with randomly re-sampled noise at each iteration of the learning procedure [51, 52]. The common approach of all these procedures is to oversample the training set to smooth the mapping of the neural network model. Their common goal is to optimise the bias-variance trade-off of the learned model, and promote generalisation [51]. Rather than smoothing the PointNet mapping, the approach used in this study aims to promote a ‘tolerance’ to local imprecision, similar to the approximation threshold used in the RANSAC algorithm [53]. The proposed approach also does not augment the training data or re-sample the noise at every iteration, thus promoting the efficiency of the learning procedure.

It should be noted that both the SNNs obtained their best learning accuracies when trained on the APS-clean set. This result might be due to the particular feature extraction method used (Sect. 4.1), where the simulated sensor error might have excessively blurred the shape of the projections of some objects. It may also indicate that the SNNs did not overfit the training data. In general, trained on the APS-clean set, the MLP obtained higher classification accuracies and more consistent results than PointNet.

The main advantage of PointNet is that the data requires only minimal pre-processing (normalisation and down-sampling), beyond the standard cleaning of the raw point cloud scenes. In particular, PointNet does not need the feature extraction process required by the MLP. The feature extraction process is embedded in the first block of layers of PointNet, and optimised simultaneously with the classifier by the learning algorithm. In this study, the feature extraction process was carried out prior to the MLP training procedure, and it is possible that the criterion of the former did not perfectly match the inductive and representational biases of the latter. The joint optimisation of the feature extraction and classification procedures might have given PointNet an edge over the MLP.

7.1 Indications for future work

This study showed that PointNet can be trained using artificial data to recognise shapes in real-life scans of objects with good accuracy. This approach makes it easier for designers to build the usually large data set needed to train the classifier.

The experimental work was based on the recognition of three primitive shapes: box, cylinder, and sphere. Further work should validate the proposed method on a more varied and complex set of objects. In particular, given the context of robotic disassembly, the proposed approach should be validated on models of real mechanical parts. Although preliminary tests suggested the applicability of the technique to complex automotive components [7], the simulations did not take into account real-world occurrences such as reflective surfaces.

The main hurdle to an extensive testing of the proposed approach has been so far the lack of a database of real-life mechanical object models. The assembly of such a set has been hampered by the restrictions due to the recent pandemic. The collection and scanning of object samples is now a priority.

PointNet has shown a tendency to overfit the training data, with poor generalisation on noisy data. The addition of noise to the training samples boosted the performance of PointNet. Further work should be done to test other regularisation techniques such as dropout [54] and weight decay [55].

The generalisation ability of PointNet could also be improved by decreasing the complexity of its architecture. Based on preliminary tests, this architecture has been kept so far similar to the one originally designed by Qi et al. [6] (see Sect. 6). A more thorough analysis might reveal the advantage of more economical structures.

Finally, the segmentation ability of PointNet should also be explored. Scene segmentation will be very useful in disassembly scenarios to identify end-of-life products and their sub-assemblies in scanned scenes.

8 Conclusions

This study investigated the possibility of training the PointNet DNN on point cloud models of perfect geometric primitive shapes, and use it to recognise primitive shapes in models of daily-life objects. The ultimate objective of the study is to use PointNet to generate shape information for robotic manipulation and disassembly of end-of-life products.

Experimental tests showed that, trained on perfect geometric shapes, PointNet was able to recognise primitive shapes in real-life objects with nearly 80% average accuracy. The tests also showed some inconsistency in the performance of the DNN. Trained on perfect geometric shapes using a simple feature extraction method, a simple shallow MLP architecture obtained better results than PointNet in terms of average accuracy and consistency of the learning results. The main difficulty found by PointNet seemed to consist of generalising the knowledge gained on perfect artificial shapes to real-life cases. This finding confirmed the results of a handful of similar experiments in the literature, and is the first contribution of this study.

The accuracy of PointNet could be raised to nearly 86% by locally perturbing the position of the elements of the training point clouds. This operation corresponded to blurring the representation of the shapes, in order to train the DNN on imprecise models that are more similar to real-life representations. In this study, this new training method has been verified on the recognition of shapes from real-life objects. In addition to improving the recognition accuracy, it also greatly improved the consistency of PointNet learning results. This result constitutes the second contribution of this study.

Indications for further work were given in Sect. 7.