Noise4Denoise: Leveraging Noise for Unsupervised Point Cloud Denoising

Existing deep learning-based point cloud denoising methods are generally trained in a supervised manner that requires clean data as ground-truth labels. However, in practice, it is not always feasible to obtain clean point clouds. In this paper, we introduce a novel unsupervised point cloud denoising method that eliminates the need for clean point clouds as ground-truth labels during training. We demonstrate that it is feasible for neural networks to take only noisy point clouds as input, and learn to approximate and restore their clean versions. In particular, we generate two noise levels for the original point clouds, requiring the second noise level to be twice the amount of the first noise level. With this, we can deduce the relationship between the displacement information that recovers the clean surfaces across the two levels of noise, and thus learn the displacement of each noisy point in order to recover the corresponding clean point. Comprehensive experiments demonstrate that our method achieves outstanding denoising results across various datasets with synthetic and real-world noise, performing better than previous unsupervised methods and competitively with current supervised methods.


Introduction
Point clouds are used in a wide range of scenarios, such as autonomous driving, immersive reality, robotics, and remote sensing. Despite their notable advantages such as intuitive representations and lightweight characteristics, raw point clouds from sensors are usually corrupted with noise which can degrade their positional accuracy. Such noise comes in the form of displacement vectors that perturb the points and prevent them from accurately representing the objects' surfaces. Noise contamination is often caused by the sensors' quality and precision limitations or by factors from the environment, including material reflectivity and lighting conditions.
As such, point cloud denoising is an essential and fundamental research problem. However, since point clouds are unordered and lack connectivity information [1,2], removing noise from them has been a long-standing challenge over the decades. Traditional point cloud denoising methods, based on optimisation techniques [3][4][5], often need non-trivial parameter tuning and thus can be burdensome for users. Recently, deep learning-based denoising methods [6][7][8] have alleviated the burden of parameter tuning and enhanced the robustness of the denoising performance. Most existing learning-based methods require pairs of noisy data and the corresponding ground-truth (i.e., clean) data to guide training. However, the ground-truth point clouds are not always available, as it is infeasible to obtain noise-free point clouds in some scenarios due to scanners' precision limitations or environmental constraints. In recent years, unsupervised denoising methods for images, including [9][10][11], have demonstrated desirable results for noise removal without training the networks using clean data. Motivated by this, we aim to design a new neural network to denoise 3D point clouds that does not require ground-truth data as training labels.
In this paper, we propose a novel unsupervised point cloud denoising framework. The primary insight of our method is that neural networks can learn to remove noise without requiring clean points as labels. During training, we generate two versions of point clouds: one with standard noise and another with double noise. Then, we deduce the relationship between the displacement information in these two noisy versions, enabling us to design an effective loss term for minimisation. Our network leverages an encoder-decoder architecture which takes in local neighbourhoods from the doubly-noisy point clouds, learns their features, and subsequently outputs displacement vectors to restore clean point positions. Notably, our training is accomplished using only the noisy point clouds, without reference to clean data. Extensive experiments on different datasets demonstrate that our method can be generalised to both synthetic and real-world noise, showing comparable performance to supervised methods.
In summary, the contributions of our approach are outlined as follows:
• We propose an unsupervised point cloud denoising network that eliminates the need for ground-truth point clouds as labels during training.
• Our network is designed to learn noise residuals using merely noisy data, without additional inputs beyond point clouds during training and testing.
• Our approach can achieve outstanding denoising results on synthetic and real-world noise compared to existing unsupervised and supervised methods.

Optimisation-based methods
Traditional point cloud denoising methods are typically optimisation-based, removing noise by optimising a set of geometrical constraints. A main category of existing methods is based on moving least squares (MLS), which aims to approximate the clean underlying surfaces of the input shapes. Based on the seminal work [12], Alexa et al. [3,13] proposed an MLS-based method to approximate the clean surfaces of the input noisy point clouds. Several other relevant variants were proposed later, including Point Set Surfaces [14], Robust MLS (RMLS) [15], and Robust Implicit MLS (RIMLS) [16]. Another category of existing methods is based on locally optimal projection (LOP), which projects noisy points onto the optimal surfaces. The early work, proposed by Lipman et al. [4], eliminates the burden of parameterisation. Subsequently, further works were proposed, including Weighted LOP (WLOP) [17], Edge Aware Resampling (EAR) [18], and Continuous LOP (CLOP) [19]. There are also other denoising methods including jet fitting (Jet) [20], bilateral filtering (Bilateral) [21], feature-preserving methods [22,23], and the method based on aligning non-local similar patches (denoted as Non-local) [5].

Supervised learning-based methods
As deep learning techniques have been increasingly used to handle point clouds [1,24], data-driven point cloud denoising methods, most of which are supervised, have demonstrated their robustness and generalisability. One of the early works is PointCleanNet (PCNet) [6], which requires pairs of noisy and clean point clouds for training. Luo and Hu [7] proposed Differentiable Manifold Reconstruction (DMR), an autoencoder-like network that reconstructs manifolds and re-samples denoised points. Nevertheless, the resampling process may blur details on the shapes. To address this limitation, Score-based Denoising (Score) [8] utilises gradient ascent to guide the denoising process. In recent years, other supervised methods have also been proposed [25][26][27][28].

Unsupervised learning-based methods
While supervised methods have demonstrated superior capabilities, ground-truth labels are not always available during training. To address the issue, unsupervised point cloud denoising methods have emerged over the years. Hermosilla et al. [29] proposed Total Denoising (TotalDn), which achieves noise removal by converging noisy points to their unique modes. However, it usually requires multiple denoising iterations to effectively remove noise. In the aforementioned DMR denoising method [7], Luo and Hu observed that points with denser neighbourhoods are typically closer to the ground-truth surfaces. Thus, these points can be treated as denoising objectives. With that in mind, they designed an unsupervised version of DMR which trains the network using an altered unsupervised loss function. Similarly, Score [8] exploits an unsupervised loss that shares the same training objective as the supervised one. However, the unsupervised loss tends to cluster points together, resulting in denoised point clouds that are distributed unevenly.

Unsupervised image denoising
Conventional image denoising operations, such as Non-local Means [30] and Bilateral Filter [31], as well as supervised methods [32,33], have demonstrated promising capabilities and have been widely adopted. However, noise-free images are not always available as ground-truth labels during the training process. Over the years, a range of unsupervised image denoising methods have been proposed, demonstrating results comparable to those of conventional and supervised methods. For instance, Deep Image Prior [34] can learn image priors and produce high-quality images without taking ground-truth data as input. Lehtinen et al. [9] proposed Noise2Noise, where the network only observes noisy images and learns to restore their clean versions. Krull et al. [10] proposed Noise2Void, where the objective pixels' values are masked, and the network attempts to predict those values from neighbouring patches. Similarly, Batson and Royer [35] proposed Noise2Self, which aims to predict residual noise from noisy images alone. Moran et al. [11] proposed Noisier2Noise to restore clean images from additional noise that is added to noisy images. In summary, the aforementioned unsupervised methods showcase superior noise removal capabilities on images.

3 Our method

Problem formulation
In this section, we formulate the relationship between the displacement information across different versions of noisy point clouds, and establish a structured approach to approximating the clean points that is inspired by unsupervised image denoising [11]. We start with the basic definitions: a clean point cloud $P$ with $T$ points can be defined as a set of 3D coordinates as

$$P = \{\, p_i \mid i = 1, \dots, T \,\},$$

where $p_i = (x_i, y_i, z_i)$ stands for each point in $P$. We also define the noisy version of $P$ (denoted as $\tilde{P}$) as

$$\tilde{P} = P + N = \{\, \tilde{p}_i \mid i = 1, \dots, T \,\}, \tag{2}$$

where $\tilde{p}_i$ stands for each noisy point in $\tilde{P}$, $N$ is the noise component that perturbs each point, and we denote $N \sim A$ where $A$ stands for the noise distribution.
We then define another noise component $M$ sampled from the same noise distribution, such that $M \sim A$. Note that the two noise components $M$ and $N$ are independent and identically distributed (i.i.d.), but are NOT identical to each other. We add $M$ to $\tilde{P}$ to obtain another noisier point cloud $\tilde{P}'$ where

$$\tilde{P}' = \tilde{P} + M = P + N + M,$$

such that $\tilde{P}'$ is doubly-noisy and contains twice the amount of noise in $\tilde{P}$. The concepts of clean, noisy and doubly-noisy point clouds are shown in Fig. 1. Our network has no access to the ground-truth point clouds during training.
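The construction of the noisy and doubly-noisy point clouds can be sketched in a few lines. This is a minimal illustration rather than the training code: it assumes zero-mean Gaussian noise for the distribution $A$ (the choice described later for training), and the function names `add_gaussian_noise` and `make_training_pair` are hypothetical.

```python
import random


def add_gaussian_noise(points, sigma, rng):
    """Return a perturbed copy of `points` (list of 3-tuples).

    Assumption: the noise distribution A is zero-mean Gaussian with
    standard deviation `sigma`, drawn independently per coordinate.
    """
    return [tuple(c + rng.gauss(0.0, sigma) for c in p) for p in points]


def make_training_pair(clean_points, sigma, seed=0):
    """Build the noisy and doubly-noisy clouds with point-wise correspondence."""
    rng = random.Random(seed)
    noisy = add_gaussian_noise(clean_points, sigma, rng)    # P + N
    doubly_noisy = add_gaussian_noise(noisy, sigma, rng)    # (P + N) + M, M drawn independently
    return noisy, doubly_noisy
```

Because the second draw is applied on top of the first, index `i` refers to the same underlying clean point in all three clouds, which is the one-to-one correspondence the derivation relies on.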
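We argue that it is feasible to estimate the clean point cloud $P$ based on the noisy inputs. Given $\tilde{P}'$ as the prior, which represents the observed doubly-noisy point cloud, we denote the estimation of the overall surface of the noisy point cloud $\tilde{P}$ in an expected value format $E(\tilde{P} \mid \tilde{P}')$. Based on Eq. (2), we have

$$E(\tilde{P} \mid \tilde{P}') = E(P \mid \tilde{P}') + E(N \mid \tilde{P}').$$

Since $M$ and $N$ are i.i.d., $E(M \mid \tilde{P}') = E(N \mid \tilde{P}')$, so the doubly-noisy point cloud satisfies

$$E(\tilde{P}' \mid \tilde{P}') = E(P \mid \tilde{P}') + 2E(N \mid \tilde{P}'),$$

which thus leads to

$$2E(\tilde{P} \mid \tilde{P}') = E(P \mid \tilde{P}') + E(\tilde{P}' \mid \tilde{P}'). \tag{6}$$

Here, we denote $E(P \mid \tilde{P}') = \hat{P}$, $E(\tilde{P} \mid \tilde{P}') = \hat{\tilde{P}}$ and $E(\tilde{P}' \mid \tilde{P}') = \hat{\tilde{P}}'$, where $\hat{P}$, $\hat{\tilde{P}}$ and $\hat{\tilde{P}}'$ stand for the predictions of $P$, $\tilde{P}$ and $\tilde{P}'$, respectively. We substitute the notations back to Eq. (6) and have

$$\hat{P} = 2\hat{\tilde{P}} - \hat{\tilde{P}}'. \tag{7}$$

It is worth noting that $\hat{\tilde{P}}' = E(\tilde{P}' \mid \tilde{P}')$ is equivalent to $\tilde{P}'$ itself based on the rules of conditional expectation and point clouds' discrete characteristics. We also assume the points in $P$, $\tilde{P}$ and $\tilde{P}'$ have one-to-one correspondences. Based on this assumption, we can estimate a noise displacement vector for each point in $\tilde{P}'$, and use these vectors to build the approximated clean point cloud $\hat{P}$. Here, we denote the relationship as

$$\hat{P} = \{\, \hat{p}_i = \tilde{p}'_i + d'_i \mid i = 1, \dots, T \,\},$$

where $d'_i$ represents the predicted displacement vector for point $\tilde{p}'_i$ and directly estimates the position of the corresponding clean point.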
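Fig. 2 The network architecture of our method. We employ normalisation before inputting the patch into our encoder, and perform denormalisation during the testing phase.

The prediction $\hat{\tilde{P}}$ can also be obtained by adding another set of displacement vectors on $\tilde{P}'$. We denote that as

$$\hat{\tilde{P}} = \{\, \hat{\tilde{p}}_i = \tilde{p}'_i + \hat{\tilde{d}}_i \mid i = 1, \dots, T \,\}. \tag{10}$$

Based on our one-to-one correspondence assumption, we finally demonstrate that the direct estimation $d'_i$ is equivalent to twice $\hat{\tilde{d}}_i$ by substituting the values back into Eq. (7):

$$\tilde{p}'_i + d'_i = 2(\tilde{p}'_i + \hat{\tilde{d}}_i) - \tilde{p}'_i \;\Longrightarrow\; d'_i = 2\hat{\tilde{d}}_i,$$

which guides us in designing our loss function for training.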
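The doubling relation can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' procedure: it uses the actual displacement from a doubly-noisy point to its noisy counterpart (not a network prediction), assumes Gaussian noise, and shows that the doubled displacement lands on the clean point on average. The function names are hypothetical.

```python
import random


def clean_estimate(p_doubly_noisy, d_tilde):
    """Move a doubly-noisy point by twice the displacement towards its noisy version."""
    return tuple(q + 2.0 * d for q, d in zip(p_doubly_noisy, d_tilde))


def mean_estimate(clean_point, sigma, trials, seed=0):
    """Average the doubled-displacement estimate over many noise draws."""
    rng = random.Random(seed)
    sums = [0.0, 0.0, 0.0]
    for _ in range(trials):
        noisy = tuple(c + rng.gauss(0.0, sigma) for c in clean_point)    # p + n
        doubly = tuple(c + rng.gauss(0.0, sigma) for c in noisy)         # p + n + m
        d_tilde = tuple(a - b for a, b in zip(noisy, doubly))            # noisy minus doubly-noisy
        estimate = clean_estimate(doubly, d_tilde)                       # equals p + n - m
        sums = [s + e for s, e in zip(sums, estimate)]
    return tuple(s / trials for s in sums)
```

Each individual estimate equals the clean point plus the difference of the two noise draws, which has zero mean; only the average converges to the clean position, which is why the relation is stated in terms of expectations.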

Denoising framework
Our network architecture is shown in Fig. 2. Similar to previous literature [6,25,27], we design a point cloud denoising pipeline based on local neighbourhoods as patches. The noisier point cloud $\tilde{P}'$ is the input source of our network, and we define the local patch $\tilde{P}'_i$ around each query point $\tilde{p}'_i$ as

$$\tilde{P}'_i = \{\, \tilde{p}'_j \mid \|\tilde{p}'_j - \tilde{p}'_i\|_2 < r \,\},$$

where $r$ is the query ball's radius, $\|\cdot\|_2$ represents the L2-norm, and $\tilde{p}'_j$ stands for the neighbour points within the radius $r$. During training, we empirically set $r$ to be equivalent to 5% of the bounding box diagonal length of $\tilde{P}'$, which is a common setting in prior literature [25,27]. Meanwhile, we also obtain $\tilde{p}_i$, the point with the corresponding index $i$ in point cloud $\tilde{P}$, for training purposes.
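A brute-force version of the patch query described above might look as follows. It is a sketch only: the strict inequality, the linear scan (a real pipeline would likely use a spatial index), and the names `bounding_box_diagonal` and `ball_query` are assumptions made for illustration.

```python
import math


def bounding_box_diagonal(points):
    """Length of the diagonal of the axis-aligned bounding box of `points`."""
    mins = [min(p[k] for p in points) for k in range(3)]
    maxs = [max(p[k] for p in points) for k in range(3)]
    return math.dist(mins, maxs)


def ball_query(points, query_index, radius):
    """Indices of all points strictly within `radius` of the query point."""
    q = points[query_index]
    return [j for j, p in enumerate(points) if math.dist(p, q) < radius]


# Radius set to 5% of the bounding box diagonal, as in the text.
cloud = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (1.0, 1.0, 1.0)]
r = 0.05 * bounding_box_diagonal(cloud)
patch_indices = ball_query(cloud, 0, r)
```

Returning indices rather than coordinates makes it straightforward to look up the point with the same index in the noisy cloud for training.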
We then perform the normalisation process to minimise the arbitrary degrees of freedom in the queried patches and facilitate effective training. We first centre $\tilde{P}'_i$ by translating it to its query point and normalise its size using $r$, such that $\tilde{P}'_i = (\tilde{P}'_i - \tilde{p}'_i)/r$. Then, we align $\tilde{P}'_i$ with the canonical space using the rotation matrix $R$, which was obtained via Principal Component Analysis (PCA) decomposition on $\tilde{P}'_i$. We also process $\tilde{p}_i$ following the same transformation procedure as for $\tilde{P}'_i$. Next, as the number of points may differ between raw patches, we set an empirical threshold $N = 500$ to regularise the patch size for $\tilde{P}'_i$ and ensure the patches can be grouped into mini-batches. Specifically, following previous works, we downsample patches with more than $N$ points, and fill extra points for patches with fewer points than $N$. Note that during the testing phase, we require the inverse of the aforementioned transformations (rotation, scaling and translation) to map the denoised point back to its original position.
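The centring, scaling and patch-size regularisation steps can be sketched as below. Several things here are assumptions for illustration: the PCA rotation is omitted to keep the example dependency-free, padding is done by repeating randomly chosen existing points (the text does not say how extra points are filled), and the function names are hypothetical.

```python
import random


def normalise_patch(patch, query_point, radius):
    """Translate the patch to its query point and scale by the query radius."""
    return [tuple((c - q) / radius for c, q in zip(p, query_point)) for p in patch]


def denormalise_point(point, query_point, radius):
    """Inverse of the scaling and translation, used at test time."""
    return tuple(c * radius + q for c, q in zip(point, query_point))


def resample_patch(patch, size, rng):
    """Force a patch to exactly `size` points so patches can be batched.

    Larger patches are randomly downsampled; smaller ones are padded by
    repeating randomly chosen existing points (an assumed padding scheme).
    """
    if len(patch) >= size:
        return rng.sample(patch, size)
    extra = [rng.choice(patch) for _ in range(size - len(patch))]
    return patch + extra
```

If the rotation were included, the test-time inverse would apply the transpose of $R$ before undoing the scaling and translation.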
We input the normalised patch $\tilde{P}'_i$ (as an $N \times 3$ matrix) into our point cloud encoder. It uses graph-based Dense Block modules [36] as its backbone and encodes each point with its neighbours' information. To further enhance the network's generalisation performance, we adopt the Global Shift module in [37] to fuse patch features into fewer points, in order to improve the robustness of our network and increase the inference efficiency. The output of our encoder is a feature matrix which goes through our output module to form the final predicted displacement $d'_i$, a $1 \times 3$ vector. During training, we utilise the noisy point $\tilde{p}_i$ to train the estimation of the displacement vector.

Loss functions
We exploit $\tilde{d}_i = \tilde{p}_i - \tilde{p}'_i$, the actual value of the prediction $\hat{\tilde{d}}_i$ in Eq. (10), to guide the estimation of the displacement $d'_i$. We design a Mean Squared Error (MSE) loss function which is defined as

$$\mathcal{L}_{mse} = \| d'_i - 2\tilde{d}_i \|_2^2,$$

and we further discuss our choice regarding noise prediction in Sec. 5. We also require our denoised point cloud to be spaced out to avoid potential clustering issues. To achieve this, we design a repulsion loss that is inspired by prior literature [6,25,27] to assist with the distribution of the denoised points. First, we define a pseudo-clean point cloud $\bar{P}$, which does not actually have a clean appearance. Next, we query pseudo-clean patches $\bar{P}_i$ from it. Here, we use $\tilde{p}_i \in \tilde{P}$ to perform queries for $\bar{P}_i$ as we assume $\tilde{p}_i$ is less impacted by noise, and normalise patch $\bar{P}_i$ using the same normalisation process introduced in Sec. 3.2. Then, we use these patches to formulate our repulsion loss $\mathcal{L}_{rep}$. Our final loss $\mathcal{L}$ is thus formulated as

$$\mathcal{L} = \mathcal{L}_{mse} + \gamma \mathcal{L}_{rep},$$

where $\gamma$ is the factor controlling the repulsion loss and is set to 0.0005. We discuss the setting of $\gamma$ in Sec. 5.
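A plain-Python sketch of the regression term may help make the target explicit. It assumes the MSE term compares the predicted displacement with twice the actual displacement between the doubly-noisy and noisy points, averaged over a batch; the repulsion term is passed in as a precomputed number because its exact form is not reproduced here. All function names are hypothetical.

```python
def displacement_target(p_noisy, p_doubly_noisy):
    """Actual displacement from a doubly-noisy point to its noisy counterpart."""
    return tuple(a - b for a, b in zip(p_noisy, p_doubly_noisy))


def mse_loss(predicted, noisy_batch, doubly_noisy_batch):
    """Mean over the batch of the squared L2 error against twice the target.

    Assumption: the network output is regressed against the doubled
    displacement, following the doubling relation derived earlier.
    """
    total = 0.0
    for d_pred, p_n, p_dn in zip(predicted, noisy_batch, doubly_noisy_batch):
        d_tilde = displacement_target(p_n, p_dn)
        total += sum((dp - 2.0 * dt) ** 2 for dp, dt in zip(d_pred, d_tilde))
    return total / len(predicted)


def total_loss(l_mse, l_rep, gamma=0.0005):
    """Weighted sum of the regression and repulsion terms."""
    return l_mse + gamma * l_rep
```

In a PyTorch implementation the same computation would typically be expressed with tensor operations so that gradients flow through the predicted displacements.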

Training configurations
We carried out our experiments on an NVIDIA RTX 3080 GPU with 10 GB memory. We implemented our network model with PyTorch, and set the batch size for training to 128. We trained the network for 200 epochs with a single Adam optimiser. The learning rate was set to 0.0001 and was multiplied by a factor of 0.1 in the 40th, 80th, 120th, 160th, and 180th epochs, respectively.
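The step schedule above can be written as a small function. This is a sketch with one assumption worth flagging: epochs are counted from zero and the decay takes effect from the milestone epoch onwards; the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-4, milestones=(40, 80, 120, 160, 180), factor=0.1):
    """Learning rate at a given epoch under a multi-step decay schedule.

    Assumption: `epoch` is zero-based and each milestone applies from
    that epoch onwards.
    """
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** passed
```

In PyTorch, a schedule of this shape is usually expressed with `torch.optim.lr_scheduler.MultiStepLR` using the same milestones and `gamma=0.1`.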

Training dataset
We adopted PUNet [38] for training, which contains 40 clean triangular mesh shapes and the corresponding noise-free point clouds in three different resolutions (10k, 30k and 50k, respectively). All clean point clouds were taken from the surfaces of the mesh shapes, and we sampled 1,000 patches per point cloud in each epoch. During training, we followed the noise addition method in [8,26], i.e., the additive noise is zero-mean Gaussian noise with a standard deviation ranging from 0.5% to 2.0% of each shape's bounding sphere radius. To achieve this, the clean point cloud should first be normalised into a unit sphere, and noise $Z \sim D(\mu_D, \sigma_D^2)$ is then added, where $D$ is the objective noise distribution, $\mu_D = 0$, and $\sigma_D \in [0.005, 0.02]$.
We define the distribution of each level of the additive noise during training as $A(\mu_A, \sigma_A^2)$, and recall that $N \sim A$ and $M \sim A$. To ensure $N + M = Z$ holds, we need $2\mu_A = \mu_D$ and $2\sigma_A^2 = \sigma_D^2$. Thus, we set $\mu_A = 0$ and $\sigma_A \in [0.005/\sqrt{2},\, 0.02/\sqrt{2}]$ for noise sampling during our training process.
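As a quick numerical check of the variance argument, the sketch below computes the per-level standard deviation and verifies empirically that the sum of two independent draws has the target standard deviation. The function names are hypothetical and the check uses scalar Gaussian samples only.

```python
import math
import random


def per_level_sigma(sigma_d):
    """Standard deviation of each noise level so that two levels sum to sigma_d."""
    return sigma_d / math.sqrt(2.0)


def empirical_std_of_sum(sigma_a, samples, seed=0):
    """Sample n + m with n, m independent zero-mean Gaussians and return the std."""
    rng = random.Random(seed)
    values = [rng.gauss(0.0, sigma_a) + rng.gauss(0.0, sigma_a) for _ in range(samples)]
    mean = sum(values) / samples
    return math.sqrt(sum((v - mean) ** 2 for v in values) / samples)
```

For the upper bound $\sigma_D = 0.02$, this gives a per-level value of roughly 0.01414, matching $0.02/\sqrt{2}$.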

PUNet dataset
First, we evaluated the performance of our method on the PUNet test dataset. It contains point clouds sampled from 20 mesh shapes in 2 resolutions (sparse, 10k points per shape; and dense, 50k points per shape), and has 3 Gaussian noise levels (1%, 2% and 3% of the bounding sphere's radius) for each resolution. Following prior works [8,26], we ran 1 denoising iteration for 1% and 2% noise and ran 2 denoising iterations for 3% noise. We employed two metrics, Chamfer Distance (CD) and Point-to-mesh Distance (P2M), to measure the quality of the denoised point clouds. For both metrics, smaller values indicate more accurate results. We denote the unsupervised versions of DMR and Score using DMR-U and Score-U, respectively. The denoising results are shown in Table 1, where our method outperforms all others on each setting.

[Figure panel labels: Noisy, PCNet, Non-local, TotalDn, DMR-U, Score-U, Ours, Clean]
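For readers unfamiliar with the Chamfer Distance used in the evaluation, a brute-force sketch is given below. Definitions vary across papers; this version assumes squared Euclidean distances, averaged in each direction and then summed, which may differ from the exact variant used for the reported numbers. The function name is hypothetical.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between two point lists (O(len(a) * len(b))).

    Assumption: squared nearest-neighbour distances, averaged per
    direction and summed over both directions.
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)
```

The quadratic cost is acceptable for small examples; for clouds with tens of thousands of points a nearest-neighbour index would normally be used instead.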

Kinect dataset
We also tested our method on real-world scanned data captured by Kinect v1 and v2 [39], which is inherently noisy. The datasets provide reconstructed clean data for quantitative measurement. We present the results in Table 2, where our method outperforms other unsupervised denoising techniques as well as the supervised method PCNet and the optimisation-based method Non-local in terms of CD and P2M errors.

For the visual comparison on the PUNet dataset, the colours encode the distance of the denoised points from the ground-truth mesh surfaces, where blue indicates more accurate positions and yellow indicates inaccurate positions. The results are shown in Fig. 3, which includes both the sparse and the dense resolutions. As illustrated, the downsampling-resampling strategy of DMR-U does not effectively remove the residual noise, leaving a substantial number of noisy points that are marked in yellow. Score-U tends to cluster the points together, resulting in uneven distributions. Other methods cannot effectively preserve small details, resulting in yellow regions in those areas. By contrast, our method achieves more accurate results overall on both sparse and dense configuration settings.

Kinect dataset
We also present the visual denoising effects on the Kinect dataset in Fig. 4, where the denoised shapes are taken from the results presented in Table 2.

Paris-rue-Madame dataset
We show denoising results on Paris-rue-Madame [40], a street scene point cloud dataset collected by laser scanners.
It is contaminated with severe noise due to outdoor environmental factors.No ground-truth data is associated with this dataset, so we compare the visual denoising effects in terms of smoothness and feature preservation.The results are shown in Fig. 5 where our method outputs smoother point clouds and effectively restores small details, as illustrated in the close-up windows.

Ablation study
We performed our ablation study on the validation dataset following the settings in [8] and measured CD and P2M metrics.In this section, we explore the effect of repulsion loss and our displacement regression technique.

Repulsion loss
We first validated the effects with and without the repulsion term $\mathcal{L}_{rep}$, and additionally evaluated different $\gamma$ values.
The results are shown as configurations 1 to 6 in Table 3, where setting $\gamma = 0.0005$ achieves the best results among the options.

Displacement regression technique
An alternative training technique is regressing the displacement for noise $M$ only and doubling it during the evaluation phase. Consequently, we should regress $\hat{\tilde{d}}_i$ in this setting and the altered training objective becomes

$$\mathcal{L}' = \mathcal{L}'_{mse} + \gamma \mathcal{L}_{rep},$$

where we define $\mathcal{L}'_{mse}$ as

$$\mathcal{L}'_{mse} = \| \hat{\tilde{d}}_i - \tilde{d}_i \|_2^2.$$

We set $\gamma$ to be consistent with our experiments. The results, displayed as configuration 7 in Table 3, show that this alternative does not achieve comparable performance on the validation dataset.

Application
Denoised point clouds can be utilised for mesh reconstruction. To achieve this, we first exploited HSurf-Net [37] to estimate normals for the denoised point clouds produced by different methods, and then performed Poisson surface reconstruction [41]. The denoised point clouds and the reconstructed meshes are shown in Fig. 6, where our method achieves smoother results on the mesh surface.

Discussion and conclusion
In this paper, we presented a novel unsupervised point cloud denoising framework that learns to restore clean point clouds from noisy inputs only. During training, we leverage two levels of noise, where the second noise level is twice that of the first, and train our network to predict clean points without using ground-truth data as labels. Extensive experimental results demonstrate that our method generalises well across synthetic and real-world noise, outperforming state-of-the-art unsupervised methods and achieving comparable performance to supervised methods. Our method nevertheless has some limitations. For instance, since our method is trained using point-based loss functions only, it may occasionally blur sharp features and produce smooth edges and corners. In order to enhance the capability of preserving sharp features, it is necessary to utilise normal information as demonstrated in the prior literature [25,42]. In future work, we aim to explore sharp feature preservation further by incorporating normal information in our denoising process.

Fig. 1 Point clouds $P$, $\tilde{P}$ and $\tilde{P}'$ and their corresponding visibility to the network. Our network can only access the noisy versions of the point clouds.

Since $M$ and $N$ are i.i.d. and are both sampled from the distribution $A$, their expectation values are equivalent, i.e., $E(M \mid \tilde{P}') = E(N \mid \tilde{P}')$. With that established, we can derive the following:

Fig. 3 The visual comparison results on a shape from the PUNet dataset (1% noise), where yellow points are further from the ground-truth mesh surfaces. The top row shows the sparse (10k) configuration and the bottom row shows the dense (50k) configuration.

Fig. 5 Comparison of the denoising results on the Paris-rue-Madame dataset.

Fig. 6 Mesh reconstruction results on the denoised point clouds.

Table 1 Quantitative results on PUNet's test set, where the best results are marked in bold. The abbreviations Op., Sup. and Unsup. represent optimisation-based, supervised, and unsupervised denoising methods, respectively. CD and P2M are both multiplied by $10^4$.

Table 2 Results on the Kinect dataset. CD and P2M are both multiplied by $10^4$.

Table 2 .
It demonstrates that our method can produce smoother point cloud surfaces compared with other methods.

Table 3 Ablation study on the validation set with different training configurations. CD and P2M are both multiplied by $10^4$.