Representative real-world and synthetic tomograms
The real-world highly sampled tomogram used to train our networks was produced at the Diamond Light Source I13-2 beamline [46] and is used in [47]. The tomogram depicts a droplet of salt water on top of a 500 micron aluminium pin with magnesium deposits. It was captured at the end of a time-series of undersampled tomograms that forms a 4D dataset, itself part of a study measuring the corrosion of the metallic pin by the salt-water droplet over time [48]. Since the tomograms captured at Diamond Light Source are \(2160 \times 2560 \times 2560\) voxels in size after reconstruction, even a single tomogram can in practice provide plenty of training instances after cropping. The use of a single tomogram (dataset) reflects the limited availability of annotated data in this domain: a set of multiple annotated tomograms cannot be representative of future tomograms, because the samples imaged in every tomogram collection are unique. Consequently, networks proposed in this domain have to be retrained on a new representative highly sampled tomogram that accompanies each new time-series of undersampled tomograms. Manual annotation of these large \(2160 \times 2560 \times 2560\) highly sampled tomograms is a difficult and time-consuming process. As we note later, we incorporate the availability of only a very limited amount of training data into the core of our method, introducing synthetic data to help overcome this limitation.
The real-world, highly sampled tomogram used in Sect. 5 is captured at the end of a time-series, and as such it was possible to capture it with a high number of projections, namely 3601 instead of the 91 projections in the 4D dataset tomograms. Collecting such “book end” representative highly sampled tomograms is common practice for researchers dealing with 4D datasets. These tomograms are acquired either before or after the capture of a 4D dataset, providing reference tomograms without interfering with the time-critical part of the study. Even though they are not part of the 4D dataset, they contain valuable information about the features of the samples under study which the researchers want to identify and label. Therefore, they can be used for training segmentation or denoising networks, which are later applied to the reconstructions of the 4D dataset’s tomograms. In addition to such real-world, highly sampled datasets, synthetic, highly sampled tomograms are later utilised for the knowledge transfer experiments, as shown in Fig. 3. Such artificial datasets allow for better generalisation of the denoising and segmentation tasks, which improves the accuracy of the networks. Furthermore, they are useful when the amount of annotated real-world tomograms is limited, which would otherwise not allow highly accurate networks to be trained.
For training, three highly sampled tomograms are used: two are synthetically constructed and one is a real-world tomogram. The real-world tomogram is reconstructed using the Savu Python package [49] and the synthetic tomograms using Kazantsev et al.’s [13] TomoPhantom software. TomoPhantom constructs the synthetic tomograms from a user-written file that lists a number of virtual objects placed on a virtual stage, analogous to how real-world objects are placed on a stage before being CT scanned. The virtual objects can be constructed to be of any shape formed from a predetermined list of geometrical shapes. The file details the shape, size and orientation of the virtual objects as well as their virtual absorption rate, which determines their brightness in the rendered synthetic reconstructions.
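To make the object-definition step concrete, the following is a minimal sketch of a single virtual object described through TomoPhantom’s Python object interface. The parameter names (`Obj`, `C0`, the half-axes `a`, `b`, `c` and the angle `phi1`) follow the library’s demo scripts and may differ slightly between TomoPhantom versions; the specific values here are illustrative, not the ones used in our experiments.

```python
from tomophantom import TomoP3D
from tomophantom.TomoP3D import Objects3D

N_size = 256  # side length of the cubic synthetic volume

# One ellipsoidal object: virtual absorption rate C0 (controls brightness),
# centre (x0, y0, z0), half-axes (a, b, c) and in-plane rotation phi1,
# all given in TomoPhantom's normalised [-1, 1] stage coordinates.
deposit = {'Obj': Objects3D.ELLIPSOID,
           'C0': 0.7,
           'x0': -0.25, 'y0': 0.1, 'z0': 0.0,
           'a': 0.15, 'b': 0.08, 'c': 0.08,
           'phi1': 30.0}

volume = TomoP3D.Object(N_size, deposit)  # render the object into a 3D array
```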
Tomogram reconstructions used for training
For each tomogram, synthetic or real-world, two reconstructions are produced. The first is made using the Filtered Back Projection (FBP) [8] algorithm and all 3601 available projections. The resulting high-quality reconstructions [9, 41, 42] provide the ground truth used in the denoising experiments and in the denoising part of Stacked-DenseUSeg. The second is made with the Conjugate Gradient Least-Squares (CGLS) [7] reconstruction algorithm and uses only 91 projections (the \(1\mathrm{st},41\mathrm{st},\ldots ,3601\mathrm{st}\) projections), ignoring the intermediate ones. These low-quality reconstructions [10, 43, 44], which are provided as input during training, have the same visual quality as the reconstructions of the undersampled tomograms in the 4D datasets, where only a small number of projections are available. Two different algorithms are used for the two reconstructions per tomogram because the iterative CGLS method improves the reconstruction quality when applied to undersampled tomograms, while it offers poorer results than the simpler FBP method when applied to highly sampled tomograms. Figures 1 and 3 provide cross sections of the two reconstructions for the real-world and synthetic tomograms, respectively, as well as their annotations, which depict the classes that the network aims to segment. The annotations [11] of the real-world tomogram are obtained with Luengo et al.’s SuRVoS [50] software tool using the high-quality reconstruction [9].
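As an illustration of how the 91-projection subset relates to the full set of 3601 projections, the following NumPy sketch selects every 40th projection (the 1st, 41st, ..., 3601st, using zero-based indices); the sinogram array name and its projection-first ordering are assumptions made for the example.

```python
import numpy as np

# 3601 projections, indexed 0..3600 (the paper's 1st..3601st).
sparse_indices = np.arange(3601)[::40]     # 0, 40, ..., 3600
assert sparse_indices.size == 91           # the 91 projections kept for CGLS

# Hypothetical sinogram of shape (n_projections, detector_rows, detector_cols):
# sinogram_sparse = sinogram_full[sparse_indices]
```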
The synthetic tomograms’ annotations are predetermined [45]; the respective tomogram projections and reconstructions are created based on them. In contrast to the real-world annotations, the synthetic annotations are not human estimations of the ground truth but the exact ground truth. This is significant because, due to imaging artefacts, even in highly sampled tomograms human annotators may unintentionally introduce bias into the data that will later be used for training, which in turn may reduce the accuracy of the final segmentation [51]. Synthetic datasets lack this potential bias, and by transferring knowledge from networks trained on synthetic data to networks that infer on real-world data, the latter may offer predictions in difficult cases that are potentially more accurate than those of human annotators. Two synthetic tomograms are used during the experiments: a more realistic one, with simulated rotating artefacts (Fig. 3a–d) that are also present in real-world data, and a less realistic one, without such artefacts (Fig. 3e–h). This effectively gives us two levels of physical simulation quality to compare. For both tomograms, as with the real-world tomogram, there are two reconstructions: a high-quality reconstruction which provides the ground truth used in the denoising experiments and in the denoising part of Stacked-DenseUSeg (Fig. 3a,c,e,g), and a low-quality reconstruction used as input during training, because it has the same visual quality as the reconstructions of the real-world undersampled tomograms (Fig. 3b,d,f,h).
There are four classes to be segmented: the air outside the base material (Label 0 in Figs. 1c, 3i), the base material, which is the water and aluminium (Label 1 in Figs. 1f, 3i), the magnesium deposits within the base material (Label 2 in Figs. 1f, 3i) and lastly the air pockets within the base material (Label 3 in Figs. 1f, 3i). In the synthetic tomograms, 5000 randomly oriented and sized virtual ellipsoids per class are randomly placed within a virtual cylinder and assigned Labels 2 and 3. These simulate the magnesium deposits and air pockets, respectively, with the quantity chosen to roughly match the class balance present in the real-world data. The virtual cylinder in turn simulates the metallic pin found in the real-world data. Note that, due to the large number of ellipsoids introduced, ellipsoids occasionally overlap, which diversifies the shapes present in the tomograms (see Fig. 3c,g,i).
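The following sketch shows one way such a random ellipsoid specification for Labels 2 and 3 could be generated; the sampling ranges, cylinder radius and intensity values are illustrative assumptions rather than the exact settings used for our synthetic tomograms.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_ellipsoids(n, intensity, cyl_radius=0.8, max_half_axis=0.05):
    """Draw n randomly sized and oriented ellipsoids whose centres lie inside
    a virtual cylinder of radius cyl_radius (normalised [-1, 1] stage
    coordinates), returning TomoPhantom-style object descriptors."""
    objects = []
    while len(objects) < n:
        x0, y0 = rng.uniform(-cyl_radius, cyl_radius, size=2)
        if x0 ** 2 + y0 ** 2 > cyl_radius ** 2:   # reject centres outside the cylinder
            continue
        z0 = rng.uniform(-0.9, 0.9)
        a, b, c = rng.uniform(0.01, max_half_axis, size=3)  # half-axes
        objects.append({'Obj': 'ELLIPSOID', 'C0': intensity,
                        'x0': x0, 'y0': y0, 'z0': z0,
                        'a': a, 'b': b, 'c': c,
                        'phi1': rng.uniform(0.0, 180.0)})    # random orientation
    return objects

magnesium_deposits = random_ellipsoids(5000, intensity=0.6)  # rendered as Label 2
air_pockets = random_ellipsoids(5000, intensity=0.0)         # rendered as Label 3
```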
Since the training of the network is performed using a single tomogram at a time, the tomograms are normalised (real-world data only; the synthetic tomograms are produced pre-normalised) and split into multiple non-overlapping \(64 \times 64 \times 64\) samples. Before entering the network, each sample is either randomly rotated by 90, 180 or 270 degrees, or horizontally or vertically mirrored. Of these samples, 70% are randomly chosen for training, 10% for validation and 20% for testing. For the real-world tomogram this results in 3723 samples used for training, 532 for validation and 1065 for testing; for the synthetic tomograms, 4646 samples are used for training, 663 for validation and 1329 for testing. The samples are cropped from the area of the tomograms that contains the desired classes to segment, because most voxels in the tomograms belong to Label 0 and sampling from the tomograms as a whole would create a great class imbalance. Fortunately, in most datasets (4D datasets or single tomograms) of this type [48] the region of interest that has to be segmented is located in a single area of the tomogram. Using simple low-level image analysis techniques (edge detection, thresholding, etc.), it is possible to isolate this area, reducing the volume that the networks have to segment and allowing for shorter inference times. During inference, the samples are chosen to be overlapping, and the final prediction is obtained by averaging the predictions in areas of overlap. The predictions, which are the probabilities of each voxel belonging to each class, are averaged and then used to infer the class of each voxel. Non-overlapping samples are used during training for a clear separation between training, validation and testing samples. During inference, however, overlapping eliminates potential border artefacts in the output subvolumes and helps resolve potential uncertainties in the class of some voxels.
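A minimal NumPy sketch of the patching, augmentation and 70-10-20 split described above, assuming the tomogram has already been cropped to the region of interest and normalised; the array names `volume` and `annotation` and the bookkeeping details are illustrative.

```python
import numpy as np

def extract_patches(volume, annotation, size=64):
    """Split a volume and its annotation into non-overlapping size**3 samples."""
    d, h, w = (s - s % size for s in volume.shape)
    pairs = []
    for z in range(0, d, size):
        for y in range(0, h, size):
            for x in range(0, w, size):
                pairs.append((volume[z:z + size, y:y + size, x:x + size],
                              annotation[z:z + size, y:y + size, x:x + size]))
    return pairs

def augment(patch, label, rng):
    """Randomly rotate by 90/180/270 degrees or mirror horizontally/vertically,
    applying the same transform to the patch and its annotation."""
    op = rng.integers(5)
    if op < 3:                                        # rotation in the axial plane
        return (np.rot90(patch, op + 1, axes=(1, 2)).copy(),
                np.rot90(label, op + 1, axes=(1, 2)).copy())
    axis = 1 if op == 3 else 2                        # vertical or horizontal mirror
    return np.flip(patch, axis).copy(), np.flip(label, axis).copy()

rng = np.random.default_rng(0)
pairs = extract_patches(volume, annotation)           # volume/annotation: cropped 3D arrays
order = rng.permutation(len(pairs))
n = len(pairs)
train = [pairs[i] for i in order[:int(0.7 * n)]]             # 70% training
val = [pairs[i] for i in order[int(0.7 * n):int(0.8 * n)]]   # 10% validation
test = [pairs[i] for i in order[int(0.8 * n):]]              # 20% testing
```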
Hyperparameters and training settings
For backpropagation, the stochastic gradient-based optimisation method Adam [52] is used, with minibatches of size 8. For DenseUSeg the learning rate is set to \(10^{-5}\), while for Stacked-DenseUSeg the corresponding learning rates for the segmentation part and the denoising part of the network are displayed in Fig. 4a. The reason for this duality of learning rates in Stacked-DenseUSeg is that the denoising part has to train faster than the segmentation part, as the latter is partially dependent on the output of the former. In all tested networks (ours and the state-of-the-art), a weight decay [53] of 0.0005 is used, which acts as an L2 regularisation that penalises the weights/parameters when they collectively get too large, which can lead to overfitting. Additionally, Fig. 4 displays the learning rate schedule both for the case of a high amount of training data (Fig. 4a) and for the case of a low amount of training data (Fig. 4b).
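For reference, this is how such a two-learning-rate Adam setup with the shared weight decay could look in PyTorch; `model.denoising` and `model.segmentation` are hypothetical submodule names, and both learning-rate values are placeholders, since the actual schedules are those of Fig. 4a. Only the weight decay of 0.0005 is the value stated in the text.

```python
import torch

# Two parameter groups so that the denoising part can train faster than the
# segmentation part; the learning rates below are placeholders (the real
# schedules are in Fig. 4a), while the weight decay of 0.0005 acts as the
# L2 regularisation described above.
optimizer = torch.optim.Adam(
    [{'params': model.denoising.parameters(), 'lr': 1e-4},
     {'params': model.segmentation.parameters(), 'lr': 1e-5}],
    weight_decay=5e-4)
```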
Each epoch lasts for 300 minibatches and the network passes through 100 epochs during training. In order to verify the accuracy of the network variations presented in Sect. 5, every training session is repeated 3 times in a threefold fashion. For each of these iterations the 70-10-20 split between training, validation and testing samples is kept; however, the validation and testing samples differ, with no samples shared between iterations. The weighted cross-entropy loss criterion is used for the segmentation output and the Mean Square Error (MSE) for the denoising output. The weights for the cross-entropy are calculated as:
$$\begin{aligned} W=median(P/S)/(P/S) \end{aligned}$$
(2)
where P is a vector of the pixel counts of the different classes and S is a vector containing the number of samples in which each class is present. The combined criterion for Stacked-DenseUSeg is:
$$\begin{aligned} L_{Combined}=L_{Cross\_Entropy} + \lambda L_{Mean\_Square\_Error} \end{aligned}$$
(3)
where \(\lambda \) balances the two losses and is empirically set to 10. The \(L_{Cross\_Entropy}\) loss uses the segmentation annotations as ground truth, while the \(L_{Mean\_Square\_Error}\) loss uses the high-quality reconstruction.
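A short PyTorch sketch of Eqs. 2 and 3, assuming four classes; the counts in `P` and `S` are illustrative numbers, not statistics of our training set.

```python
import torch
import torch.nn as nn

# Illustrative statistics of a training split: P holds the per-class pixel
# counts and S the number of samples in which each class is present (Eq. 2).
P = torch.tensor([9.0e8, 2.5e8, 1.2e7, 4.0e6])
S = torch.tensor([3723.0, 3723.0, 2900.0, 2100.0])
freq = P / S
W = torch.median(freq) / freq                 # Eq. 2: rarer classes get larger weights

cross_entropy = nn.CrossEntropyLoss(weight=W) # weighted cross-entropy (segmentation)
mse = nn.MSELoss()                            # denoising criterion
lam = 10.0                                    # empirically chosen lambda in Eq. 3

def combined_loss(seg_logits, seg_labels, denoised, clean):
    """Eq. 3: weighted cross-entropy against the annotations plus
    lambda-scaled MSE against the high-quality reconstruction."""
    return cross_entropy(seg_logits, seg_labels) + lam * mse(denoised, clean)
```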
Metrics and execution time
As a segmentation metric, the Intersection over Union (IoU), also known as the Jaccard index, is used. Specifically, the IoU metric of class l is,
$$\begin{aligned} IoU_l=\frac{|\{v|v \in C_l\}\cap \{p|p \in C_l\}|}{|\{v|v \in C_l\}\cup \{p|p \in C_l\}|}=\frac{TP_l}{TP_l+FP_l+FN_l} \end{aligned}$$
(4)
where \(v \in C_l\) are the voxels that belong to class l and \(p \in C_l\) are the voxels predicted to belong to class l. The IoU of class l can also be calculated from its second, equivalent definition, shown as the second fraction of Eq. 4. \(TP_l\) is the total number of voxels predicted to belong to class l that actually do, \(FP_l\) is the total number of voxels predicted to belong to class l that do not, and \(FN_l\) is the total number of voxels predicted to belong to any class other than l that do in fact belong to class l. Therefore, the IoU of a class is penalised both by predicting voxels to belong to other classes when they belong to the class under consideration, and by predicting voxels to belong to this class when in fact they belong to others. This makes it a good metric for segmentation, since it cannot achieve high values by overpredicting or underpredicting certain segmentation classes.
Furthermore, we also use the F1 score, also known as the Sørensen–Dice coefficient (DSC), as a segmentation metric. The F1 score of class l is,
$$\begin{aligned} F1_l=\frac{2|\{v|v \in C_l\}\cap \{p|p \in C_l\}|}{|\{v|v \in C_l\}|+|\{p|p \in C_l\}|}=\frac{2TP_l}{2TP_l+FP_l+FN_l}. \end{aligned}$$
(5)
The F1 score and the IoU both penalise the same elements described earlier, in a similar but not identical manner. We include the F1 score as it is a popular metric for the measurement of segmentation accuracy [5, 30] in certain domains, which some readers may be more familiar with.
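Both metrics can be computed per class directly from the predicted and annotated label volumes, as in the following NumPy sketch written for the four labels used here.

```python
import numpy as np

def iou_f1_per_class(pred, target, num_classes=4):
    """Per-class IoU (Eq. 4) and F1/Dice (Eq. 5) from two integer label
    volumes of identical shape."""
    scores = {}
    for l in range(num_classes):
        tp = np.sum((pred == l) & (target == l))   # correctly predicted voxels
        fp = np.sum((pred == l) & (target != l))   # wrongly claimed for class l
        fn = np.sum((pred != l) & (target == l))   # missed voxels of class l
        denom = tp + fp + fn
        iou = tp / denom if denom else float('nan')
        f1 = 2 * tp / (2 * tp + fp + fn) if denom else float('nan')
        scores[l] = {'IoU': iou, 'F1': f1}
    return scores
```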
For denoising, the Peak Signal to Noise Ratio (PSNR) metric (which is a logarithmic representation of the mean square error, see Eq. 6), and the Structural Similarity Index (SSIM) [54] (see Eq. 7) are used to quantitatively evaluate the image restoration quality compared to the ground truth (in our case the high-quality reconstructions [9, 41, 42]). Namely PSNR is,
$$\begin{aligned} PSNR({\varvec{D}},{\varvec{G}})= 10 \log _{10}\Bigg (\frac{1}{\Vert {\varvec{D}}-{\varvec{G}} \Vert ^{2} }\Bigg ) \end{aligned}$$
(6)
where \({\varvec{D}}\) is the denoised output of the network and \({\varvec{G}}\) is the denoising ground truth. The PSNR metric represents the ratio between the maximum possible power of the signal, in our case the low-quality reconstruction (produced with CGLS) of the highly sampled tomogram, and the power of the noise that affects its fidelity and distances it from the ground truth, which in our case is the high-quality reconstruction (produced with FBP) of the highly sampled tomogram. PSNR is a measure of the signal-to-noise ratio, and higher PSNR values, expressed in the logarithmic decibel scale, signify higher restoration quality. Additionally, the equation for the SSIM index is,
$$\begin{aligned} SSIM({\varvec{D}},{\varvec{G}})=\frac{\left( 2\mu _{{\varvec{D}}}\mu _{{\varvec{G}}}+(k_1L)^2\right) \left( 2\sigma _{{\varvec{DG}}}+(k_2L)^2\right) }{\left( \mu _{{\varvec{D}}}^{\,2}+\mu _{{\varvec{G}}}^{\,2}+(k_1L)^2\right) \left( \sigma _{{\varvec{D}}}^{\,2}+\sigma _{{\varvec{G}}}^{\,2}+(k_2L)^2\right) } \end{aligned}$$
(7)
where \(\mu _{{\varvec{D}}}\) is the mean of \({\varvec{D}}\), \(\mu _{{\varvec{G}}}\) is the mean of \({\varvec{G}}\), \(\sigma _{{\varvec{D}}}^{\,2}\) is the variance of \({\varvec{D}}\), \(\sigma _{{\varvec{G}}}^{\,2}\) is the variance of \({\varvec{G}}\), \(\sigma _{{\varvec{DG}}}\) is the covariance of \({\varvec{D}}\) and \({\varvec{G}}\), L is the dynamic range of the voxel values, and by default \(k_1=0.01\) and \(k_2=0.03\). The SSIM index, in contrast to PSNR, which estimates the signal-to-noise ratio and visual quality based on absolute errors, is a perception-based metric. It treats image degradation as a perceived decrease in quality, but ignores illumination and contrast alterations that do not cause structural changes in what is imaged. It captures the inter-dependencies between spatially close pixels, and by estimating how many of them remain unchanged between the networks’ denoising predictions and the ground truth, it measures the perceptual improvement in quality.
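For completeness, a NumPy transcription of Eqs. 6 and 7, assuming reconstructions normalised so that the peak intensity (and hence L) is 1 and that the norm in Eq. 6 denotes the mean squared error; note that SSIM is evaluated globally here, whereas practical implementations usually average it over local windows.

```python
import numpy as np

def psnr(denoised, ground_truth):
    """Eq. 6: PSNR in decibels for intensities normalised to [0, 1]."""
    mse = np.mean((denoised - ground_truth) ** 2)
    return 10.0 * np.log10(1.0 / mse)

def ssim(denoised, ground_truth, k1=0.01, k2=0.03, L=1.0):
    """Eq. 7 evaluated globally over the two volumes."""
    mu_d, mu_g = denoised.mean(), ground_truth.mean()
    var_d, var_g = denoised.var(), ground_truth.var()
    cov = np.mean((denoised - mu_d) * (ground_truth - mu_g))
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    return (((2 * mu_d * mu_g + c1) * (2 * cov + c2)) /
            ((mu_d ** 2 + mu_g ** 2 + c1) * (var_d + var_g + c2)))
```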
Pseudocode for our method
Based on the information provided by the earlier subsections, Algorithm 1 shows the pipeline for the training, validation and testing processes of our network Stacked-DenseUSeg.
Algorithm 1 also describes how we train, validate and test DenseUSeg and the other networks that we use as baselines in the following section. The only difference is the omission of the input \({\varvec{D}}\) if the network will be used for segmentation, or of the input \({\varvec{A}}\) if it will be used for denoising. Also, depending on the operation (denoising or segmentation), the corresponding loss function is used (mean square error or weighted cross-entropy, respectively). Algorithm 2 describes the inference process for Stacked-DenseUSeg.
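The overlap-averaging step of the inference can be summarised by the following simplified PyTorch sketch; the half-patch stride, the assumption that the model returns only segmentation logits, and the omission of edge handling are simplifications relative to Algorithm 2.

```python
import numpy as np
import torch

def predict_volume(model, volume, size=64, stride=32, num_classes=4, device='cuda'):
    """Slide an overlapping 64**3 window over the volume, accumulate the
    per-class probabilities, average them where windows overlap and take the
    most probable class per voxel (edge handling omitted for brevity)."""
    d, h, w = volume.shape
    probs = np.zeros((num_classes, d, h, w), dtype=np.float32)
    counts = np.zeros((d, h, w), dtype=np.float32)
    model.eval()
    with torch.no_grad():
        for z in range(0, d - size + 1, stride):
            for y in range(0, h - size + 1, stride):
                for x in range(0, w - size + 1, stride):
                    patch = torch.from_numpy(np.ascontiguousarray(
                        volume[z:z + size, y:y + size, x:x + size])).float()
                    logits = model(patch[None, None].to(device))   # (1, C, 64, 64, 64)
                    p = torch.softmax(logits, dim=1)[0].cpu().numpy()
                    probs[:, z:z + size, y:y + size, x:x + size] += p
                    counts[z:z + size, y:y + size, x:x + size] += 1.0
    probs /= np.maximum(counts, 1.0)           # average the overlapping predictions
    return probs.argmax(axis=0)                # per-voxel class labels
```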
Training for 100 epochs takes approximately 18 hours in PyTorch [55] on 4 NVIDIA Tesla V100s. Naturally, a termination condition can be employed during training/fine-tuning with lower amounts of training samples, since the best-performing epoch comes early (see Fig. 7c–e), reducing the time needed for training even further.
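A minimal sketch of such a termination condition, monitoring the validation IoU with a patience counter; `train_one_epoch`, `mean_validation_iou` and the patience value are hypothetical, shown only to illustrate the idea.

```python
import copy

patience, stale = 10, 0                     # assumed patience value
best_iou, best_state = -1.0, None
for epoch in range(100):
    train_one_epoch(model, optimizer)       # hypothetical training helper
    iou = mean_validation_iou(model)        # hypothetical validation helper
    if iou > best_iou:
        best_iou, best_state, stale = iou, copy.deepcopy(model.state_dict()), 0
    else:
        stale += 1
        if stale >= patience:               # no improvement for `patience` epochs
            break
model.load_state_dict(best_state)           # keep the best-performing epoch
```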