Introduction

Deep learning (DL) algorithms are a subdomain of artificial intelligence (AI) that use a highly generalizable approach to recognize and interpret images,1 enabling efficient identification of material properties.2 AI has been applied to two-dimensional (2D) materials to analyze their optical,2,3 physical,4 and electronic properties.5,6,7 Electronic properties, such as bandgaps and electron affinities, have been predicted using machine learning (ML) and DL models based on the structure–property relationship. Segmentation,8,9 thickness identification, and point defects10,11,12,13 have been analyzed based on DL modeling of the crystal structures and bandgaps of materials. A three-dimensional (3D) convolutional neural network called DL-enabled atomic layer mapping (DALM)14 has been used to identify and segment MoS2 flakes with mono-, bi-, tri-, and multilayers. An encoder-decoder semantic segmentation network15 has been configured for pixel-wise identification of optical images of 2D materials along with graphical features, such as contrast, color, edges, shapes, flake sizes, and their distributions. Similarly, a DL-based atomic defect detection framework (DL-ADD)13 has been demonstrated to efficiently detect atomic defects in MoS2 and generalize to defect detection in other transition-metal dichalcogenide (TMD) materials. The three DL architectures DenseNet,16 U-Net,17 and Mask-region convolutional neural network (RCNN)18 have been studied to classify, segment, and detect microscopic images of 2D materials for automated atomic layer mapping,3 although they demand many data points to train the networks for characterizing optical images. An ML-based solution19 has been modeled to map simulation results from indentation pillar-splitting experiments and predict the critical indentation load of fracture instability using Gaussian process regression. Notably, image-to-image translation20,21,22,23 using conditional generative adversarial networks (cGANs) has been studied for translating optically sectioned structured illumination microscopy (SIM) images, semantic segmentation,24,25 and image processing.26 A game theory-based cGAN26 has also been demonstrated to predict physical fields, such as stress or strain, from the material microstructure geometry. While cGANs work well with limited data to capture complex information from pixels, the application of pix2pix to characterizing TMDs remains unexplored to date.

Here, we demonstrate a DL-based image-to-image translation approach with cGANs, trained on labeled optical images to enable intelligent characterization of mechanically exfoliated and CVD-grown TMDs. Unlike other AI-based research on TMDs, this method requires only limited data to train and evaluate the model. To ensure that our DL model effectively learns the complex variations of pixels and accurately maps them to TMD thicknesses, we utilize experimental data obtained from Raman and PL spectroscopy. These data associate layer information with individual pixels and assign specific colors to represent different layers. We preprocess the data for training and train a pix2pix model to generate labeled images from optical images, thereby identifying the number of layers in TMDs. To assess the performance of the model, we conduct quantitative measurements using structural similarity index measure (SSIM), peak signal-to-noise ratio (PSNR), and mean squared error (MSE) scores. We further investigate the generalization ability of the model by training it on MoS2 and WS2 samples and successfully testing it on WSe2 samples, demonstrating its capability to adapt to different materials. Finally, we apply the model to characterize heterostructures, highlighting its ability to analyze complex material structures.

Results and discussion

Synthesis and characterizations of TMDs

Figure 1 illustrates the workflow of multimodal analysis of TMDs using DL-based cGANs. TMDs were transferred (via mechanical exfoliation) or synthesized (via low-pressure CVD [LPCVD]) on 300-nm SiO2/Si substrates (see the “Materials and Methods” section) with varying numbers of layers. We characterized the samples using Raman spectroscopy, photoluminescence (PL) spectroscopy, and atomic force microscopy (AFM) to verify the growth of the materials and determine the number of layers. Figure 1c–d shows, as an example, the Raman and PL spectra of three- and four-layer as well as bulk (more than four layers) mechanically exfoliated MoS2. All spectra were taken with 532-nm excitation. Figure 1c shows the E12g (in-plane vibration of Mo and S atoms) and A1g (out-of-plane vibration of S atoms) phonon modes of mechanically exfoliated MoS2. The phonon modes of three-layer (3L) MoS2 are located at 382.02 cm−1 (E12g) and 405.54 cm−1 (A1g).

Figure 1

Process of multimodal analysis of transition-metal dichalcogenides. (a) Optical image of mechanically exfoliated MoS2. (b) Labeled image of MoS2. (c) Raman spectra of three and four layers, and bulk MoS2. (d) PL spectra of three and four layers, and bulk MoS2. (e) Workflow of image-to-image translation using conditional generative adversarial networks.

Similarly, the phonon modes of bulk MoS2 (more than four layers) are located at E12g = 381.62 cm−1 and A1g = 407.09 cm−1. The out-of-plane A1g peak blue-shifts from 405.54 cm−1 to 407.09 cm−1 in going from three layers to the bulk, consistent with the increased number of layers.27 Accordingly, the Raman shift between the in-plane and out-of-plane peaks increases from ∼23.52 cm−1 for three layers (3L) to ∼25.47 cm−1 for the bulk (more than four layers).27 Figure 1d presents the PL spectra of mechanically exfoliated MoS2. As observed by others, the PL intensity of the three-layer sample was much higher than that of the other two samples (four layers or thicker).27,28 Figure 1e shows the architecture of the cGAN model used to characterize TMDs. The labeled-images section shows images labeled according to the number of layers identified from the Raman and PL spectra. A cGAN comprises a generator and a discriminator. The generator takes optical images as input and generates images, which are subsequently fed to the discriminator along with the labeled images. The discriminator then compares both images and returns its output to the generator.

Data preprocessing

Multiple preprocessing steps were applied to the collected optical images to improve the image quality before they were fed to the model. This procedure addressed the potential deterioration of images captured by an optical microscope, including uneven lighting and gradual degradation of the camera sensor. First, we applied median filtering,29 a denoising technique, to smooth the optical images and generate denoised images, followed by Gaussian filtering30 to smooth the images further. The Gaussian average of the neighboring pixels of each pixel is calculated by

$$\text{Gaussian}\left( x,y \right) = \frac{1}{2\pi \sigma^{2}}\, e^{-\frac{x^{2} + y^{2}}{2\sigma^{2}}}$$
(1)

where x and y refer to the pixel location in the image and σ is the standard deviation of the Gaussian kernel. This process generated blurred images and removed high-frequency noise from the images. Figure 2a, d shows the denoised images of CVD-grown WS2 and exfoliated MoS2 after applying the median and Gaussian filtering operations. We produced sharpened images by blending the denoised images with a positive weight of 1.5 and the blurred images with a negative weight of 0.5. This process enhanced the edges and other details in the images. We then normalized the pixel values of the sharpened images to the range [0, 255], calculated per pixel by

$$pixel\,Normalized\, = \,\frac{pixel\,Val - pixel\,Min}{{pixel\,Max - pixel\,Min}},$$
(2)

where pixel Val is the actual value of the pixel, and pixel Min and pixel Max are the minimum and maximum values of all pixels in the images, respectively. Figure 2b, e shows the final normalized images after applying median filtering, Gaussian filtering, sharpening and normalization processes. This process ensures that the data are within a consistent scale for a faster convergence during training, leading to less training time to improve the efficacy of the model’s generalization.
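For reference, this preprocessing pipeline can be sketched compactly with OpenCV. This is a minimal sketch, assuming illustrative kernel sizes, Gaussian σ, and file names (none of which are reported above); the blending weights of 1.5 and −0.5 follow the values stated in the text.

```python
# Minimal preprocessing sketch (assumed parameters: kernel sizes, sigma, and
# file names are illustrative; blending weights follow the text).
import cv2
import numpy as np

def preprocess(optical_bgr: np.ndarray) -> np.ndarray:
    # Median filter to remove salt-and-pepper noise from the optical image.
    denoised = cv2.medianBlur(optical_bgr, 5)
    # Gaussian filter (Equation 1) to remove remaining high-frequency noise.
    blurred = cv2.GaussianBlur(denoised, (5, 5), 2.0)
    # Unsharp masking: blend denoised (+1.5) and blurred (-0.5) images.
    sharpened = cv2.addWeighted(denoised, 1.5, blurred, -0.5, 0)
    # Normalize pixel values to [0, 255] (Equation 2, rescaled to 8 bits).
    pix = sharpened.astype(np.float32)
    normalized = (pix - pix.min()) / (pix.max() - pix.min() + 1e-8) * 255.0
    return normalized.astype(np.uint8)

image = cv2.imread("mos2_optical.png")  # hypothetical file name
cv2.imwrite("mos2_preprocessed.png", preprocess(image))
```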

Figure 2

Optical image preprocessing steps of chemical vapor deposition (CVD)-grown WS2 and mechanically exfoliated MoS2. (a, d) Denoised images of CVD-grown WS2 and exfoliated MoS2, respectively. (b, e) Normalized images of CVD-grown WS2 and exfoliated MoS2, respectively. (c, f, i) HSV (H-hue, S-saturation, V-value) color space of the optical images in (a), (d), and (g). (g) Optical image with a reference line across pixels containing substrate, monolayers, and bilayers. (h) Detection of flakes. (j) Color profile along the reference line shown in (g). (k) Area distribution of the optical image (top right).

Color-based segmentation for detecting TMDs

We further performed color-based segmentation to verify the presence of TMDs in the optical images by converting them from RGB (R-red, G-green, B-blue) to HSV31 (H-hue, S-saturation, V-value) color space, separating each image into three components—hue, saturation, and value. We then created a mask based on the hue component of the HSV color space, retaining the color information of the original images for pixels that meet the hue criteria while setting all other pixels to zero. Figure 2c, f, and i shows the HSV color space of the optical images, with the scale bars displaying the hue component of the images. Figure 2h shows the detected flakes bounded by rectangles, and Figure 2g shows a reference line crossing pixels containing substrate, monolayers, and bilayers. We generated color profiles of the red, green, and blue channels along the reference line. Figure 2j shows these color profiles, where the deviation of the red channel within the small circle indicates the presence of a bilayer.
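A minimal sketch of this segmentation step with OpenCV is shown below; the hue and value thresholds and the file name are assumptions (they depend on the substrate and illumination), and only the masking logic follows the description above.

```python
# Color-based segmentation sketch; hue/value thresholds are placeholders.
import cv2
import numpy as np

def flake_mask(optical_bgr: np.ndarray,
               hue_lo: int = 90, hue_hi: int = 150, val_lo: int = 40) -> np.ndarray:
    hsv = cv2.cvtColor(optical_bgr, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(hsv)
    # Keep pixels whose hue falls inside the flake range and that are bright
    # enough; all other pixels are set to zero.
    mask = ((h >= hue_lo) & (h <= hue_hi) & (v >= val_lo)).astype(np.uint8)
    return cv2.bitwise_and(optical_bgr, optical_bgr, mask=mask)

segmented = flake_mask(cv2.imread("ws2_optical.png"))  # hypothetical file name
```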

Data labeling

For layer identification in TMDs, the collected optical images were manually annotated using Labelbox,32 a web-based labeling tool for annotating data with a well-defined ontology. It provides a set of built-in web services that can be used to automate the process on batches of data. Five classes were used to label each image pixel as mono-, bi-, tri-, or four layers, or bulk (more than four layers), and each class was assigned one specific color. Pixels belonging to monolayers, bilayers, three layers, four layers, and more than four layers were colored blue, green, red, cyan, and light gray, respectively. Figure 3a shows a set of labeled images. Figure 3b–c shows an optical image and the corresponding labeled image of mechanically exfoliated MoS2. Figure 3d shows the mask images for each class and the legend used for coloring the image pixels. Figure 3e–f depicts 3D plots of pixel intensities to visualize the pixels before and after labeling. Once labeled, each image was paired with its respective labeled image to be fed into the model (Figure 4).
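The class-to-color convention can be expressed as a small lookup table that converts a per-pixel class map into a labeled image. The sketch below is illustrative: the integer class indices and the background color are assumptions, while the layer colors follow the convention stated above.

```python
# Class-to-color lookup; indices and background color are assumptions.
import numpy as np

LAYER_COLORS = {
    0: (0, 0, 0),        # background/substrate (assumed black)
    1: (0, 0, 255),      # monolayer -> blue
    2: (0, 255, 0),      # bilayer -> green
    3: (255, 0, 0),      # trilayer -> red
    4: (0, 255, 255),    # four layers -> cyan
    5: (211, 211, 211),  # bulk (more than four layers) -> light gray
}

def class_map_to_rgb(class_map: np.ndarray) -> np.ndarray:
    """Convert a 2D per-pixel class-index map into an RGB labeled image."""
    rgb = np.zeros((*class_map.shape, 3), dtype=np.uint8)
    for idx, color in LAYER_COLORS.items():
        rgb[class_map == idx] = color
    return rgb
```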

Figure 3

Optical image labeling. (a) Grid of labeled optical images. (b) Optical image of MoS2. (c) Manually labeled image of MoS2. (d) Mask images of each layer in the image (c). (e) Scatterplot of pixels of the optical image in red-green-blue color space shown in the image (c). (f) Scatterplot of pixels of the labeled image.

Figure 4

Model architecture of the conditional generative adversarial network. The architecture includes a generator and a discriminator in an adversarial training framework trained on an NVIDIA GeForce GTX 1080 graphics processing unit (GPU). The generator (based on the U-Net architecture) transforms input images into labeled images with a resolution of 780 × 588 pixels. The discriminator (based on the PatchGAN architecture) distinguishes between actual labeled images and generated images and provides an output of 0 (fake) or 1 (real) that is subsequently backpropagated to the generator.

Model architecture and training

The pix2pix model, a cGAN designed explicitly for image-to-image translation, comprises two models: a generator and a discriminator. The generator takes an optical image as input and transforms it into another image, which, along with the corresponding labeled image, is fed into the discriminator model, which compares the similarity between the two images. The generator is an encoder-decoder model based on the U-Net17 architecture. The encoder encodes the input image and extracts its features, while the decoder maps the extracted features back to an image of the original size. The discriminator is based on the PatchGAN23 architecture, which provides a binary output of 0 or 1 to indicate whether the generated image is fake or real. PatchGAN discriminates local image patches rather than the entire image. This approach allows for finer-grained analysis of image details and provides more precise feedback to the generator. In image-to-image translation tasks like ours, where optical features, including contrast variations, thickness variations, and colors, are crucial, PatchGAN's localized discrimination helps capture intricate features accurately. Operating on image patches also enables the discriminator to handle high-resolution outputs. In our case, where the goal is to generate detailed annotations for optical images of TMDs, the ability to produce high-resolution output maps is essential for preserving image quality and capturing optical features. The generator and discriminator are stacked together and updated in alternation during training. The generator is updated to minimize a loss that combines the adversarial loss from the discriminator with an L1 term between the generated image and the labeled image, calculated by

$$\begin{gathered} G_{loss} = \text{Adversarial loss} + \left( \lambda \times L1\ \text{loss} \right) \\ G_{loss} = \frac{1}{m}\sum\limits_{i = 1}^{m} \log \left( 1 - D\left( G\left( x_{i} \right) \right) \right) + \left( \lambda \times MAE\left( G\left( x_{i} \right), y_{i} \right) \right), \end{gathered}$$
(3)

where λ is a hyperparameter, the L1 loss is the mean absolute error (MAE) between the generated and labeled images, y is the labeled image, G(x) represents the generated image, D(·) is the discriminator output, Gloss represents the generator loss, m is the number of training samples, and the adversarial loss is a sigmoid cross-entropy loss. Similarly, the discriminator loss is calculated by

$$D_{loss} = \frac{1}{m}\sum\limits_{i = 1}^{m} \log \left( D\left( y_{i} \right) \right) + \log \left( 1 - D\left( G\left( x_{i} \right) \right) \right),$$
(4)

where Dloss is the discriminator loss, y is the labeled image, and G(x) represents the generated image. Overall, the conditional generative adversarial network simultaneously maximizes the discriminator loss and minimizes the generator loss to generate the required result. The combined loss function is given by

$$Combined\;loss_{G,D} = E_{x,y}\left[ \log D\left( x,y \right) \right] + E_{x}\left[ \log \left( 1 - D\left( x,G\left( x \right) \right) \right) \right],$$
(5)

where Ex,y denotes the expectation over pairs of input images and their corresponding labeled images, D(x,y) is the discriminator output for such a pair, Ex denotes the expectation over input images, and D(x,G(x)) is the discriminator output for a pair consisting of an input image and the generator output. The final loss function of the network is calculated by

$$G^{*}, D^{*} = \arg \mathop {\min }\limits_{G} \mathop {\max }\limits_{D} Combined\,loss_{G,D} + \left( \lambda_{2} \times MSE\left( G \right) \right),$$
(6)

where MSE(G) is the mean squared error loss of the generator, λ2 is a hyperparameter, and the min and max operators represent simultaneously minimizing the objective with respect to the generator and maximizing it with respect to the discriminator.
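In a TensorFlow implementation, the objectives in Equations 3–5 correspond to the standard pix2pix losses. The sketch below is not the authors' code; it assumes sigmoid cross-entropy on logits and an illustrative λ = 100 (the exact value of λ is not reported here).

```python
# Standard pix2pix losses (sketch); the lambda value is an assumption.
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def generator_loss(disc_fake_output, generated, labeled, lam=100.0):
    # Adversarial term: push the discriminator to rate generated images as real.
    adversarial = bce(tf.ones_like(disc_fake_output), disc_fake_output)
    # L1 term: mean absolute error between generated and labeled images.
    l1 = tf.reduce_mean(tf.abs(labeled - generated))
    return adversarial + lam * l1

def discriminator_loss(disc_real_output, disc_fake_output):
    # Real labeled images should be classified as real (1),
    # generated images as fake (0).
    real_loss = bce(tf.ones_like(disc_real_output), disc_real_output)
    fake_loss = bce(tf.zeros_like(disc_fake_output), disc_fake_output)
    return real_loss + fake_loss
```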

Prediction and performance evaluation

A set of 100 preprocessed optical images and their corresponding manually labeled images was fed to the model for training. Before feeding the images to the model, we resized them to 512 × 512 pixels to ensure compatibility with the model architecture. Additionally, we employed augmentation techniques, such as flipping and rotation, to further enhance the training data set's diversity and robustness. We chose the adaptive moment estimation (Adam)33 optimizer, an efficient stochastic optimization method that requires only first-order gradients with minimal memory requirements. We set a learning rate of 0.0002, a first momentum term (β1) of 0.5, and a second momentum term (β2) of 0.999 for the Adam optimizer. We initially set the total number of training iterations (epochs) to 250 with a batch size of 2 and stopped training at 200 epochs after observing a negligible difference between the manually labeled image and the final predicted image. Figure 5a shows the input image, the manually labeled image as ground truth, and the generated image at different epochs. At epoch 10, the generated image differs noticeably from the ground truth and is of poor quality.
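A sketch of the corresponding training configuration is given below, using the input size, batch size, epoch count, and Adam hyperparameters reported above; the specific augmentation operations (random horizontal flip and 90° rotation) are simple illustrative choices for "flipping and rotation."

```python
# Training configuration sketch; augmentation choices are illustrative.
import tensorflow as tf

IMG_SIZE, BATCH_SIZE, EPOCHS = 512, 2, 200

def augment(optical, labeled):
    # Resize both images to the model input resolution.
    optical = tf.image.resize(optical, [IMG_SIZE, IMG_SIZE])
    labeled = tf.image.resize(labeled, [IMG_SIZE, IMG_SIZE])
    # Random horizontal flip applied identically to both images.
    if tf.random.uniform(()) > 0.5:
        optical = tf.image.flip_left_right(optical)
        labeled = tf.image.flip_left_right(labeled)
    # Random 90-degree rotation applied identically to both images.
    k = tf.random.uniform((), minval=0, maxval=4, dtype=tf.int32)
    optical, labeled = tf.image.rot90(optical, k), tf.image.rot90(labeled, k)
    return optical, labeled

# Adam optimizers with the reported learning rate and momentum terms.
gen_opt = tf.keras.optimizers.Adam(learning_rate=2e-4, beta_1=0.5, beta_2=0.999)
disc_opt = tf.keras.optimizers.Adam(learning_rate=2e-4, beta_1=0.5, beta_2=0.999)
```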

Figure 5

Model training, loss, and accuracy. (a) Training examples showcasing input images, corresponding ground truth, and model-generated output images. (b) Loss curves depicting the training progression of the generator and discriminator in a pix2pix model. (c–e) Evaluation metrics to compare the generated result with the ground truth using structural similarity index measure (SSIM), peak signal-to-noise ratio (PSNR), and mean squared error (MSE), respectively.

Similarly, at epoch 100, the generated image looks similar to the ground truth, but there are still some overlaps between the predicted annotation layers. Training was stopped at epoch 200 because the difference between the manually labeled and generated images had become negligible. Figure 5b depicts the loss curves during training, where the top curve illustrates the generator GAN loss, which is the adversarial loss of the generator. The generator aims to minimize this loss by generating images indistinguishable from the real labeled images. The second curve shows the feature matching loss, which indicates the stability and quality of the generated images. The bottom two curves represent the discriminator's fake loss and real loss, which measure how well the discriminator classifies generated (fake) images as fake and real labeled images as real.

To quantitatively assess the quality of the generated images, we computed SSIM,34 PSNR,35 and MSE36 scores for each pair of labeled and generated images during training. Figure 5c–e illustrates the plots of these scores for each training iteration. The SSIM compares the structural information in the labeled and generated images by considering luminance, contrast, and structure. It produces a value between −1 and 1, where 1 indicates a perfect match and 0 indicates no similarity. The PSNR compares the noise level and image distortion between images; a higher value indicates better image quality. The MSE calculates the mean squared difference between the labeled and generated images, where a value near 0 indicates a perfect match and a higher value indicates dissimilarity between the images. Table S1 in the Supporting Information compiles exemplary ML-based studies8,14,15,23,37,38 and a numerical comparison of the methods utilized.
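These three metrics can be computed with scikit-image. The sketch below assumes uint8 image arrays of equal shape and scikit-image ≥ 0.19 (for the channel_axis argument); variable names are illustrative.

```python
# Evaluation-metric sketch using scikit-image (assumed uint8 RGB arrays).
import numpy as np
from skimage.metrics import (structural_similarity,
                             peak_signal_noise_ratio,
                             mean_squared_error)

def evaluate_pair(labeled: np.ndarray, generated: np.ndarray) -> dict:
    return {
        "SSIM": structural_similarity(labeled, generated, channel_axis=-1),
        "PSNR": peak_signal_noise_ratio(labeled, generated),
        "MSE": mean_squared_error(labeled, generated),
    }
```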

Additionally, to find the minimum amount of data needed for training, we trained and evaluated the cGAN model using different data set sizes, including 50, 75, and 100 optical images. Figure 6 shows the impact of data set size on the performance of the model. Figure 6a–c displays the discriminator loss curves of the model trained on 50, 75, and 100 optical images, respectively. Notably, when trained with 100 images, there is a clearer overlap between the discriminator’s fake and real loss curves compared to the other data set sizes. This overlap suggests that the generator has effectively learned to produce images that closely resemble real ones, making them challenging for the discriminator to distinguish.

Figure 6

Impact of data set size on model performance. (a–c) Loss curves depicting the training progression of the discriminator in a pix2pix model trained with 50, 75, and 100 optical images, respectively. (d–f) Evaluation metrics to compare the generated result with the ground truth in a model trained with 50, 75, and 100 optical images using the peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and mean squared error (MSE), respectively.

Figure 6d–f shows the computed PSNR, SSIM, and MSE scores for each pair of labeled and generated images during training, corresponding to models trained on 50, 75, and 100 optical images. For the SSIM metric, denoted in red, the model trained with 100 images exhibits the highest values, approaching 1. Similarly, the PSNR metric demonstrates its peak value for the model trained with 100 images. Regarding the MSE metric, its value should decrease during training, ideally approaching 0. This trend is observed in the model trained with 100 images, where the MSE values are closer to 0, indicating superior performance in image translation compared to models trained on fewer images.

Model generalization

We further investigated the model's generalization by training it on MoS2 and WS2 and testing it on WSe2. The model successfully identified the layers of WSe2, demonstrating its capability to accurately determine the number of layers across different materials. To evaluate the model's performance, we tested it on multiple test images of CVD-grown and mechanically exfoliated samples of MoS2, WS2, and WSe2. Figure 7a displays input images and their corresponding predicted images from the model. The generated images are labeled by the model with different colors, each indicating a specific number of layers. Figure 7b shows the ability of the model to analyze heterostructures, where the white dotted triangles in the input image indicate the stacking of WS2 on top of MoS2, which our model classified as bilayers. The bottom part of Figure 7c displays the Raman spectra, where the phonon modes of MoS2 are located at E12g = 380.35 cm−1 and A1g = 399.98 cm−1, and the phonon modes of WS2 are observed at E12g = 349.89 cm−1 and A1g = 414.22 cm−1. As part of the training data set, Raman and PL data were utilized to identify the number of layers in different regions of the optical images. The Raman data confirm that the trained model can identify the number of layers in the heterostructures.

Figure 7

Model-generated results for mono-, bi-, tri-, four-layer, and bulk flakes, as well as heterostructures (WS2/MoS2). (a) Generated output for three different examples from top to bottom. (b) Generated output for heterostructures (WS2/MoS2) along with Raman spectra. (c) WS2/MoS2 data show four peaks, indicating a bilayer (a monolayer of WS2 stacked on a monolayer of MoS2). Similarly, the MoS2 and WS2 data show the presence of MoS2 and WS2 monolayers, respectively.

Unlike other AI-based research on TMDs, the DL-based image-to-image translation method introduced here does not require a large amount of data to train and evaluate the model. As shown in Figure 7, despite being trained with limited data, our model can identify the number of layers in various TMD types, demonstrated with MoS2, WS2, and WSe2. Moreover, our method works on heterostructures, demonstrating the generalizability of the model. This approach can be further extended to characterize other TMDs, illustrating the adaptability and scalability of the model, as demonstrated in Figure 7. The adversarial relationship between the generator and discriminator helps achieve better results in multimodal transformation tasks and enables the model to extract information from complicated data distributions.

Figure 8 illustrates the comparison between model-generated results and manually labeled images. We utilized various CVD-grown and mechanically exfoliated samples as test samples and produced the results accordingly. In Figure 8, the first column displays the input optical image, the second column depicts the ground truth, which is the manually labeled image, the third column exhibits the model-generated image, and the fourth column showcases the absolute error, representing the difference between the images in columns 2 and 3.

Figure 8

Model prediction results and comparison with ground truth (manually labeled optical image) for mono-, bi-, tri-, four-, and bulk flakes. The absolute error, representing the difference between manually labeled (ground truth) images and model-generated images, is negligible.

Conclusion

We have demonstrated a DL-based pix2pix cGAN network to identify and characterize TMDs with different layer numbers, sizes, and shapes. The network was trained using a small set of labeled optical images, translating optical images of TMDs into labeled images that map each layer to a specific color and give a visual representation of the number of layers. As part of the data preprocessing, multiple segmentation techniques were implemented to extract graphical features from the optical images, including contrast, color, shapes, flake sizes, and their distributions. Furthermore, the trained model was adapted to characterize 2D materials not initially included in the data set. The performance of the model was assessed by multiple metrics, including SSIM, PSNR, and MSE scores. In contrast to deep convolutional neural networks, our model overcomes the lack of generalization typically encountered when training with smaller data sets. Our model is based solely on optical images, capturing complex pixel variations, categorizing layers into five classes, and demonstrating adaptability across a diverse range of materials.

Materials and methods

Sample preparation

TMDs were synthesized via LPCVD. Prior to the growth of MoS2, a thin MoO3 layer was prepared by physical vapor deposition of MoO3 onto a Si substrate with 300-nm-thick thermal oxide. Another SiO2/Si substrate was placed face-to-face in contact with the MoO3-deposited substrate, and MoS2 was grown onto this SiO2/Si substrate. For the growth, the furnace was heated at a ramping rate of 18°C min−1 and held for 15 min at 850°C. During the heating procedure, argon gas (30 sccm) was supplied at 300°C, and hydrogen gas (15 sccm) was delivered at 760°C. Sulfur was supplied when the furnace temperature reached 790°C. After the growth, MoS2 monolayers a few millimeters in size were obtained. Similarly, for the growth of WS2, we used WO3 instead of MoO3. As the furnace was ramped at 15°C/min, the reaction proceeded via reduction of WO3 by hydrogen and subsequent sulfurization of the WO3. The growth temperature was 900°C. Ar gas was introduced from 150°C to reduce moisture and ambient gas, and H2 gas was supplied from 650°C (increasing temperature) to 700°C (decreasing temperature). Before the growth of WSe2, the SiO2 substrate was dipped in a 10% KOH solution for 3 min to increase the surface energy, followed by a deionized water treatment. The growth temperature was set at 850°C. Ar gas was introduced at 50°C, and H2 gas was supplied at 650°C. Selenium was supplied when the furnace temperature reached 590°C. After the growth, WSe2 flakes a few millimeters in size were obtained. MoS2 and WSe2 crystals were also mechanically exfoliated onto SiO2/Si substrates using adhesive tape. WS2/MoS2 heterostructures were fabricated by transferring WS2 onto CVD-grown MoS2: as-grown WS2 flakes on a Si/SiO2 substrate were coated with a thin layer of PMMA 950 A4 using a dropper and then left in air at room temperature (RT) for 2 h to drive off the solvent. The chips were floated in 30% KOH (aq); after 10–40 min, the Si chip fell to the bottom, leaving the PMMA + WS2 square floating on the surface. The PMMA was cleaned in filtered DI water and blow-dried with RT air. The PMMA + WS2 was then transferred onto CVD-grown MoS2 by placing the WS2 side down, and the PMMA was removed with warm acetone at 60°C for 30 min. The samples were rinsed with warm acetone and then annealed in an ultrahigh-vacuum chamber on top of a button heater (Heatwave Laboratories) for 4 h at 350°C to fully remove any residual polymer contamination.

Data acquisition

We utilized an optical microscope and PL and Raman spectroscopy to characterize the layers of the deposited TMDs. Specific areas consisting of crystals with different layer numbers were captured using the optical microscope. Raman and PL spectra were obtained with a 532-nm excitation laser at a laser power of <500 μW to avoid damage to the samples. A spectral grating with 1800 lines/mm was used for both measurements. We chose a 100× objective lens to capture the images.14

Data processing

The mask of the optical image is calculated by

$$mask\, = \,uppermask\, \times \,lowermask\, \times \,valuemask,$$
(7)

where the values of the uppermask and lowermask depend on the hue component, and the valuemask depends on the value component of the HSV images. The resulting mask was converted to gray scale to create binary images: pixels with values greater than 0 were set to true, while all other pixels were set to false. Each connected component was then assigned a unique label, and the resulting labeled image was displayed using a colormap to visualize the different flakes present in the image.
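A minimal sketch of this masking and flake-labeling step is shown below. The hue and value thresholds are placeholders; only the multiplication of the three masks (Equation 7) and the connected-component labeling follow the description above.

```python
# Masking (Equation 7) and connected-component labeling sketch;
# hue/value thresholds are placeholders.
import cv2
import numpy as np
from scipy import ndimage

def label_flakes(optical_bgr: np.ndarray,
                 hue_lo: int = 90, hue_hi: int = 150, val_lo: int = 40):
    hsv = cv2.cvtColor(optical_bgr, cv2.COLOR_BGR2HSV)
    h, v = hsv[..., 0], hsv[..., 2]
    upper_mask = (h <= hue_hi).astype(np.uint8)
    lower_mask = (h >= hue_lo).astype(np.uint8)
    value_mask = (v >= val_lo).astype(np.uint8)
    mask = upper_mask * lower_mask * value_mask      # Equation 7
    binary = mask > 0                                # gray scale -> binary
    labeled, num_flakes = ndimage.label(binary)      # unique label per flake
    return labeled, num_flakes
```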

Model setup and training

The pix2pix model was implemented using the TensorFlow open-source DL package. In total, 35 optical images of WS2 and 65 images of MoS2 were used for the training data set. A set of optical images and their corresponding manually labeled images was used to train the model. The training was performed on a system with an NVIDIA GeForce GTX 1080 graphics card with CUDA version 10.1. It took 3 h to train the model for 200 epochs on 100 image data pairs, and the training was stopped once the difference between the actual labeled image and the generated image became negligible. Based on this observation, we chose 200 epochs as the optimal training duration.
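For completeness, a minimal pix2pix training step in TensorFlow is sketched below. It assumes that generator, discriminator, the loss functions, and the Adam optimizers are defined as in the earlier sketches; it illustrates the alternating generator/discriminator updates rather than reproducing the exact implementation used here.

```python
# Alternating pix2pix update sketch; `generator`, `discriminator`,
# `generator_loss`, `discriminator_loss`, `gen_opt`, and `disc_opt` are
# assumed to be defined as in the earlier sketches.
import tensorflow as tf

@tf.function
def train_step(optical, labeled):
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        generated = generator(optical, training=True)
        disc_real = discriminator([optical, labeled], training=True)
        disc_fake = discriminator([optical, generated], training=True)
        g_loss = generator_loss(disc_fake, generated, labeled)
        d_loss = discriminator_loss(disc_real, disc_fake)
    # Backpropagate each loss to its own network and apply the Adam updates.
    g_grads = g_tape.gradient(g_loss, generator.trainable_variables)
    d_grads = d_tape.gradient(d_loss, discriminator.trainable_variables)
    gen_opt.apply_gradients(zip(g_grads, generator.trainable_variables))
    disc_opt.apply_gradients(zip(d_grads, discriminator.trainable_variables))
    return g_loss, d_loss
```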