1 Introduction

In wave optics, optical phenomena are described in terms of waves [1]. However, image sensors detect only the intensity of a light wave and discard its phase. Interferometric methods, such as digital holography, have been widely used to observe both the amplitude and phase of a light wave [2,3,4]. Such methods have been applied to label-free biomedical sensing, an example of quantitative phase imaging [5, 6]. A disadvantage of interferometric methods, however, is the bulky optics needed to introduce the reference light.

Diffraction imaging is an alternative method that measures the complex amplitude (amplitude and phase) of a light wave from an intensity pattern of the diffracted field, without reference light [7,8,9,10].

The inverse problem of recovering the phase from an intensity image is known as phase retrieval [11,12,13,14,15,16]. Diffraction imaging has been used for X-ray imaging because imaging optics and highly coherent light sources that work in this spectral regime are difficult to fabricate [17, 18]. Recently, phase retrieval techniques have also been introduced in the visible regime for lensless imaging and speckle-correlation imaging [19,20,21,22,23,24].

Phase retrieval algorithms conventionally employ an iterative process between the object and sensor domains to recover the phase from the intensity, and they require many iterations to achieve sufficient convergence [14, 15]. Machine learning, in particular deep learning, has recently been used for robust and fast phase retrieval. Methods for machine-learning-based phase retrieval may be categorized into two approaches. In the first approach, a deep neural network (DNN) is used as a denoiser in the iterative phase retrieval process [25, 26]. This approach improves stability during the iterations. In the second approach, a DNN is used to calculate the inverse function in the phase retrieval problem [27, 28]. The second approach enables faster, non-iterative phase retrieval compared with the conventional methods, and it has been used for real-time diffraction imaging, imaging through scattering media, computer-generated holograms, wavefront sensing, and pulse measurement [27,28,29,30,31,32,33,34]. Such DNN-based inversion has also been introduced to optical sensing methods other than phase retrieval [35,36,37,38].

In this paper, we numerically analyze and compare conventional phase retrieval and machine-learning-based non-iterative phase retrieval in terms of the noise robustness and the calculation time. We also demonstrate enhancement of the noise robustness in the machine-learning-based phase retrieval by use of a noisy training data set. Here, for simplicity while maintaining versatility, we assume positive, real objects, and phase retrieval from Fourier intensity measurements [14,15,16]. The fast, noise-robust phase retrieval based on machine learning demonstrated in this paper will contribute to various fields, including biomedicine, security, and astronomy.

2 Method

In the optical setup assumed in this study, an object field \(\varvec{g}\) propagates towards a sensor plane located in the far field and is captured as a single intensity image, as shown in Fig. 1. This imaging process is modeled as follows:

$$\begin{aligned} \varvec{I}=\left| \varvec{G} \right| ^2 =\left| \mathcal {F}[\varvec{g}] \right| ^2, \end{aligned}$$
(1)

where \(\mathcal {F}[\bullet ]\) is the Fourier transform, \(\varvec{G}\) is the Fourier spectrum of the object field via Fraunhofer diffraction, and \(\varvec{I}\) is the captured intensity image. Here we assume that the object is non-negative and real, which are general assumptions in the fields of astronomy and crystallography [14,15,16]. These assumptions have been used for enhancing the convergence and uniqueness of the solution in conventional iterative phase retrieval.
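The forward model of (1) is straightforward to simulate. The following sketch, using a hypothetical random non-negative, real object, illustrates how a far-field intensity measurement can be generated numerically:

```python
import numpy as np

def capture_intensity(g):
    """Simulate the far-field measurement I = |F[g]|^2 of Eq. (1)."""
    G = np.fft.fft2(g)       # Fraunhofer diffraction modeled as a 2D Fourier transform
    return np.abs(G) ** 2    # the sensor records intensity only; the phase of G is lost

# hypothetical non-negative, real object
rng = np.random.default_rng(0)
g = rng.random((28, 28))
I = capture_intensity(g)
```

Note that the captured image `I` contains no phase information; recovering `g` from `I` alone is the phase retrieval problem addressed below.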

The inverse function of (1) is written as

$$\begin{aligned} \varvec{g}=\mathcal {H}[\varvec{I}], \end{aligned}$$
(2)

where \(\mathcal {H}[\bullet ]\) is the inverse function for phase retrieval. This inverse problem has recently been solved non-iteratively using machine-learning-based approaches [26,27,28,29,30,31, 33]. In the present research, we use a convolutional residual network called ResNet [27, 31, 39] to calculate the inverse function in (2) non-iteratively. ResNet is a well-known, practical network architecture that has been used in various applications. It employs residual learning via skip connections to prevent vanishing/exploding gradients during the training stage. In non-iterative machine-learning-based phase retrieval, \(\mathcal {H}\) is regressed with ResNet by using a training data set. Two types of networks with different depths, as shown in Fig. 2, are investigated to compare their calculation costs and noise robustness. The first network, called ResNet1 here, has one down-and-up-sampling process, as shown in Fig. 2a. The second network, called ResNet2 here, has two down-and-up-sampling processes, as shown in Fig. 2b. Here, “D” is a down-sampling block, as shown in Fig. 2c, “U” is an up-sampling block, as shown in Fig. 2d, and “S” is a convolutional block for a skip convolutional connection, as shown in Fig. 2e. The definitions of each layer are as follows: “BatchNorm” is batch normalization [40], and “ReLU” is a rectified linear unit [41]. “Conv(s, l)” and “TConv(s, l)” are, respectively, a 2D convolution and a transposed 2D convolution with filter size s and stride l.
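As a rough sketch of how such a network can be assembled, the following Keras code builds a minimal ResNet1-style model with one down-and-up-sampling pass and a skip convolutional connection. The block compositions, filter counts, and layer ordering here are assumptions for illustration; the precise structure is defined in Fig. 2.

```python
import tensorflow as tf
from tensorflow.keras import layers

def d_block(x, filters):
    # D-block (assumed composition): strided convolution halves the resolution,
    # followed by batch normalization and ReLU.
    x = layers.Conv2D(filters, 3, strides=2, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def u_block(x, filters):
    # U-block (assumed composition): transposed convolution doubles the resolution.
    x = layers.Conv2DTranspose(filters, 3, strides=2, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def resnet1(shape=(28, 28, 1), filters=32):
    inp = tf.keras.Input(shape)
    s = layers.Conv2D(filters, 1, padding="same")(inp)  # S-block: skip convolution
    x = d_block(inp, filters)                           # one down-sampling pass
    x = u_block(x, filters)                             # one up-sampling pass
    x = layers.Add()([x, s])                            # residual (skip) connection
    out = layers.Conv2D(1, 3, padding="same")(x)
    return tf.keras.Model(inp, out)
```

A ResNet2-style model would simply chain two down-sampling and two up-sampling passes, with skip connections bridging matching resolutions.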

Fig. 1

Flow of imaging and reconstruction

Fig. 2

Diagram of a non-iterative machine-learning-based phase retrieval method. a ResNet1, b ResNet2, c D-block, d U-block, and e S-block

Fig. 3

Diagram of a conventional iterative phase retrieval method

In this paper, the error reduction (ER) method and the hybrid input-output (HIO) method are employed as baselines for conventional iterative phase retrieval, as shown in Fig. 3 [14]. In this case, the inverse function \(\mathcal {H}\) is computed iteratively as follows:

(1) Initial estimation of the Fourier spectrum \(\varvec{G}_{n}\): \(\varvec{G}_{n}=\varvec{A} \exp (j\varvec{\varPhi }_{n})\), where \(\varvec{A}\) is the amplitude, set to \(\sqrt{\varvec{I}}\); \(\varvec{\varPhi }_{n}\) is the phase distribution, initially set randomly; and the subscript n is the iteration counter, initially set to one.
(2) Calculation of the intermediate estimated object field \(\varvec{g}'_n\): \(\varvec{g}'_n=\mathcal {F}^{-1}[\varvec{G}_{n}]\), where \(\mathcal {F}^{-1}\) is the inverse Fourier transform.
(3) Update of the object field \(\varvec{g}_{n}\): the estimated object field is refined with constraints on the object domain, which are described in the next paragraph.
(4) Calculation of the intermediate estimated Fourier spectrum \(\varvec{G}'_n\): \(\varvec{G}'_n=\mathcal {F}[\varvec{g}_{n}]\).
(5) Update of the Fourier spectrum \(\varvec{G}_{n}\): the amplitude of the Fourier spectrum is replaced by \(\sqrt{\varvec{I}}\), and the counter n is incremented by one.
Steps (2)–(5) are iterated until the object field converges.
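Steps (1)–(5) can be sketched as the following loop, written so that the object-domain update of Step (3) is passed in as a function (the function name `iterative_phase_retrieval` and its arguments are illustrative, not from the paper):

```python
import numpy as np

def iterative_phase_retrieval(I, update, n_iter=200, seed=0):
    """Steps (1)-(5): alternate between the Fourier and object domains.
    `update` applies the object-domain constraints of Step (3)."""
    rng = np.random.default_rng(seed)
    A = np.sqrt(I)                                            # measured Fourier amplitude
    Phi = rng.uniform(0.0, 2.0 * np.pi, I.shape)
    G = A * np.exp(1j * Phi)                                  # (1) random initial phase
    g_prev = np.zeros(I.shape)
    for _ in range(n_iter):
        g_est = np.fft.ifft2(G)                               # (2) back to the object domain
        g = update(g_est, g_prev)                             # (3) object-domain constraints
        G_est = np.fft.fft2(g)                                # (4) forward transform
        G = A * np.exp(1j * np.angle(G_est))                  # (5) replace Fourier amplitude
        g_prev = g
    return g
```

In practice, convergence can be monitored by the Fourier-domain residual instead of running a fixed number of iterations.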

In the case of the ER method, the update rule at Step (3) in the nth iteration is written as

$$\begin{aligned} \varvec{g}_{n}(x,y) = \left\{ \begin{array}{ll} \varvec{g}' _n (x,y), &{} \quad (x,y \notin \varvec{\eta }), \\ 0, &{} \quad (x,y \in \varvec{\eta }), \end{array} \right. \end{aligned}$$
(3)

where \(\varvec{\eta }\) is the set of all spatial positions that violate the constraints, and x and y are the lateral coordinates on the object plane. The update process of the HIO method is written as

$$\begin{aligned} \varvec{g}_{n} (x,y)= \left\{ \begin{array}{ll} \varvec{g}' _n (x,y), &{} \quad (x,y \notin \varvec{\eta }), \\ \varvec{g}_{n-1} (x,y)-\beta \varvec{g}' _n (x,y), &{} \quad (x,y \in \varvec{\eta }), \end{array} \right. \end{aligned}$$
(4)

where \(\beta \) is a feedback parameter. In all numerical experiments in this study, the constraints on the object field are realness and non-negativity [14,15,16].
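Under the realness and non-negativity constraints, the update rules (3) and (4) can be implemented as follows. Here the violation set \(\varvec{\eta }\) is taken, as one common simplification, to be the positions where the real part of the intermediate estimate is negative (after discarding the imaginary part):

```python
import numpy as np

def er_update(g_est, g_prev):
    # Eq. (3): keep values satisfying realness and non-negativity; zero the rest.
    g = np.real(g_est)
    return np.where(g >= 0, g, 0.0)

def hio_update(g_est, g_prev, beta=0.9):
    # Eq. (4): at violating positions, feed back the previous estimate
    # minus beta times the current intermediate estimate.
    g = np.real(g_est)
    return np.where(g >= 0, g, g_prev - beta * g)
```

Either function can serve as the Step (3) update inside the iteration described above; HIO typically escapes local minima better, while ER is often used for final refinement.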

3 Analysis

The analyses were carried out by numerical simulation. The networks used here have different depths, as shown in Fig. 2. In addition, training data sets without and with noise were used for each network to verify how their performance depends on noise in the training data. In the noisy training data set, white Gaussian noise was added to each of the captured images, with the noise level (SNR) randomly set between 10 and 30 dB.
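The noisy training data can be generated as follows; here the SNR is taken as the ratio of mean signal power to noise power in dB, which is an assumption about the paper's exact noise convention:

```python
import numpy as np

def add_gaussian_noise(I, snr_db, rng):
    """Add white Gaussian noise to a captured image at a given SNR (dB)."""
    signal_power = np.mean(I ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return I + rng.normal(0.0, np.sqrt(noise_power), I.shape)

rng = np.random.default_rng(0)
I = rng.random((28, 28))           # stand-in for a captured intensity image
snr = rng.uniform(10.0, 30.0)      # noise level drawn uniformly between 10 and 30 dB
I_noisy = add_gaussian_noise(I, snr, rng)
```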

The object images were handwritten numbers randomly taken from the EMNIST database, as shown in Fig. 4a [42]. The pixel count of the original and captured images was \(28\times 28\). The training data set was composed of 200,000 pairs of object and captured images, and the test data set was composed of 1000 pairs, without any overlap. The Adam learning algorithm was used for optimizing the network, with an initial learning rate of \(1\times 10^{-5}\), a batch size of 100, and 100 epochs [43]. The loss function of the optimization was the mean squared error. The number of iterations in ER and HIO to achieve sufficient convergence was 200 [13]. The feedback parameter \(\beta \) in (4) was 0.9. The code was implemented in Python and Keras and was executed on a computer with an Intel Xeon 6134 CPU running at 3.2 GHz, with 192 GB of RAM, and an NVIDIA Tesla V100 GPU with 16 GB of VRAM.
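The training configuration stated above maps directly onto a Keras compile-and-fit call; the placeholder model below merely stands in for the ResNet of Fig. 2, and the training-pair arrays are hypothetical names:

```python
import tensorflow as tf

# Placeholder single-layer model; the actual network follows Fig. 2.
model = tf.keras.Sequential([
    tf.keras.layers.Input((28, 28, 1)),
    tf.keras.layers.Conv2D(1, 3, padding="same"),
])

# Adam with an initial learning rate of 1e-5 and MSE loss, per the text.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5), loss="mse")

# model.fit(I_train, g_train, batch_size=100, epochs=100)  # 200,000 training pairs
```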

Phase retrieval with the two networks, each trained on the noiseless and noisy training data sets, was compared with the ER and HIO methods under different measurement noise levels, namely, signal-to-noise ratios (SNRs) of 10, 20, 30, and \(\infty \) dB. The reconstructed results are shown in Fig. 4b. In the cases of ER and HIO, the ambiguity of the spatial shift and flip was compensated for with a cross-correlation process. Artifacts clearly appeared in the reconstructions of ResNet1, ER, and HIO. The reconstruction fidelity was evaluated with the normalized mean squared error (NMSE) between the original and estimated images, defined as follows [44]:

$$\begin{aligned} \mathrm{NMSE} = \frac{\displaystyle \sum \nolimits _{x=1}^M \sum \nolimits _{y=1}^N {(\hat{\varvec{g}}(x,y)-\varvec{g}(x,y))^2}}{\displaystyle \sum \nolimits _{x=1}^M \sum \nolimits _{y=1}^N{\varvec{g}(x,y)^2}}, \end{aligned}$$
(5)

where \(\hat{\varvec{g}}\) is the estimated object image obtained through phase retrieval, and M and N are the numbers of elements along the x- and y-axes, respectively. The average NMSEs over the whole test data set are shown in Fig. 5. The results in Figs. 4 and 5 show that phase retrieval with ResNet2 realized accurate and robust reconstructions, even for noisy measurements, compared with ER and HIO. The deeper network (ResNet2) also gave better estimates than the shallower one (ResNet1): the NMSEs of the reconstructions with ResNet2 were about 1.6 times smaller than those of ResNet1, ER, and HIO, as shown in Fig. 5. Furthermore, the results show that the noisy training data set enhanced the noise robustness of the networks. In addition, ResNet2 trained with the noisy data set of handwritten numbers was applied to measurements simulated from the handwritten alphabetic characters in the EMNIST database [42], with a measurement SNR of 10 dB. In this case, the reconstruction NMSE was 0.55, which shows that the network was over-fitted to the handwritten numbers, although this error was comparable to those of ER and HIO.
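The NMSE of Eq. (5) reduces to a one-line computation:

```python
import numpy as np

def nmse(g_hat, g):
    # Eq. (5): squared error normalized by the energy of the ground-truth image.
    return np.sum((g_hat - g) ** 2) / np.sum(g ** 2)
```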

Fig. 4

Simulation results. a Original and b estimated images

Fig. 5

Relationship between the estimation errors (NMSEs) and the measurement SNRs

A plot depicting the influence of the number of training pairs on the estimation accuracy obtained with ResNet2 using the noisy training data set is shown in Fig. 6. This result shows that a larger number of training pairs reduces the estimation error, as expected. In addition, ResNet2 trained with the noisy data set maintained high noise robustness at every number of training pairs. Figure 7 summarizes the calculation times of ResNet1, ResNet2, ER, and HIO. The results show that the non-iterative machine-learning-based phase retrieval method was about thirty times faster than the conventional methods. It should be noted that the evaluations of ER and HIO were performed on the CPU using NumPy in Python for a fair comparison: in these cases, the calculation time of the Fourier transform on the GPU using CuPy in Python was about ten times longer than that on the CPU. This was not caused by the latency of data transfer between the CPU and GPU, but by inefficient GPU parallelism at this image size, as shown in Ref. [45].

Fig. 6

Relationship between the number of training pairs and the NMSE of ResNet2 with the noisy training data set

Fig. 7

Comparison of the calculation times

4 Conclusion

We numerically analyzed non-iterative phase retrieval algorithms based on machine learning in comparison with conventional iterative algorithms. The machine-learning-based algorithms used convolutional residual networks (ResNet) with different depths. In the numerical comparisons, the deeper network (ResNet2) realized better reconstructions than the shallower network (ResNet1) and the conventional iterative algorithms, ER and HIO. Furthermore, learning with noisy training data sets was verified to improve the noise robustness of these networks. A large training data set (e.g., 200,000 training pairs) reduced the reconstruction error of the machine-learning-based algorithms, which were also one order of magnitude faster than the conventional iterative ones.

As demonstrated in this paper, deep convolutional neural networks are promising solutions for the phase retrieval problem, in terms of accurate reconstruction, noise robustness, and calculation speed. Phase retrieval has a long history, and the range of its applications is significant and still growing. The machine-learning-based approach studied here may solidify the impact of phase retrieval in various fields, such as life sciences and materials science.