1 Introduction

Deep learning (DL) is one of the fastest growing domains, dominating and significantly accelerating advancements in different aspects of industrial development and academic research [1]. The generative pre-trained transformer 3 (GPT-3) [2], which has recently gained immense attention, is one of the latest achievements that epitomizes the increasingly successful application of DL to sophisticated tasks, such as natural language processing. Similar achievements have also been demonstrated in artificially generated images, such as DALL-E [3], in computer vision with transformer-like architectures [4] and in playing games at a superhuman level, such as AlphaZero [5].

To this end, neuromorphic photonics is among the most promising approaches, with recent layouts already paving a realistic road map toward femtojoule-per-MAC efficiencies [6], potentially allowing one to perform MAC operations with very low energy consumption (\(10^{-15}\) joules). These architectures are capable of transmitting optical signals near the speed of light—significantly exceeding their electronic counterparts—which are then manipulated to provide the functionality of neurons [7, 8]. Several approaches have been proposed to this end, employing both purely optical devices [9] and advanced high-speed electro-optical substances [10, 11]. Such architectures provide ultra-high computational rates, since they operate at very high frequencies, exploiting the massive parallelism capabilities arising from their enormous bandwidth [12,13,14]. On top of that, they operate within very low energy and power consumption envelopes [15], making them appealing for DL applications, especially applications that require high-frequency operation and low energy consumption, such as fiber network communications [16] and network monitoring [17, 18].

Although electro-optical substances provide a fast and efficient platform for DL, they also introduce various noise sources that impact the effective bit resolution. More specifically, photonic computing involves the employment of digital-to-analog (DAC) and analog-to-digital (ADC) conversions along with parameter encoding, amplification and processing devices, such as modulators, photodiodes (PDs) and amplifiers, which inevitably degrade the analog precision during inference, since each constituent introduces a relevant noise source that impacts the electro-optic link’s bit resolution properties. Thus, the introduced noise increases when higher line rates are applied, translating to lower bit resolution. Several enhancements have been proposed to this end, based on the fact that neural networks can compensate for noise during inference if they are first trained to withstand it [19, 20]. Such approaches range from taking into account the limited bandwidth of optical neurons [21, 22] and simulating noise using white Gaussian noise [23] to more advanced schemes, such as initializing the networks taking into account the noise and data distribution [24]. Furthermore, existing approaches deal with the noise impairments originating from limited-precision substances and AD/DA conversions by applying post-training quantization [25] or quantization-aware training [26, 27] techniques, significantly improving the performance of models [28].

Typically, the degradation introduced to the analog precision can be simulated through a quantization process that converts a continuous signal to a discrete one by mapping its continuous set of values to a finite set of discrete values [29]. This can be achieved by rounding and truncating the values of the input signal. Despite the fact that quantization techniques are widely studied by the DL community [30,31,32], they generally target large convolutional neural networks (CNNs) containing a large number of surplus parameters with a minor contribution to the overall performance of the model [33, 34]. These large architectures are easily compressed, in contrast to smaller networks, such as those currently developed for neuromorphic photonics, in which every single parameter contributes greatly to the final output of the model [30]. Furthermore, existing works focus mainly on partially quantized models that ignore inputs and biases [30, 35]. These limitations, which are further exacerbated when high-slope photonic activations are used, dictate employing different training paradigms that take into account the actual physical implementation [36]. Indeed, neuromorphic photonics imposes new challenges on the quantization of DL models, requiring the appropriate adaptation of existing methodologies to the unique limitations of photonic substrates, e.g., using photonic activation functions. Furthermore, the quantization scheme applied in neuromorphic photonics typically follows a very simple uniform quantization, because it depends on the DAC/ADC modules that quantize the signals equally and symmetrically [37, 38]. This differs from the approaches traditionally used in trainable quantization schemes for DL models [39], as well as from mixed-precision quantization [40, 41].

Fig. 1

Bit requirements for each layer of AlexNet8, based on the conducted experiments (Sect. 4.1). A normal distribution can be used to attach the probability of layer-wise bit resolution reduction in a mixed-precision scheme

Being able to operate with lower-precision networks during deployment can further improve the potential of analog computing by increasing the computational rate of the developed accelerators, while keeping the energy consumption low [42, 43]. In this work, we focus on the recently proposed high-speed analog photonic computing [44] that unlocks dynamic precision capabilities for photonic neural networks (PNNs) [45, 46]. We propose a stochastic mixed-precision quantization-aware training scheme that is capable of adjusting the bit resolutions among layers in a mixed-precision manner, based on the observed bit resolution distribution of the applied architectures and configurations. More specifically, it gradually reduces the bit resolution of layers, attaching a higher probability for lower bit resolutions to intermediate layers than to outer ones, following a normal distribution, exploiting and evaluating the observation that intermediate-layer bit precision can be significantly reduced without negatively affecting the performance of the model, while dramatically decreasing the inference time [45]. The proposed quantization-aware training method takes into account the quantization noise that arises from the uniform quantization of the learnable parameters, inputs and activation values, while gradually reducing the bit resolution based on their position and the training epoch. We claim that the distribution of bit requirements within a network has an inverted-bell shape and, therefore, we propose a normally distributed probability of bit resolution reduction, as conceptually depicted in Fig. 1.

The proposed method enables us to lower the average bit resolution of models in a gradual and mixed-precision manner, without significant performance degradation. As a result, it provides the capability of exploring and applying lower-precision configurations, increasing the computational rate in contrast to fixed-precision models. To demonstrate the effectiveness of the proposed method, we applied it to a wide range of architectures in various applications, ranging from image classification to financial time-series forecasting, employing photonic activation functions. Based on the photonic architecture proposed in [44], we employed the framework proposed in [45] to quantitatively evaluate the inference time of lower-bit-precision models, highlighting in this way the benefits of applying the proposed mixed-precision quantization-aware training approach. This paper is an extended version of our preliminary work presented in [26], which proposed a quantization-aware training method, oriented to PNNs, that takes into account the quantization noise introduced by uniform quantization. In this work, we further extend our previous work, proposing a mixed-precision approach that gradually reduces the bit resolution of layers during quantization-aware training, taking into account their relative position in the artificial neural network (ANN) and the training epoch.

The rest of this paper is structured as follows. Section 2 provides the necessary background on photonic DL, while the proposed method is introduced and described in Sect. 3. Finally, the experimental evaluation is provided in Sect. 4, while conclusions are drawn in Sect. 5.

2 Background

Similarly to software-implemented ANNs, photonic ones are based on perceptrons with the ultimate goal of approximating a function \(f^*\). More precisely, the input signal of the photonic ANN is denoted as \(\varvec{x} \in {\mathbb {R}}^{N_{i-1}}\), where \(N_{i-1}\) represents the number of input features at the \(i\)-th layer. Each sample in the training dataset is labeled with a vector \(\varvec{t} = \varvec{1}_c \in {\mathbb {R}}^{C}\), where the \(c\)-th element equals 1 and the other elements are 0, if it is a classification task (\(C\) denotes the number of classes), or a continuous vector \(\varvec{t} \in {\mathbb {R}}^{C}\), if it is a regression task (\(C\) denotes the number of regression targets). MLPs approximate \(f^*\) by using more than one layer, i.e., \(f^{(n)}(\ldots f^{(2)}(f^{(1)}(\varvec{x};\varvec{\theta }^{(1)});\varvec{\theta }^{(2)})\ldots ;\varvec{\theta }^{(n)})= \varvec{z}^{(n)}\), and learn the parameters \({\varvec{\theta }}^{(i)}\), where \(1 \le i \le n\), with \({\varvec{\theta }}^{(i)}\) consisting of the weights \(\varvec{w}^{(i)} \in {\mathbb {R}}^{N_i \times N_{i-1}}\) and biases \({\textbf {b}}^{(i)} \in {\mathbb {R}}^{N_i}\). Subsequently, each layer’s output is denoted as:

$$\begin{aligned} \varvec{z}^{(i)}= f^{(i)}(\varvec{y}^{(i-1)})=\varvec{w}^{(i)}\varvec{y}^{(i-1)}+\varvec{b}^{(i)} \in \mathbb {R}^{N_i}. \end{aligned}$$
(1)

The output of the linear part of a neuron is fed to a nonlinear function \(g(\cdot )\), named activation function, to form the final output of the layer:

$$\begin{aligned} \varvec{y}^{(i)} = g(\varvec{z}^{(i)}) \in \mathbb {R}^{N_i}. \end{aligned}$$
(2)

The training of an ANN is achieved by updating its parameters, using the backpropagation algorithm, aiming to minimize a loss function \(J(\varvec{y}, \varvec{t})\), where \(\varvec{t}\) represents the training labels and \(\textbf{y}\) the output of the network. Cross-entropy loss is often used in multi-class classification cases:

$$\begin{aligned} J(\varvec{y}, \varvec{t}) = -\sum _{c=1}^{C}t_c\log y_c \in \mathbb {R} \end{aligned}$$
(3)
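
For illustration, the following minimal NumPy sketch implements Eqs. (1)–(3) for a toy two-layer network; the layer sizes, random weights and the placeholder sigmoid/softmax used here are illustrative assumptions and not part of the photonic configurations discussed later.

```python
import numpy as np

def layer_forward(y_prev, w, b, g):
    """Eqs. (1)-(2): linear response followed by the activation g."""
    z = w @ y_prev + b          # z^(i) = w^(i) y^(i-1) + b^(i)
    return g(z)                 # y^(i) = g(z^(i))

def cross_entropy(y, t, eps=1e-12):
    """Eq. (3): J(y, t) = -sum_c t_c log y_c (y assumed to hold class probabilities)."""
    return -np.sum(t * np.log(y + eps))

# Toy usage: a 2-layer network applied to a single sample.
rng = np.random.default_rng(0)
x = rng.normal(size=8)                        # input features
w1, b1 = rng.normal(size=(20, 8)), np.zeros(20)
w2, b2 = rng.normal(size=(3, 20)), np.zeros(3)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))  # placeholder activation
h = layer_forward(x, w1, b1, sigmoid)
logits = layer_forward(h, w2, b2, lambda z: z)
probs = np.exp(logits - logits.max()); probs /= probs.sum()  # softmax for the loss
t = np.array([0.0, 1.0, 0.0])                 # one-hot target
print(cross_entropy(probs, t))
```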

Apart from feedforward ANNs, in this paper we also apply the quantization methods to a simple-to-apply recurrent neuromorphic photonic architecture [47, 48]. The applied recurrent architecture benefits from existing photonic feedforward implementations [49] while using a feedback loop. Following the above notation and the fact that recurrent architectures accept sequential data as input, let \(\textbf{x}\) be a multidimensional time series, and let \(\textbf{x}_{t} \in \mathbb {R}^{N_{in}}\) denote the \(N_{in}\) observations fed to the input at the \(t\)-th time step. Then, the input signal is weighted by the \(i\)-th neuron using the input weights \(\textbf{w}_i^{(in)} \in \mathbb {R}^{N_{in}}\). Furthermore, the recurrent feedback signal, denoted by \(\textbf{y}^{(r)}_{t-1} \in \mathbb {R}^{N_r}\), which corresponds to the output of the \(N_r\) recurrent neurons at the previous time step, is also weighted by the recurrent weights \(\textbf{w}_i^{(r)} \in \mathbb {R}^{N_r}\). The final weighted output of the \(i\)-th recurrent neuron is calculated as:

$$\begin{aligned} u_{ti}^{(r)} = {\textbf{w}_i^{(in)}}^T \textbf{x}_{t} + {\textbf{w}_i^{(r)}}^T \textbf{y}^{(r)}_{t-1} \in \mathbb {R}. \end{aligned}$$
(4)

It should be noted that we omitted the bias term to simplify the employed notation. Then, this weighted output is fed to the employed photonic nonlinearity \(f(\cdot )\) to acquire the final activation of the neuron as:

$$\begin{aligned} y^{(r)}_{ti} = f(u^{(r)}_{ti}) \in \mathbb {R}. \end{aligned}$$
(5)
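
A minimal NumPy sketch of the recurrent step of Eqs. (4)–(5), computed for all recurrent neurons at once, is given below; the weight scales, sequence length and the placeholder tanh nonlinearity are illustrative assumptions.

```python
import numpy as np

def recurrent_step(x_t, y_prev, W_in, W_r, f):
    """Eqs. (4)-(5): weighted input plus weighted recurrent feedback,
    followed by the photonic nonlinearity f, for all N_r neurons at once."""
    u_t = W_in @ x_t + W_r @ y_prev   # u_t^(r), one entry per recurrent neuron
    return f(u_t)                     # y_t^(r)

# Toy usage over a short sequence (bias omitted, as in the text).
rng = np.random.default_rng(1)
N_in, N_r, T = 4, 3, 5
W_in = rng.normal(scale=0.1, size=(N_r, N_in))
W_r = rng.normal(scale=0.1, size=(N_r, N_r))
f = np.tanh                           # placeholder nonlinearity
y = np.zeros(N_r)
for x_t in rng.normal(size=(T, N_in)):
    y = recurrent_step(x_t, y, W_in, W_r, f)
print(y)
```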

In this study, two photonic activation functions are used. First, the photonic sigmoid activation function is defined as [50]:

$$\begin{aligned} g(z) = A_2+\frac{{A}_{1}-{A}_{2}}{1+e^{(z-{z}_{0} )/d}} \in \mathbb {R}, \end{aligned}$$
(6)

in which the parameters \(A_1 = 0.060, A_2=1.005, z_0=0.154\) and \(d =0.033\) are tuned to fit the experimental observations as implemented on real hardware devices [50].

Also, a photonic sinusoidal activation function is applied in the experimental evaluations. The photonic layout corresponds to employing a Mach-Zehnder modulator (MZM) [51], which converts the data into an optical signal, along with a PD [52]. The formula of this photonic activation function is the following:

$$\begin{aligned} g(z)=\left\{ \begin{array}{ll} 0, &{} \hbox {if }z<0,\\ \sin ^2\left( \frac{\pi }{2}z\right) , &{} \hbox {if }0\le z\le 1,\\ 1, &{} \hbox {if }z>1.\\ \end{array} \right. \end{aligned}$$
(7)

It is worth noting that, because of the narrow input domain of these photonic activations, training is even more difficult, since the networks tend to saturate easily, leading to slower convergence or even halting the training.
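
For reference, the two photonic activations of Eqs. (6) and (7) can be sketched as follows; the parameter values are those reported above, while the vectorized form and the clipping at the domain boundaries are implementation choices.

```python
import numpy as np

def photonic_sigmoid(z, A1=0.060, A2=1.005, z0=0.154, d=0.033):
    """Eq. (6) with the parameters fitted to the experimental data of [50]."""
    return A2 + (A1 - A2) / (1.0 + np.exp((z - z0) / d))

def photonic_sinusoidal(z):
    """Eq. (7): MZM+PD transfer, clipped to 0 and 1 outside the input range."""
    z = np.asarray(z, dtype=float)
    return np.where(z < 0.0, 0.0,
           np.where(z > 1.0, 1.0, np.sin(np.pi * z / 2.0) ** 2))

# Both functions saturate quickly, which is why training with them is harder,
# as noted above.
print(photonic_sigmoid(np.array([0.0, 0.154, 0.5])))
print(photonic_sinusoidal(np.array([-0.2, 0.25, 0.5, 1.3])))
```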

Fig. 2

a PNN noise sources breakdown, b PNN axon bandwidth versus equivalent bit resolution

To evaluate the proposed method, we use a novel analytical framework, proposed in [45], which is capable of correlating the available optoelectrical bandwidth of the underlying photonic components with the corresponding equivalent bit resolution of a PNN. In this way, we are able to identify the major physical mechanisms that define the relationship between the computational rate and the achievable bit resolution of the PNN. In turn, this reveals the latency-precision trade-off of high-speed PNNs, following the paradigms of electronic ANN accelerators [53, 54]. Figure 2a depicts a schematic breakdown of the noise sources that impact the bit resolution performance of the PNN, i.e., \(\sigma _{\rm RIN}\) correlated with the noise contribution of the laser source, \(\sigma _{\rm MM}\) corresponding to the noise introduced by the photonic matrix multiplication circuitry, \(\sigma _{\rm shot}\) corresponding to the random fluctuation of the photodiode’s current, \(\sigma _{\rm dark}\) correlated with the finite dark current of the photodiode, \(\sigma _{\rm ADC}\) corresponding to the quantization noise imposed by the employed ADC and, finally, \(\sigma _{T}\) correlated with the thermal noise of the PNN. Utilizing the aforementioned framework in the scope of a typical neuromorphic photonic layout with an insertion loss of 30 dB, which can approximate the characteristics of a high-scale PNN deployment when referenced to a laser emitted power of 16 dBm, and considering the normalized standard deviation of the noise of the matrix multiplication equal to \(\sigma _{\rm MM}=10^{-3}\) and the remaining noise sources equal to the standardized values proposed in [45], we present in Fig. 2b the relationship between the achievable number of effective bits on a PNN axon and the employed bandwidth.

In this work, we quantitatively evaluate the effect of the proposed method on the inference time by extracting the number of multiply-accumulate operations (MACs) needed. The inference time is calculated according to the bit resolution and millions of multiply-accumulate operations (MMACs) of each layer. To calculate the bandwidth per PNN axon according to the noise equivalent bits (NEQB), we fit an exponential function, \(s(\cdot ): \mathbb {R} \rightarrow \mathbb {R}\), to the experimental data obtained in [45] measuring the PNN configuration of [44], with the formula given by:

$$\begin{aligned} s(x') = a + b e^{-c x' + d} \end{aligned}$$
(8)

where \(a=0.82\), \(b=35.07\), \(c=1.68\) and \(d=4.40\) are the fitted coefficients and \(x' = \max \{2.4, x\}\). We clip \(x\) to constrain the maximum available bandwidth according to hardware specifications. The final execution time of the model in seconds is calculated as:

$$\begin{aligned} T(\varvec{c}, \varvec{r}) = 10^{-3} \sum _{i=1}^{n}\frac{c_i}{s(r_i)} \in \mathbb {R}_{+}, \end{aligned}$$
(9)

where \(c_{i}\) and \(r_{i}\) are the number of MMACs and the equivalent bit resolution of the \(i\)-th layer, respectively.
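
A minimal sketch of the latency model of Eqs. (8)–(9) is given below; the per-layer MMAC counts in the usage example are hypothetical, and the bandwidth units follow the fit reported in [45].

```python
import numpy as np

A, B, C, D = 0.82, 35.07, 1.68, 4.40   # fitted coefficients of Eq. (8)

def axon_bandwidth(bits):
    """Eq. (8): bandwidth supported per PNN axon at a given equivalent bit
    resolution; bits are clipped at 2.4 to respect the hardware bandwidth cap."""
    x = max(2.4, float(bits))
    return A + B * np.exp(-C * x + D)

def inference_time(mmacs, bits):
    """Eq. (9): total execution time in seconds, given the MMACs and the
    equivalent bit resolution of each layer."""
    return 1e-3 * sum(c / axon_bandwidth(r) for c, r in zip(mmacs, bits))

# Hypothetical 3-layer model: per-layer MMACs and a mixed-precision bit assignment.
print(inference_time(mmacs=[10.0, 50.0, 1.0], bits=[6, 3, 5]))
```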

3 Proposed framework

In this work, we propose a framework for quantization-aware training that gradually reduces the bit resolution of the layers of the network, taking into account their position and the training epoch. The proposed framework is oriented to PNNs, and more specifically to the recently proposed dynamic precision architectures [44], but it can also be used out of the box for other neuromorphic architectures without loss of generality.

3.1 Quantization-aware training

The proposed quantization-aware training framework takes into account the quantization error that arises from the limited-precision modules. The quantization-aware scheme exploits the intrinsic ability of ANNs to compensate for known noise sources when they are first trained to withstand them [24]. More specifically, the network is trained with quantized parameters by applying uniform quantization to all parameters involved during the forward pass; consequently, the quantization error is accumulated and propagated through the network to the output and affects the employed loss function. In this way, the network adjusts to lower-precision signals, making it more robust to reduced bit resolution during inference and significantly improving model performance. Under the proposed mixed-precision quantization-aware training framework, which is inspired by and extends the quantization scheme in [28], every signal that is involved in the response of the \(i\)-th layer is first quantized in a specific floating range \([h^{(i)}_{\rm min}, \dots , h^{(i)}_{\rm max}] \in \mathbb {R}\). Then, during the forward pass of the network, a quantization error \(\varvec{\epsilon }\) is injected to simulate the effect of rounding during quantization, while during backpropagation the rounding is ignored and approximated with an identity function. In this way, the backpropagation process can be performed without any major change to existing training pipelines, since the inputs, weights, model parameters and activation values are stored in floating-point format during training. Therefore, the proposed method belongs to the so-called straight-through estimator (STE) quantization family [41].

More specifically, every involved signal is first mapped to the corresponding quantization bin, based on the used bit resolution and assuming that each bin has the same length. Therefore, assuming that \(h^{(i)} \in \mathbb {R}\) represents a value of a signal, e.g., input, weight or output, the quantized version of the signal can be obtained by applying a quantization function \(Q(\cdot ): \mathbb {R} \rightarrow \mathbb {N}\) as:

$$\begin{aligned} h_{q}^{(i)}& = Q(h^{(i)}, s_h^{(i)}, \zeta _h^{(i)}) \nonumber \\& = clip \bigg (\bigg \lfloor {\frac{h^{(i)}}{s_h^{(i)}} + \zeta _{h}^{(i)}}\bigg \rceil , q_{\rm min}^{(i)}, q_{\rm max}^{(i)}\bigg ) \in \mathbb {N}, \end{aligned}$$
(10)

where \(s_h^{(i)} \in \mathbb {R}^{+}\) is the scale factor for the specific signal, \(\zeta _h^{(i)} \in \mathbb {N}\) is the zero point, and \(q_{\rm min}^{(i)} \in \mathbb {R}^+\) and \(q_{\rm max}^{(i)}\in \mathbb {R}^+\) denote the range of an \(r^{(i)}\)-bit positive integer, i.e., \(q_{\rm min}^{(i)}=0\) and \(q_{\rm max}^{(i)}=2^{r^{(i)}}-1\), while the clip function is defined as:

$$\begin{aligned} {\rm clip}(x, m, M) = \max (\min (x, M), m). \end{aligned}$$
(11)

The scale value is computed as:

$$\begin{aligned} s_h^{(i)} = \frac{h^{(i)}_{\rm max}-h^{(i)}_{\rm min}}{q_{\rm max}^{(i)}-q_{\rm min}^{(i)}} \in \mathbb {R^{+}}, \end{aligned}$$
(12)

while \(h^{(i)}_{\rm max} \in \mathbb {R}\) and \(h^{(i)}_{\rm min} \in \mathbb {R}\) represent the proxy maximum and minimum of the signal \(\varvec{h}^{(i)} \in \mathbb {R}^N\). In turn, the zero point is calculated as:

$$\begin{aligned} \zeta _h^{(i)} = clip\bigg \{ \bigg \lfloor {q_{\rm min}^{(i)} - \frac{h_{\rm min}^{(i)}}{s_h^{(i)}}}\bigg \rceil , q_{\rm min}^{(i)}, q_{\rm max}^{(i)}\bigg \} \in \mathbb {N}. \end{aligned}$$
(13)

Then, we can convert back the \(h_{q}^{(i)} \in [0, \ldots , 2^{r^{(i)}}-1]\) value to its floating point representation \(h_{f}^{(i)} \in [h^{(i)}_{\rm min}, \ldots , h^{(i)}_{\rm max}]\) using the dequantization function \(D(\cdot ): \mathbb {N} \rightarrow \mathbb {R}\) as:

$$\begin{aligned} h_f^{(i)} = D(h^{(i)}_{q}, s_h^{(i)}, \zeta _h^{(i)}) = s_{h}^{(i)} (h_{q}^{(i)}-\zeta _{h}^{(i)}) \in \mathbb {R}. \end{aligned}$$
(14)
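
The quantization and dequantization steps of Eqs. (10)–(14) can be sketched as follows; the toy weight vector and the 4-bit setting in the usage example are illustrative only.

```python
import numpy as np

def quant_params(h_min, h_max, bits):
    """Scale and zero point for uniform quantization (Eqs. (12)-(13))."""
    q_min, q_max = 0, 2 ** bits - 1
    scale = (h_max - h_min) / (q_max - q_min)
    zero_point = int(np.clip(np.round(q_min - h_min / scale), q_min, q_max))
    return scale, zero_point, q_min, q_max

def quantize(h, scale, zero_point, q_min, q_max):
    """Eq. (10): map a real-valued signal to r-bit integers."""
    return np.clip(np.round(h / scale + zero_point), q_min, q_max).astype(int)

def dequantize(h_q, scale, zero_point):
    """Eq. (14): map the integers back to their floating-point proxies."""
    return scale * (h_q - zero_point)

# Quantize-dequantize round trip of a toy weight vector at 4 bits.
w = np.array([-0.8, -0.1, 0.0, 0.3, 0.9])
s, z, qmin, qmax = quant_params(w.min(), w.max(), bits=4)
w_q = quantize(w, s, z, qmin, qmax)
print(w_q, dequantize(w_q, s, z))
```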

Following the above notation, the quantized response of the \(i\)-th layer of a network, before applying the activation function, can be calculated as

$$\begin{aligned} \textbf{z}_{f}^{(i)} = Quant(\varvec{w}_{f}^{(i)} \varvec{y}^{(i-1)}_f + \varvec{b}^{(i)}_f) \in \mathbb {R}^{N_i}, \end{aligned}$$
(15)

where \(Quant(\varvec{x}) = D(Q(\varvec{x}, s_x^{(i)}, \zeta _x^{(i)}), s_x^{(i)}, \zeta _x^{(i)})\) denotes the process of quantization followed by the dequantization of a vector \(\varvec{x}\in \mathbb {R}^{N_{i-1}}\) and can be applied element-wise on vectors, while \(\varvec{w}^{(i)}_{f} \in [w^{(i)}_{\rm min}, \ldots , w^{(i)}_{\rm max}]^{N_i \times N_{i-1}}\) and \(\varvec{b}^{(i)}_{f} \in [b^{(i)}_{\rm min}, \ldots , b^{(i)}_{\rm max}]^{N_i}\) denote the quantized weights and biases of the \(i\)-th layer. Finally, the output \(\textbf{z}_{f}^{(i)}\) passes through the photonic activation function \(g(\cdot )\) of the neuron:

$$\begin{aligned} \textbf{y}^{(i)}_{f} = Quant(g( \varvec{z}_{f}^{(i)})) \in \mathbb {R}^{N_i}. \end{aligned}$$
(16)
Fig. 3

A linear neuron with quantization noise injected during the forward pass

Therefore, all signals involved in the neuron output are distributed in a uniform floating range between \(h_{\rm min}^{(i)}\) and \(h_{\rm max}^{(i)}\) and can be represented by \(r^{(i)}\) bits. Thus, the quantization error is propagated through the network as a noise signal that is considered during the training process. This can be trivially implemented by injecting quantization errors during the forward pass as follows:

$$\begin{aligned} \varvec{y}^{(i)} = g^{(i)}((\varvec{w}^{(i)} + \varvec{\epsilon }_w) (\varvec{y}^{(i-1)} + \varvec{\epsilon }_y) + \varvec{b}^{(i)} + \varvec{\epsilon }_b) + \varvec{\epsilon }_g \in \mathbb {R}^{N_i}, \end{aligned}$$
(17)

where \(\varvec{\epsilon }_w\), \(\varvec{\epsilon }_y\), \(\varvec{\epsilon }_b\) and \(\varvec{\epsilon }_g\) are the weight, input, bias and activation quantization errors, calculated as the difference between the original value and the quantized one obtained using the \(Quant(\cdot )\) function. For example, for the weights \(\varvec{w}\), the quantization error is calculated as \(\varvec{\epsilon }_w = Quant(\textbf{w}) - \textbf{w}\). This process is illustrated in Fig. 3 for feedforward networks. It is worth noting that, without loss of generality, the same quantization framework can be applied to other architectures as well, such as recurrent ones. We should note that during training the quantization effect is only simulated, while backpropagation proceeds as usual, meaning that the original full-precision parameters are updated according to the propagated loss.
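
A minimal PyTorch sketch of this noise-injection forward pass is given below, assuming per-tensor ranges taken directly from the current tensor (rather than the EMA-based proxies introduced next) and a small guard against degenerate ranges; it is a sketch of the STE idea, not the exact implementation.

```python
import torch

def fake_quant(x, bits):
    """Quantize-dequantize (the Quant(.) of Eq. (15)) with a straight-through
    estimator: the rounding is applied in the forward pass only, so gradients
    flow to x as if Quant were the identity."""
    q_max = 2 ** bits - 1
    h_min, h_max = x.min().detach(), x.max().detach()
    scale = (h_max - h_min).clamp_min(1e-8) / q_max       # guard for constant tensors
    zero = torch.clamp(torch.round(-h_min / scale), 0, q_max)
    x_q = torch.clamp(torch.round(x / scale + zero), 0, q_max)
    x_f = scale * (x_q - zero)
    return x + (x_f - x).detach()   # epsilon = Quant(x) - x, injected without a gradient path

def noisy_layer(y_prev, w, b, g, bits):
    """Eq. (17): forward pass with quantization error injected into every signal."""
    z = fake_quant(w, bits) @ fake_quant(y_prev, bits) + fake_quant(b, bits)
    return fake_quant(g(z), bits)

# Toy check: gradients still reach the full-precision parameters.
w = torch.randn(5, 8, requires_grad=True)
b = torch.randn(5, requires_grad=True)
y = noisy_layer(torch.randn(8), w, b, torch.sigmoid, bits=4)
y.sum().backward()
print(w.grad.shape, b.grad.shape)
```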

The proxy minimum (\(h_{\rm min}\)) and maximum (\(h_{\rm max}\)) values of a signal \(\varvec{h} \in \mathbb {R}^N\), on which the scale of the uniform buckets depends, can significantly affect the quantization noise, both during training and inference. A large value for \(h_{\rm max}\) (or a small value for \(h_{\rm min}\), respectively) originates from an outlier of the signal, causing a significant portion of the signal values to fall within the same bucket, in turn inducing information loss and high quantization noise. To this end, we propose a method to adjust the \(h_{\rm min}\) and \(h_{\rm max}\) values during training using an exponential moving average (EMA). The EMA suppresses the effect of outliers in vectors and matrices, smoothing in this way the quantization-aware training process. Therefore, models become more robust to outliers during training and, in turn, more robust to the quantization noise injected into the signals during inference.

The EMA is applied iteratively in every training iteration, since the distribution of every signal is transformed during training. To this end, the proxy minimum (\(h_{\rm min}^{(i)}\)) and maximum (\(h_{\rm max}^{(i)}\)) values of a signal \(\varvec{h}^{(i)} \in \mathbb {R}^N\) are calculated in every training iteration \(t\) as:

$$\begin{aligned} \tilde{h}^{(i)}_{{\rm max}, t}& = (\beta /t) h^{(i)}_{{\rm max}, t} + (1 - (\beta /t))\tilde{h}^{(i)}_{{\rm max}, t-1} \end{aligned}$$
(18)
$$\begin{aligned} \tilde{h}^{(i)}_{{\rm min}, t}& = (\beta /t) h^{(i)}_{{\rm min}, t} + (1 - (\beta /t))\tilde{h}^{(i)}_{{\rm min}, t-1} \end{aligned}$$
(19)

where \(\beta\) is the weighting parameter of the EMA and the update is applied for \(t>\lceil \beta \rceil\).
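
A minimal sketch of the EMA range tracking of Eqs. (18)–(19) follows; the warm-up handling for \(t \le \lceil \beta \rceil\) and the value of \(\beta\) used in the example are assumptions.

```python
import numpy as np

def ema_range(h, state, beta, t):
    """Eqs. (18)-(19): update the proxy (min, max) of a signal h with an EMA
    whose weight beta/t decays with the training iteration t."""
    h_min, h_max = float(h.min()), float(h.max())
    if state is None or t <= np.ceil(beta):   # warm-up: track the raw statistics
        return (h_min, h_max)
    w = beta / t
    prev_min, prev_max = state
    return (w * h_min + (1.0 - w) * prev_min,
            w * h_max + (1.0 - w) * prev_max)

# Toy usage: running range of a stream of noisy batches.
rng = np.random.default_rng(2)
state = None
for t in range(1, 101):
    batch = rng.normal(size=256)
    state = ema_range(batch, state, beta=20.0, t=t)
print(state)
```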

3.2 Mixed-precision quantization-aware training

In addition to quantization-aware training, we propose a gradual approach to reduce the bit resolution of the layers depending on their position and the training epoch. The proposed method takes into account the expected or observed distribution of bit requirements of every layer of the network and accordingly reduces their bit resolution in a mixed-precision manner during training. More precisely, in this work, we claim that intermediate layers require lower bit resolutions, in contrast to the first and last layers, which are susceptible to quantization noise and negatively affect the performance of the model when lower bit resolutions are applied. In the case of the first layers, we argue that, according to the data processing inequality, the information lost in one layer cannot be recovered in subsequent ones [55]. On the other end, the last layers of the network are dedicated to projecting the extracted features to the multidimensional output space that determines the predictions made by the model. During training, especially when the cross-entropy loss function is used, similar samples (for example, digits 2 and 5 in the MNIST dataset) are placed near one another in this space, in contrast to significantly dissimilar samples, which are placed far apart (for example, digits 1 and 5 in the MNIST dataset). Thus, models are more susceptible to noise applied to the last layer(s), especially when similar samples are included in the evaluation dataset, reducing in this way their performance.

To this end, we attach to the middle layers a higher probability of bit resolution reduction during training, while for the outer layers we attach a smaller probability. The probabilities are distributed to the layers based on a Gaussian distribution, and in every epoch they are increased according to the introduced hyperparameters. Conceptually, the proposed method is motivated by the roulette wheel selection approach that is widely used in genetic algorithms (GAs) [56]. In contrast to GAs, which attach static probabilities to chromosomes based on their fitness values, the proposed method attaches normally distributed probabilities to layers, depending on the layers' positions, and gradually increases them during training.

Fig. 4

The top row presents the maximum probability densities for different values of the hyperparameter \(\tau\). The second row presents the sliced maximum probability densities for layers 2 and 3, according to the hyperparameters of each column. The number of slices of the maximum density of each layer depends on the hyperparameter \(\delta\). Each layer starts from the lower-opacity slices and moves to the higher ones, representing in this way the probability increment

More precisely, for each layer, we define the maximum probability of reducing its bit resolution. If a layer reaches its highest probability, the probability remains constant either until the end of training or until a bit reduction occurs. If a bit reduction is performed in the \(t\)-th epoch, in the next epoch the layer starts again from the lowest probability. More specifically, at every epoch \(t\) we draw as many values \(\varvec{u} \in \mathbb {R}^{n}\) as the number of layers from the respective Gaussian distributions \(\varvec{u} \sim \left[ \mathcal {N}^{(0)}(\mu , \sigma ), \ldots , \mathcal {N}^{(n)}(\mu , \sigma ) \right]\), where \(n\) denotes the number of layers, and \(\mu\) and \(\sigma\) define the mean and standard deviation of the Gaussian distributions, respectively; both are hyperparameters of the proposed method. Every layer \(i\) is assigned a maximum probability of reducing its bit resolution, which depends on its position and can be calculated with the cumulative distribution function (CDF) of the standard normal distribution as:

$$\begin{aligned} p^{(i)}_{\rm max} = \frac{1}{\sqrt{2\pi }}\int _{a^{(i)}}^{a^{(i+1)}}e^{-z^2/2}dz \in [0, 1], \end{aligned}$$
(20)

giving the probability density that lies in the range between \(a^{(i)}\) and \(a^{(i+1)}\), where \(\varvec{a} \in \mathbb {R}^{n+1}\) is defined as \(\varvec{a} = [-\tau , \ldots , \tau ]\), expressing the uniform slicing of the Gaussian distribution along the \(z\)-axis, as depicted in the top row of Fig. 4. The hyperparameter \(\tau\) controls the width of the slices, affecting the maximum probability of each layer. For example, increasing \(\tau\) increases the maximum probabilities of the inner layers while decreasing those of the outer layers; conversely, decreasing \(\tau\) reduces the maximum probabilities of the inner layers and, consequently, increases those of the outer ones.
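
The per-layer maximum probabilities of Eq. (20) can be computed directly from the standard normal CDF, as in the following sketch for a four-layer network with the default \(\tau = 3\).

```python
import math

def layer_max_probabilities(n_layers, tau=3.0):
    """Eq. (20): slice [-tau, tau] into n_layers uniform intervals and assign
    each layer the standard-normal probability mass of its slice, so inner
    layers receive the largest maximum probability of bit reduction."""
    phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF
    a = [-tau + 2.0 * tau * i / n_layers for i in range(n_layers + 1)]
    p_max = [phi(a[i + 1]) - phi(a[i]) for i in range(n_layers)]
    return a, p_max

# For a 4-layer network, the two inner layers get the highest maximum
# probabilities, matching the shape sketched in Fig. 1 and Fig. 4.
edges, p_max = layer_max_probabilities(4)
print([round(p, 3) for p in p_max])
```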

To this end, if the layer is at its maximum probability, the mixed-precision algorithm checks whether the point \(u^{(i)}_t \sim \mathcal {N}^{(i)}(\mu , \sigma )\), drawn from the respective Gaussian distribution, lies within the range \(u^{(i)}_t \in [a^{(i)},a^{(i+1)})\), which happens with probability \(p^{(i)}_{{\rm max}, t}\) at the \(t\)-th epoch. If \(a^{(i)} \le u^{(i)}_t < a^{(i+1)}\) and \(r^{(i)} > r_{\rm min}\), where \(r_{\rm min} \in \mathbb {N}_{+}\) defines the minimum allowed bit resolution, then the bit resolution of the \(i\)-th layer is dropped to \(r^{(i)}_{t+1} = r^{(i)}_t - r_{\rm step}\), where \(r_{\rm step} \in \mathbb {N}_{+}\) defines the bit reduction step. This can be expressed as:

$$\begin{aligned} r_{t+1}^{(i)}=\left\{ \begin{array}{ll} max\{r_{\rm min}, r_t^{(i)} - r_{\rm step}\} &{} \text {if } a^{(i)} \le u^{(i)}_t \le a^{(i+1)}\\ r_{t}^{(i)} &{} \text {otherwise}. \end{array} \right. \end{aligned}$$
(21)

Having introduced a stochastic approach that systematically gives a higher possibility of bit reduction to the layers that do not crucially affect the performance of the network, based on both theoretical and experimental observations [41], we also introduce a gradual way of applying it during quantization-aware training. Granting the \(i\)-th layer its full range \([a^{(i)},a^{(i+1)})\), and thus its maximum possibility \(p^{(i)}_{\rm max}\), in every epoch would result in very low precision within the very first few epochs of training. As a result, the quantization noise would be significantly high in the first epochs of training, where the network is unstable and far from its convergent stage, causing difficulties in training, such as vanishing gradient phenomena [57]. This can lead to a bad local minimum and significant performance degradation. To this end, we propose a gradual bit reduction method that in every epoch increases the probability of each layer reducing its resolution. This is achieved by uniformly slicing the maximum possible range, \([a^{(i)},a^{(i+1)})\), of the \(i\)-th layer anew, as depicted in the second row of Fig. 4.

More specifically, we introduce a divisor value \(\delta \in \mathbb {N}_{+}\) that slices the maximum available range \(\varvec{A}^{(i)} = [a^{(i)},a^{(i+1)})\) into \(\delta\) uniform slices as:

$$\begin{aligned} \varvec{\Delta }^{(i)} = \left[ a^{(i)}, a^{(i)} + \frac{a^{(i+1)} - a^{(i)}}{\delta },\ldots , a^{(i)} + \frac{(\delta -1)\left( a^{(i+1)} - a^{(i)}\right) }{\delta }, a^{(i+1)}\right] \in \mathbb {R}^{\delta +1}, \end{aligned}$$

where \(\varvec{\Delta }^{(i)}\) is called the active range of the \(i\)-th layer.

At every epoch \(t\), the active range is increased either until it reaches its maximum available range, given by \(\varvec{\Delta }^{(i)}_{\delta } = [a^{(i)},a^{(i+1)})\), which is equivalent to the maximum probability density \(p^{(i)}_{\rm max}\) of the \(i\)-th layer, or until the point drawn from the respective Gaussian distribution, \(u^{(i)}_t \sim \mathcal {N}^{(i)}(\mu , \sigma )\), falls within the active range, \(u^{(i)}_t \in \varvec{\Delta }^{(i)}_{j_t^{(i)}}\). The index \(j_t^{(i)} \in \{1, \ldots , \delta \}\) counts the epochs passed either since the start of training or since the last bit reduction. Therefore, at every training epoch \(t\) in which the drawn value is not within the active range, \(u^{(i)}_{t} \notin \varvec{\Delta }^{(i)}_{j_t^{(i)}}\), the index is increased by one unit, \(j_{t+1}^{(i)} =\min \{j_{t}^{(i)} + 1, \delta \}\), unless it is already equal to \(\delta\), increasing in this way the active range by another slice, as expressed by

$$\begin{aligned} \varvec{\Delta }^{(i)}_{j_{t}^{(i)}}= \left\{ \begin{array}{ll} \left[ a^{(i)} + \frac{(\delta -j^{(i)}_t)(a^{(i+1)} - a^{(i)})}{\delta }, a^{(i+1)}\right) &{} a^{(i)} \ge 0 \\ \left[ a^{(i)}, a^{(i)} + \frac{j^{(i)}_t(a^{(i+1)} - a^{(i)})}{\delta }\right) &{} \text {otherwise}. \\ \end{array} \right. \end{aligned}$$
(22)

The two cases define the probability increase for the two sides of the normal distribution, as shown in the second row of Fig. 4. If the range of the layer lies on the negative side of the distribution, the probability increases by unlocking the next slice on its right. Otherwise, if the layer lies on the positive side of the distribution, the probability is increased by decreasing the lower bound to unlock another slice. In both cases, the probability of the \(i\)-th layer performing a bit reduction is increased, \(p^{(i)}_{j+1} \ge p^{(i)}_{j}\). In the case where \(u^{(i)}_t \in \varvec{\Delta }_{j}^{(i)}\), the active range is reset to its minimum, \(\varvec{\Delta }^{(i)}_{j=1}\), and the method performs a bit reduction according to \(r^{(i)}_{t} = {\rm max}\{r^{(i)}_{t-1} - r_{\rm step}, r_{\rm min}\}\). This procedure enables us to smoothly increase the injected quantization noise, while ensuring that the noise is first introduced in the layers that are more robust to its effects. The probability of the \(i\)-th layer reducing its precision by \(r_{\rm step}\) bits at the \(t\)-th epoch is given by:

$$\begin{aligned} p^{(i)}_{t} = \left\{ \begin{array}{ll} {\rm min}\left\{ p^{(i)}_{\rm max}, \frac{1}{\sqrt{2\pi }}\int _{\varvec{\Delta }_{j_t^{(i)}}}e^{-z^2/2}dz\right\} &{} \text {if } r_{t}^{(i)} > r_{\rm min},\\ 0 &{} \text {otherwise}, \end{array} \right. \end{aligned}$$
(23)

where \(j\) is calculated at every epoch \(t\) as:

$$\begin{aligned} j_{t+1}^{(i)}=\left\{ \begin{array}{ll} 1 &{} \text {if } u_{t} \in \varvec{\Delta }_{j_t^{(i)}},\\ min\{j_t^{(i)} + 1, \delta \} &{} \text {otherwise}, \end{array} \right. \end{aligned}$$
(24)

with \(j_1^{(i)}=1\) for \(1 \le i \le n\).
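
Putting Eqs. (21)–(24) together, one epoch of the schedule can be sketched as follows for a single network; the state layout, the handling of layers already at \(r_{\rm min}\) and the toy \(\delta\) value are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def epoch_update(state, a, delta, rng, r_min=2, r_step=2, mu=0.0, sigma=1.0):
    """One epoch of the gradual mixed-precision schedule (Eqs. (21)-(24)).
    `state` holds, per layer i, the current resolution r[i] and slice index j[i];
    `a` holds the slice edges a^(0), ..., a^(n) over [-tau, tau]."""
    for i in range(len(state["r"])):
        lo, hi = a[i], a[i + 1]
        width = (hi - lo) / delta
        # Active range Delta^(i)_j, grown from the low-density end of the slice (Eq. (22)).
        if lo >= 0:                                # positive side of the Gaussian
            active = (lo + (delta - state["j"][i]) * width, hi)
        else:                                      # negative side
            active = (lo, lo + state["j"][i] * width)
        u = rng.normal(mu, sigma)                  # u_t^(i) ~ N(mu, sigma)
        if active[0] <= u < active[1] and state["r"][i] > r_min:
            state["r"][i] = max(r_min, state["r"][i] - r_step)   # Eq. (21)
            state["j"][i] = 1                      # reset the active range
        else:
            state["j"][i] = min(state["j"][i] + 1, delta)        # Eq. (24)
    return state

# Toy run: 4 layers, tau = 3, delta = 25 (i.e., 100 epochs / 4), starting at 8 bits.
tau, n_layers, delta = 3.0, 4, 25
a = [-tau + 2.0 * tau * i / n_layers for i in range(n_layers + 1)]
state = {"r": [8] * n_layers, "j": [1] * n_layers}
rng = np.random.default_rng(0)
for epoch in range(100):
    state = epoch_update(state, a, delta, rng)
print(state["r"])
```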

In this work, we used the default hyperparameter values for all experimental evaluation cases. More specifically, we empirically propose using a normal distribution with mean 0 and variance equal to 1, \(\mathcal {N}(\mu =0,\sigma =1)\), while setting \(\tau =3\). The gradual increment of the probability of bit reduction for each layer depends on \(\delta\), which we set to \(\delta =M/4\), where \(M\) is the number of epochs. Finally, the bit reduction step is set to \(r_{\rm step}=2\). It should be mentioned that the proposed method is presented for an even number of layers for simplicity, but it can be used straightforwardly for an odd number of layers simply by shifting the normal distribution right or left.

Algorithm 1 (mixed-precision quantization-aware training)
Algorithm 2 (quantization of a parameter vector or matrix)

The proposed framework is presented algorithmically in Algorithms 1 and 2. Algorithm 1 presents the mixed-precision quantization-aware training, while Algorithm 2 presents the quantization of a parameter vector or matrix. More specifically, Algorithm 1 takes as input the number of epochs \(M\), the trainable parameters of each layer (i.e., weights and biases), the initial and minimum bit resolutions, as well as the hyperparameters of mixed-precision quantization-aware training, namely \(r_{\rm step}\), \(\tau\) and \(\delta\). First, the algorithm constructs the slices that are used for the maximum probability of each layer (line 2) and attaches to each layer the minimum active probability (lines 4–5) and the initial bit resolution (line 6). The network is then trained in a quantization-aware manner (lines 7–21), by quantizing every parameter involved during the forward pass (lines 8–19) and applying the proposed gradual bit reduction approach, calculating for the \(t\)-th epoch the bit resolution \(r_{t}^{(i)}\) of each layer (lines 8–15). First, the proposed method draws a value \(u_{t}^{(i)}\) for each layer from the respective Gaussian distribution (line 9) and checks whether it is within the active range of the layer (lines 10–15). In cases where \(u_{t}^{(i)}\) is within the active range (lines 10–12), the bit resolution is reduced if it is still above the minimum (line 11) and, in turn, the active probability is reset to its minimum (line 12). In cases where the drawn value is not within the active range (lines 13–15), the active probability is increased, unless the layer has already reached its maximum probability.

After that, quantization-aware training can proceed as usual. The quantization of parameters, which is explained in Algorithm 2, is applied to \(\varvec{w}_{f}^{(i)}\) (the weights of the \(i\)-th layer), \(\varvec{b}^{(i)}_{f}\) (the biases of the layer) and \(\varvec{y}^{(i-1)}_{f}\) (the inputs of the layer), and then to the outputs \(\varvec{z}_{f}^{(i)}\) of each layer \(i\) during the forward pass. Then, backpropagation updates the original weights and biases by applying the loss function \(J(\varvec{y}_f, \varvec{t})\). More specifically, Algorithm 2 presents the quantization of a matrix or vector parameter \(\varvec{p}\). Initially, we compute \(p^{\rm min}\) and \(p^{\rm max}\) using EMA (lines 2–3), which are used to compute the scale and zero point (lines 4–5). In turn, \(\varvec{p}\) is quantized into a finite set of integers in \([0, \ldots , 2^{r^{(i)}_{t}}-1]\) (line 6) and then mapped back to the floating range \([\tilde{p}^{\rm min}, \ldots , \tilde{p}^{\rm max}]\) (line 7), which is the final output of the algorithm.

4 Experimental evaluation

To demonstrate the effectiveness of the proposed method, we first conduct experiments to investigate the bit requirements for each layer based on its position, showing that intermediate layers require lower bit resolutions, in contrast to the outer ones. In turn, we present experimental evaluation results showing the large inference time benefits of applying mixed-precision quantization techniques. After that, we evaluate the proposed method by applying it to different neural network architectures and configurations, including Multi-Layer Perceptrons (MLPs), CNNs and Recurrent Neural Networks (RNNs). For all applied architectures, we evaluate the performance of the proposed method using two different photonic configurations, based on the photonic sigmoid and sinusoidal activation functions presented in Sect. 2. More specifically, we demonstrate its capabilities in two image classification tasks, using the MNIST and CIFAR10 datasets, and in a high-frequency financial time series forecasting task, using the FI2010 dataset [58]. We report experimental results for average bit resolutions ranging from 2 to 7 bits, acquired over multiple evaluation runs. Finally, in the case of the CIFAR10 task, we demonstrate the efficiency of the proposed method in terms of inference time, applying the evaluation framework introduced in Sect. 2. For benchmarking, we use two uniform quantization baselines: (a) a post-training quantization method that uses the minimum and maximum values of a matrix to uniformly distribute the distinct values, and (b) the quantization-aware training method presented in Sect. 3, using EMA during training to calculate the proxy minimum, \(h_{\rm min}\), and maximum, \(h_{\rm max}\), values.

4.1 Mixed-precision evaluation

Fig. 5

Mixed-precision requirements of 3 different convolutional architectures. The top row presents the relation between accuracy and bit resolution for each layer. The bottom row plots the minimum bit requirements without significant accuracy degradation (\(<1\%\))

We experimentally demonstrate the bit requirements of each layer by investigating different bit resolution configurations in 3 well-known convolutional architectures. More specifically, we applied LeNet5 [59], AlexNet8 [60] and ResNet9 [61] to the CIFAR10 image classification task, and we trained them employing the AdamW [62] optimizer for 100 epochs with a learning rate equal to 0.001 and a weight decay equal to \(10^{-5}\). After completing 5 training runs for each architecture, we evaluate them at different bit resolutions, applied from the first to the last layer, using the simplest post-training quantization method. The quantization method uniformly quantizes the parameters, using for \(h_{\rm min}\) and \(h_{\rm max}\) the minimum and maximum values, respectively. We present the average accuracy over 5 evaluation runs for each bit resolution between 2 and 7 bits, applied to each layer of the network. This allows us to investigate the relation between each layer’s bit resolution and the overall performance of the model.

The experimental results are presented in Fig. 5. More specifically, in the upper row, we plot the average evaluation accuracy for different bit resolutions when applied to different layers. The dotted red line presents the average accuracy, over 5 evaluation runs, when standard 32-bit floating-point arithmetic is applied. The different lines represent the different layers of the models. In the bottom row, we report the minimum bit resolution requirements for each layer without a significant accuracy drop (\(<1\%\)). In cases where it is not possible to reach the acceptable accuracy, we report the maximum available bit resolution, which is 7 bits. We should note that our analysis does not take into account the inner dependencies of bit requirements, i.e., the effect on one layer's bit resolution when the previous one also has reduced bit resolution, since we apply bit reduction only to one layer at a time.

In the first row of the figure, we can observe that the first and last layers of the network either start from a lower performance than the intermediate layers or their performance drops significantly even for small bit reductions. For example, the last layers of LeNet5 and AlexNet8 achieve lower accuracy even at bit resolutions greater than 3. We also observe similar behavior in the ResNet9 architecture, where the performance degradation is visible for the last layer even when applying 3-bit resolution. In the same architecture, the performance also collapses when we apply lower bit resolution to the first layers of the model. More precisely, a degradation of nearly \(8\%\) occurs when reducing the bit resolution from 8 to 3 bits.

The equivalent behavior followed by all three architectures is clearly depicted in the second row of the figure, in which we report the minimum bit resolution requirement for each layer without significant performance degradation (\(<1\%\)). We observe that when we apply bit reduction to the first and last layers of both the LeNet5 and AlexNet8 architectures, they are not able to compensate for the equivalent noise and resist performance degradation. Thus, the minimum bit resolution for such layers is the maximum available, i.e., 7 bits. A similar behavior is also observed in the ResNet9 architecture. Even though the first and last layers can compensate for the quantization noise arising from bit reduction at resolutions larger than 4 bits, they still cannot be adjusted to lower bit resolutions, such as 2 and 3 bits, as the intermediate layers can. We attribute this behavior to the significance of the contribution of these layers to the final outputs of the network. On one end, the first layers of the network are responsible for building the representation maps used by the following layers to classify samples; thus, injecting quantization noise into them, e.g., into the first two layers in Fig. 5, results in significant information loss that cannot be recovered in subsequent layers, according to the data processing inequality [55]. Therefore, the extracted feature maps lack significant details that could probably be used by the following layers of the network to correctly classify the samples. On the other end, the last layers are intrinsically susceptible to noise, since models are traditionally trained with the cross-entropy loss, which typically results in classification neurons, i.e., the neurons of the classification layer, that output relatively larger values for the correct class samples in reference to the incorrect ones according to sample similarities [63]. In this way, without regularization or label smoothing techniques, the outputs are not distributed in equidistant clusters depending on their class, but in a more linear way, making them susceptible to noise phenomena, such as quantization noise [46]. It is worth noting that not taking the noise into account during training results in significant performance degradation that can be avoided by accounting for it [24, 28], as also demonstrated in the following experimental results.

Fig. 6

Inference time for the mixed- and fixed-precision quantization methods applied to different architectures. The inference time axis (y-axis) is on a log scale, in seconds. On top of each mixed-precision bar, the percentage of inference time reduction obtained by applying mixed-precision quantization is reported

To demonstrate the effectiveness of the mixed-precision approach in terms of inference time, we plot in Fig. 6 the expected inference time depending on the NEQB of each layer, applying the analysis framework presented in Sect. 2. More specifically, for each architecture, the left bar represents the inference time needed when the mixed-precision quantization technique is applied. The bit resolution of each layer is obtained from the previous experimental evaluation, as presented in the second row of Fig. 5, defining ideal cases of the mixed-precision post-training quantization technique. The right bar reports a fixed-precision quantization, where every layer has a resolution of 7 bits. The reduction in inference time is impressive even for the smaller models. For example, in the LeNet5 architecture, the inference time of the model is about one third lower than the fixed-precision time. As the models increase in size, and as a result the MACs per layer increase, the inference time is further reduced, as depicted for the ResNet9 architecture, where a 90% reduction is achieved. This is expected behavior, since larger models are over-parameterized, with a large number of MACs per layer, and therefore each parameter has a smaller contribution to the final output in contrast to smaller models. To this end, a smaller bit resolution can be applied to the network without significant performance degradation, as shown in Fig. 5. Combined with the fact that the inference time decreases exponentially with the bit resolution, as depicted in Fig. 2b in Sect. 2, this explains the large improvement in terms of time performance.

4.2 Image classification

For the image classification benchmarks, we use two traditional datasets, MNIST and CIFAR10, applying two architectures, a 4-layer MLP and an 8-layer CNN, respectively. The small architectures are used to demonstrate the capability of the method to perform on architectures that can potentially be implemented using electro-optical components, based on current capabilities and limitations. For the MNIST case, the employed architecture is composed of 4 linear layers, with the first and last layers having 10 neurons each and the intermediate layers having 20 neurons. Photonic activation functions are applied after every hidden layer. In the case of CIFAR10, we use an 8-layer CNN with residual connections. More specifically, the CNN consists of 3 convolutional layers, 2 residual blocks, which are composed of 2 convolutional layers each, and an output classification layer. The first two convolutional layers use \(3\times 3\) kernels without bias, applying stride and padding equal to one, with 64 and 128 channels, respectively. Both layers are followed by a batch normalization layer, while the second layer is also followed by a max-pooling layer with kernel size and stride equal to 2. The output of the max-pooling layer is then passed to the residual block, composed of two convolutional layers, both of which are followed by a batch normalization layer, and the input of the residual block is added to its output. A similar residual block with 256 channels is applied after the fifth convolutional layer and before the output linear layer. The CNN is thus a smaller ResNet-style architecture, referred to as ResNet8.
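
A hedged PyTorch sketch of this ResNet8 variant is given below; the placement of the photonic activation, the pooling of the fifth convolutional layer and the global average pooling before the output layer are assumptions where the description above leaves them unspecified.

```python
import torch
import torch.nn as nn

def photonic_sigmoid(z, A1=0.060, A2=1.005, z0=0.154, d=0.033):
    return A2 + (A1 - A2) / (1.0 + torch.exp((z - z0) / d))

def conv_bn(c_in, c_out, pool=False):
    layers = [nn.Conv2d(c_in, c_out, 3, stride=1, padding=1, bias=False),
              nn.BatchNorm2d(c_out)]
    if pool:
        layers.append(nn.MaxPool2d(2, 2))
    return nn.Sequential(*layers)

class Residual(nn.Module):
    """Two conv+BN layers whose input is added to their output."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(conv_bn(channels, channels),
                                  conv_bn(channels, channels))
    def forward(self, x):
        return x + photonic_sigmoid(self.body(x))   # activation placement assumed

class ResNet8(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.stem = conv_bn(3, 64)                       # conv 1
        self.down1 = conv_bn(64, 128, pool=True)         # conv 2 + maxpool
        self.res1 = Residual(128)                        # convs 3-4
        self.down2 = conv_bn(128, 256, pool=True)        # conv 5 (pooling assumed)
        self.res2 = Residual(256)                        # convs 6-7
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(256, num_classes))   # output layer
    def forward(self, x):
        x = photonic_sigmoid(self.stem(x))
        x = photonic_sigmoid(self.down1(x))
        x = self.res1(x)
        x = photonic_sigmoid(self.down2(x))
        x = self.res2(x)
        return self.head(x)

print(ResNet8()(torch.randn(2, 3, 32, 32)).shape)   # torch.Size([2, 10])
```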

For both image classification tasks, we use the RMSProp optimizer with a learning rate equal to 0.0001 and a batch size equal to 256. We train the models for 100 epochs and repeat the process five times to report the average and variance of the evaluation accuracies for every evaluated method in Tables 1 and 2. For the proposed method, we report the average evaluation accuracy at the average bit resolution, where the latter is calculated by floor-rounding the bit resolutions of the layers. We use the default hyperparameters of the proposed method, setting the initial bit resolution in a range between 8 and 3 bits. As a reference, we also report the average evaluation accuracy over five runs for 32-bit full-precision models.

Table 1 Evaluating the proposed method on the MNIST dataset using a fully connected 4-layer architecture. The table reports the average evaluation accuracy and standard deviation over 5 runs

In Table 1, we report the evaluation accuracy over 5 evaluation runs for the MNIST case. We first observe that quantization-aware training (columns 3–4) is crucial for the performance of the models. More specifically, applying the post-training quantization method (column 2) in the photonic sinusoidal configuration, the performance of the models collapses. Even in the photonic sigmoid case, where post-training quantization obtains comparable performance, it still leads to a huge performance degradation, especially for resolutions lower than 4 bits. This highlights the limitations of such post-training quantization methods, which uniformly quantize parameters without taking into account possible outliers, making them especially susceptible and unstable to quantization noise. Such instabilities can be avoided by using quantization-aware training approaches that make models more robust to outliers, using EMA to calculate the proxy minimum, \(h_{\rm min}\), and maximum, \(h_{\rm max}\), values. Quantization-aware trained models (column 3) significantly resist performance degradation even at resolutions down to 3 bits, where the performance degradation against full-precision models is lower than \(9\%\) in the photonic sinusoidal case and lower than \(3\%\) in the sigmoid case. The proposed method (column 4) achieves even better performance at all the evaluated resolutions, significantly increasing the accuracy at lower bit resolutions, such as in the case of 2 bits in the photonic sinusoidal configuration, where the accuracy is improved by about \(9\%\) in contrast to quantization-aware training, highlighting the contribution of gradually reducing the bit resolution in a mixed-precision manner. Furthermore, it allows us to decrease the bit resolution to 3 bits for both activation functions, with a performance degradation lower than \(4\%\) against the full-precision models.

Fig. 7

The training process using the proposed method. The first two columns depict the model's probability density distribution for bit reduction at different epochs. The first row of the last column reports the training accuracy of the proposed method in contrast to quantization-aware training. The last subfigure of the second row depicts the probability of each layer during training

In Fig. 7, we plot the probability distribution of each layer over different epochs, as well as the accuracy during training in contrast to quantization-aware training, giving further insight into the behavior of the proposed method. More specifically, in the first two columns of the figure we schematically represent the probability density of every layer of the network for 4 different epochs (5, 10, 25 and 50). The network is trained for 100 epochs, using the default hyperparameters of the proposed method, with initial and minimum resolutions set to 8 and 2 bits, respectively. As depicted in the first two columns of the figure, the intermediate layers (layers 2 and 3) have a higher probability of reducing their bit resolutions at epoch 5. At epoch 10, the probability of the second layer is reduced, since a bit reduction was performed at an epoch between 5 and 10, while the probability of the third layer is increased.

This is depicted more clearly in the subfigure in the third column of the second row of the figure, where the probabilities of every layer during training are plotted. As shown, for every layer, the probability is increased until the point where a bit reduction occurs, with this point differing for each layer, depending on its position and the epoch when the last bit reduction was performed. For example, for the second layer, two bit reductions have occurred until epoch 25, one at epoch 8 and another at epoch 19, while for the third one, only one bit reduction occurs, at epoch 17. The third layer reaches its maximum probability at epoch 70, where the last bit reduction is performed, and after that its probability is reduced to zero, since the layer has reached the lowest available bit resolution (2 bits).

The epochs of the bit reductions are reported in the last subfigure of the first row with a star marker in the reference color of the layer. In this subfigure, we also observe the benefits of the proposed method in contrast to quantization-aware training. The quantization-aware training baseline reaches a plateau at a very early stage of training, after which only minor improvements are observed. On the other hand, the proposed method converges faster to a better local minimum, with bit reductions impacting the performance later during training, while still leading to better performance than quantization-aware training. At the end of training, the models trained with the proposed method end up with a resolution of 2 bits in the intermediate layers, 6 bits in the first layer and 4 bits in the last, achieving \(5\%\) better accuracy than the models trained with quantization-aware training.

Table 2 Evaluating the proposed method on the CIFAR10 dataset using ResNet8. The table reports the average evaluation accuracy and standard deviation over 5 runs

Performance improvements are also obtained in convolutional architectures when we apply the proposed gradual mixed-precision quantization-aware training, as reported in Table 2, highlighting the generalization ability of the method. More specifically, in convolutional architectures the employed post-training quantization baseline (column 2) collapses even at 7-bit resolution. To this end, quantization-aware training (column 3) essentially contributes to model performance when lower bit resolutions are applied. More specifically, by applying quantization-aware training we are able to reduce the bit resolution down to 4 bits with less than \(10\%\) performance degradation. On top of that, the proposed method (column 4) enables us to reduce the bit resolution further, achieving even better performance. Interestingly, in the case of the photonic sigmoid, the improvements in evaluation accuracy against quantization-aware training are larger than \(2\%\) at all bit resolutions.

The proposed method can also improve the inference time required by a PNN. Using the formula presented in Sect. 2, we calculate the inference time for the different evaluation runs. More specifically, in Fig. 8, the inference time is reported for both quantization-aware training and the proposed method. As depicted, applying the proposed quantization method, we are able to perform inference faster and more accurately in contrast to quantization-aware training. Faster inference can be very useful in high-rate applications, where a high-frequency response is required. As presented, applying the proposed method leads to better performance for the average bit precision in both MLPs and CNNs, irrespective of the applied activation function, and significantly reduces the inference time at the same or better accuracy. This highlights the effectiveness and robustness of the proposed method across different architectures, as well as its generalization ability.

Fig. 8

Inference time versus accuracy for quantization-aware training and the proposed method, applying both photonic activation functions

4.3 Forecasting financial time series

The dataset used to evaluate the photonic recurrent architecture is a high-frequency financial time series limit order book dataset (FI-2010) [64], consisting of more than 4,000,000 limit orders from 5 Finnish companies. The data processing scheme and evaluation procedure are extensively described in [58]. For the following experiments, dataset splits 1 to 5 were used. The forecasting task is to predict the movement of the future mid-price after the next 10 time steps, which can go down, go up or remain stationary. The DL network used for the experiment consists of a recurrent photonic layer with 32 neurons, as described in Sect. 2. The output of the recurrent layer is fed to two fully connected layers consisting of 512 and 3 neurons, respectively. The length of the time series fed to the model is 10 (the current and the past 9 time steps). The models are optimized for 10 epochs using the RMSProp optimizer, while the learning rate is set to \(10^{-4}\).
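
A minimal PyTorch sketch of this recurrent forecasting model is given below; the feature dimension (here 144), the placeholder nonlinearities and the unrolling scheme are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class PhotonicRecurrentForecaster(nn.Module):
    """Recurrent photonic layer (32 neurons, Eqs. (4)-(5)) followed by two fully
    connected layers of 512 and 3 neurons, for the 3-class mid-price movement task."""
    def __init__(self, n_features, n_hidden=32, activation=torch.sigmoid):
        super().__init__()
        self.w_in = nn.Linear(n_features, n_hidden, bias=False)
        self.w_r = nn.Linear(n_hidden, n_hidden, bias=False)
        self.f = activation                     # photonic nonlinearity placeholder
        self.fc1 = nn.Linear(n_hidden, 512)
        self.fc2 = nn.Linear(512, 3)

    def forward(self, x):                       # x: (batch, time, features)
        y = torch.zeros(x.size(0), self.w_r.in_features, device=x.device)
        for t in range(x.size(1)):              # unroll over the 10 time steps
            y = self.f(self.w_in(x[:, t]) + self.w_r(y))
        return self.fc2(torch.sigmoid(self.fc1(y)))

model = PhotonicRecurrentForecaster(n_features=144)
print(model(torch.randn(8, 10, 144)).shape)     # torch.Size([8, 3])
```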

Table 3 Evaluating the proposed method on the FI2010 dataset using an RNN. The table reports the average Cohen's kappa and standard deviation over five splits in 3 evaluation runs

The average and variance of the Cohen's kappa scores over the five splits in 3 evaluation runs are reported in Table 3 for both activation functions. The kappa metric is used since the dataset is highly imbalanced. Similarly to the image classification cases, the proposed method outperforms the baseline approaches and significantly improves the performance of the models when lower bit resolutions are applied. For both activation functions, we even observe performance improvements in contrast to the full-precision models when we employ the proposed method. This is expected behavior, since quantization regularizes the network and in this way over-fitting is avoided. To this end, higher scores are observed on average in contrast to full-precision models, even when the proposed method leads to average bit resolutions near 4 bits. Concluding, the proposed framework can lead to higher performance in the average case with lower bit requirements in a wide range of applications and architectures, irrespective of the employed photonic configuration, as demonstrated by the conducted experiments.

5 Conclusions

We propose a quantization-aware training framework that is oriented to uniform mixed-precision quantization, but can easily be extended to other quantization schemes, such as logarithmic and dynamic approaches. Additionally, evaluating the energy efficiency of such quantization schemes using a theoretical framework is a promising research direction, since it unlocks the capability of evaluating quantization approaches before they are deployed on the actual hardware. This allows one to estimate the energy consumption of an applied PNN beforehand.

In this paper, we focus on lowering the bit resolution within a model to increase the computational rate when novel dynamic PNNs are applied. More specifically, we proposed a quantization-aware training method that enables one to significantly reduce the bit resolution of layers, based on the observation that intermediate layers are able to operate at lower bit resolutions without negatively affecting the performance of the models. The proposed method provides advantages over traditional quantization-aware training methods, since it takes into account the required bit distribution within the network and gradually reduces the resolution of each layer according to its position. Based on the conducted experiments, we propose to normally distribute the probabilities defining the possibility of bit reduction for each layer and to incrementally increase them during training. The effectiveness of the proposed method is demonstrated in various architectures, tasks and photonic configurations, confirming its capability to significantly decrease the average bit resolution of models in contrast to the evaluated baselines, as well as to reduce the inference time. In this way, it benefits the potential use of analog computing and, more specifically, of neuromorphic photonics, by further increasing the computational rate of the developed accelerators while keeping the energy consumption low.