Abstract
The energy-demanding nature of deep learning (DL) has fueled immense interest in neuromorphic architectures, owing to their ability to operate at very high frequencies with very low energy consumption. To this end, neuromorphic photonics are among the most promising research directions, since they are able to achieve femtojoule per MAC efficiency. Although electrooptical substances provide a fast and efficient platform for DL, they also introduce various noise sources that impact the effective bit resolution, introducing new challenges to DL quantization. In this work, we propose a quantization-aware training method that gradually reduces the bit resolution of layers in a mixed-precision manner, enabling us to operate lower-precision networks during deployment and further increase the computational rate of the developed accelerators while keeping the energy consumption low. Exploiting the observation that intermediate layers have lower-precision requirements, we propose to gradually reduce layers’ bit resolutions by normally distributing the reduction probability of each layer. We experimentally demonstrate the advantages of mixed-precision quantization in both performance and inference time. Furthermore, we experimentally evaluate the proposed method in different tasks, architectures, and photonic configurations, highlighting its capability to reduce the average bit resolution of DL models while significantly outperforming the evaluated baselines.
1 Introduction
Deep learning (DL) is one of the fastest growing domains, dominating and significantly accelerating advances in many aspects of industrial development and academic research [1]. The generative pre-trained transformer 3 (GPT-3) [2], which has recently gained immense attention, is one of the latest achievements that epitomizes the increasingly successful application of DL to sophisticated tasks, such as natural language processing. Similar achievements have also been demonstrated in artificially generated images, such as DALL-E [3], in computer vision with transformer-like architectures [4], and in playing games at a superhuman level, such as AlphaZero [5].
To this end, neuromorphic photonics are among the most promising approaches, with recent layouts already paving a realistic road map toward femtojoule per MAC efficiencies [6], potentially allowing one to perform MAC operations with very low energy consumption (\(10^{-15}\) joules). These architectures are capable of transmitting optical signals near the speed of light, significantly exceeding their electronic counterparts, and these signals are then manipulated to provide the functionality of neurons [7, 8]. Several approaches have been proposed to this end, employing both purely optical devices [9] and advanced high-speed electro-optical substances [10, 11]. Such architectures provide ultra-high computational rates since they operate at very high frequencies, exploiting the massive parallelism arising from their enormous bandwidth [12,13,14]. On top of that, they operate within very low energy and power consumption envelopes [15], making them appealing for DL applications, especially those that require high frequency and low energy consumption, such as fiber network communications [16] and network monitoring [17, 18].
Although electrooptical substances provide a fast and efficient platform for DL, they also introduce various noise sources that impact the effective bit resolution. More specifically, photonic computing involves digital-to-analog (DAC) and analog-to-digital (ADC) conversions, along with parameter encoding, amplification and processing devices, such as modulators, photodiodes (PDs) and amplifiers, that inevitably degrade the analog precision during inference, since each constituent introduces a relevant noise source that impacts the electro-optic link’s bit resolution properties. Thus, the introduced noise increases when higher line rates are applied, translating to lower bit resolution. Several enhancements have been proposed to this end, based on the fact that neural networks can compensate for noise during inference if they are first trained to withstand it [19, 20]. Such approaches range from taking into account the limited bandwidth of optical neurons [21, 22] and simulating noise using white Gaussian noise [23] to more advanced schemes, such as initializing the networks taking into account the noise and data distribution [24]. Furthermore, existing approaches deal with the noise impairments originating from limited precision substances and AD/DA conversions by applying post-quantization [25] or quantization-aware training [26, 27] techniques, significantly improving the performance of models [28].
Typically, the degradation introduced to analog precision can be simulated through a quantization process that converts a continuous signal to a discrete one by mapping its continuous set to a finite set of discrete values [29]. This can be achieved by rounding and truncating the values of the input signal. Although quantization techniques are widely studied by the DL community [30,31,32], they generally target large convolutional neural networks (CNNs) containing a large number of surplus parameters with a minor contribution to the overall performance of the model [33, 34]. These large architectures are easily compressed, in contrast to smaller networks, such as those currently developed for neuromorphic photonics, in which every single parameter contributes greatly to the final output of the model [30]. Furthermore, existing works focus mainly on partially quantized models that ignore input and bias [30, 35]. These limitations, which are further exacerbated when high-slope photonic activations are used, dictate employing different training paradigms that take into account the actual physical implementation [36]. Indeed, neuromorphic photonics impose new challenges on the quantization of DL models, requiring the appropriate adaptation of existing methodologies to the unique limitations of photonic substrates, e.g., using photonic activation functions. Furthermore, the quantization scheme applied in neuromorphic photonics typically follows a very simple uniform quantization, because it depends on the DAC/ADC modules that quantize the signals equally and symmetrically [37, 38]. This differs from the approaches traditionally used in trainable quantization schemes for DL models [39], as well as mixed-precision quantization [40, 41].
Being able to operate lower-precision networks during deployment can further improve the potential use of analog computing by increasing the computational rate of the developed accelerators, while keeping the energy consumption low [42, 43]. In this work, we focus on recently proposed high-speed analog photonic computing [44] that unlocks dynamic precision capabilities for photonic neural networks (PNNs) [45, 46]. We propose a stochastic mixed-precision quantization-aware training scheme that is capable of adjusting the bit resolutions among layers in a mixed-precision manner, based on the observed bit resolution distribution of the applied architectures and configurations. More specifically, it gradually reduces the bit resolution of layers, assigning a higher probability of lower bit resolutions to intermediate layers than to outer ones, following a normal distribution. In doing so, it exploits and evaluates the observation that the bit precision of intermediate layers can be significantly reduced without negatively affecting the performance of the model, while dramatically decreasing the inference time [45]. The proposed quantization-aware training method takes into account the quantization noise that arises from the uniform quantization of the learnable parameters, inputs, and activation values, while gradually reducing the bit resolution based on the layers' position and the training epoch. We claim that the distribution of bit resolution within a network has the shape of an inverted bell and, therefore, we propose a normally distributed probability of reducing bit resolutions, as conceptually depicted in Fig. 1.
The proposed method enables us to lower the average bit resolution of models in a gradual and mixed-precision manner, without significant performance degradation. As a result, it provides us the capability of exploring and applying lowered precision configurations, increasing the computation rate in contrast to fixed precision models. To demonstrate the effectiveness of the proposed method, we applied it in a wide range of architectures in various applications, ranging from image classification to financial time-series forecasting, employing photonic activation functions. Based on the photonic architecture proposed in [44], we employed a framework, which is proposed in [45], to quantitatively evaluate the inference time of lower bit precision models, highlighting in this way the benefits of applying the proposed mixed-precision quantization-aware training approach. This paper is an extended version of our preliminary work presented in [26], which proposed a quantization-aware training method, oriented to PNNs, that takes into account the quantization noise introduced from uniform quantization. In this work, we further extend our previous work, proposing a mixed-precision approach that gradually reduces the bit resolution of layers during the quantization-aware training, taking into account their relative position in the artificial neural network (ANN) and the training epoch.
The rest of this paper is structured as follows. Section 2 provides the necessary background on photonic DL, while the proposed method is introduced and described in Sect. 3. Finally, the experimental evaluation is provided in Sect. 4, while the conclusion is drawn in Sect. 5.
2 Background
Similarly to software-implemented ANNs, photonic ones are based on perceptrons with the ultimate goal of approximating a function \(f^*\). More precisely, the input signal of the photonic ANN is denoted as \(\varvec{x} \in {\mathbb {R}}^{N_{i-1}}\), where \(N_{i-1}\) represents the number of input features of the \(i\)-th layer. Each sample in the training dataset is labeled with a vector \(\varvec{t} = \varvec{1}_c \in {\mathbb {R}}^{C}\), where the \(c\)-th element equals 1 and the other elements are 0 if it is a classification task (C denotes the number of classes) or a continuous vector \(\varvec{t} \in {\mathbb {R}}^{C}\) if it is a regression task (C denotes the number of regression targets). MLPs approximate \(f^*\) by using more than one layer, i.e., \(f^{(n)}(\ldots f^{(2)}(f^{(1)}(\varvec{x};\varvec{\theta }^{(1)});\varvec{\theta }^{(2)})\ldots ;\varvec{\theta }^{(n)})= \varvec{z}^{(n)}\) and learn the parameters \({\varvec{\theta }}^{(i)}\) where \(0 \le i \le n\) with \({\varvec{\theta }}^{(i)}\) consisting of the weights \(\varvec{w}^{(i)} \in {\mathbb {R}}^{N_i \times N_{i-1}}\) and biases \({\textbf {b}}^{(i)} \in {\mathbb {R}}^{N_i}\). Subsequently, each layer’s output is denoted as:
The output of the linear part of a neuron is fed to a nonlinear function \(g(\cdot )\), named activation function, to form the final output of the layer:
The training of an ANN is achieved by updating its parameters, using the backpropagation algorithm, aiming to minimize a loss function \(J(\varvec{y}, \varvec{t})\), where \(\varvec{t}\) represents the training labels and \(\textbf{y}\) the output of the network. Cross-entropy loss is often used in multi-class classification cases:
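Under the notation above, and assuming the standard fully connected formulation with \(\varvec{y}^{(0)} = \varvec{x}\) (a convention introduced here for illustration), the layer response, the activation, and the cross-entropy loss take the familiar form:

\[
\varvec{z}^{(i)} = \varvec{w}^{(i)} \varvec{y}^{(i-1)} + \varvec{b}^{(i)}, \qquad
\varvec{y}^{(i)} = g\left( \varvec{z}^{(i)}\right) , \qquad
J(\varvec{y}, \varvec{t}) = -\sum _{c=1}^{C} t_c \log y_c .
\]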
Apart from feedforward ANNs, in this paper we also apply the quantization methods to a simple-to-apply recurrent neuromorphic photonic architecture [47, 48]. The applied recurrent architecture benefits from the existing photonic feedforward implementations [49] while using a feedback loop. Following the above notation and the fact that recurrent architectures accept sequential data as input, let \(\textbf{x}\) be a multidimensional time series and let \(\textbf{x}_{t} \in \mathbb {R}^{N_{in}}\) denote the \(N_{in}\) observations fed to the input at the t-th time step. Then, the input signal is weighted by the i-th neuron using the input weights \(\textbf{w}_i^{(in)} \in \mathbb {R}^{N_{in}}\). Furthermore, the recurrent feedback signal, denoted by \(\textbf{y}^{(r)}_{t-1} \in \mathbb {R}^{N_r}\), which corresponds to the output of the \(N_r\) recurrent neurons at the previous time step, is also weighted by the recurrent weights \(\textbf{w}_i^{(r)} \in \mathbb {R}^{N_r}\). The final weighted output of the i-th recurrent neuron is calculated as:
It should be noted that we omitted the bias term to simplify the employed notation. Then, this weighted output is fed to the employed photonic nonlinearity \(f(\cdot )\) to acquire the final activation of the neuron as:
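For concreteness, a recurrent neuron consistent with this description can be sketched as follows. The symbol \(z_t^{(i)}\) for the weighted output and the inner-product form are assumptions made here for illustration, with the bias omitted as in the text:

\[
z_t^{(i)} = \textbf{w}_i^{(in)\top } \textbf{x}_t + \textbf{w}_i^{(r)\top } \textbf{y}^{(r)}_{t-1}, \qquad
y_t^{(i)} = f\left( z_t^{(i)}\right) .
\]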
In this case study, two photonic activation functions are used. First, the photonic sigmoid activation function is defined as [50]:
in which the parameters \(A_1 = 0.060, A_2=1.005, z_0=0.154\) and \(d =0.033\) are tuned to fit the experimental observations as implemented on real hardware devices [50].
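As a rough illustration, the photonic sigmoid can be sketched in Python under the assumption that it takes the logistic form \(g(z) = A_2 + (A_1 - A_2)/(1 + e^{(z - z_0)/d})\). This functional form and the name `photonic_sigmoid` are assumptions made here; only the parameter values are taken from the text.

```python
import math

# Parameter values quoted in the text (fitted to hardware measurements [50]).
A1, A2, Z0, D = 0.060, 1.005, 0.154, 0.033


def photonic_sigmoid(z: float) -> float:
    """Assumed logistic form: A2 + (A1 - A2) / (1 + exp((z - Z0) / D))."""
    # Clamp the exponent so math.exp cannot overflow for large |z|.
    exponent = max(-700.0, min(700.0, (z - Z0) / D))
    return A2 + (A1 - A2) / (1.0 + math.exp(exponent))


if __name__ == "__main__":
    # The response moves from about A1 to about A2 within a narrow window around Z0.
    for z in (0.0, 0.1, Z0, 0.2, 0.3):
        print(f"z={z:.3f} -> g(z)={photonic_sigmoid(z):.4f}")
```

With \(d = 0.033\), almost the whole transition happens within a few hundredths of a unit around \(z_0\), which is the kind of high-slope behavior that makes such activations easy to saturate.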
In addition, a photonic sinusoidal activation function is used in the experimental evaluation. The photonic layout corresponds to employing a Mach-Zehnder Modulator device (MZM) [51] that converts the data into an optical signal along with a PD [52]. The formula of this photonic activation function is the following:
It is worth noting that, because of the narrow range of the input domain these photonic activations have, training is even more difficult, since the networks tend to be easily saturated, leading to slower convergence or even halting the training.
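The saturation effect can be illustrated with a small sketch. It assumes the piecewise form often used for MZM-based sinusoidal activations, \(g(z) = 0\) for \(z < 0\), \(\sin ^2(\pi z / 2)\) for \(0 \le z \le 1\), and \(1\) for \(z > 1\). This form, and the helper names below, are assumptions for illustration rather than the exact definition used in the experiments.

```python
import math
from typing import Callable


def photonic_sinusoidal(z: float) -> float:
    """Assumed piecewise form: 0 below 0, sin^2(pi*z/2) on [0, 1], 1 above 1."""
    if z < 0.0:
        return 0.0
    if z > 1.0:
        return 1.0
    return math.sin(math.pi * z / 2.0) ** 2


def numerical_slope(f: Callable[[float], float], z: float, h: float = 1e-6) -> float:
    """Central finite-difference estimate of df/dz at z."""
    return (f(z + h) - f(z - h)) / (2.0 * h)


if __name__ == "__main__":
    # Outside [0, 1] the slope is zero, so no gradient flows through the neuron.
    for z in (-0.5, 0.25, 0.5, 0.75, 1.5):
        print(f"z={z:+.2f}  g={photonic_sinusoidal(z):.3f}  "
              f"slope={numerical_slope(photonic_sinusoidal, z):.3f}")
```

Under this assumed form, any pre-activation that drifts outside the unit interval yields a zero gradient, which is one way to see why training can slow down or halt.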
To evaluate the proposed method, we use a novel analytical framework, proposed in [45], which is capable of correlating the available optoelectrical bandwidth of the underlying photonic components with the corresponding equivalent bit resolution of a PNN. In this way, we are able to identify the major physical mechanisms that define the relationship between the computational rate and the achievable bit resolution of the PNN. In turn, this reveals the latency-precision trade-off of high-speed PNNs, following the paradigms of electronic ANN accelerators [53, 54]. Figure 2a depicts a schematic breakdown of the noise sources that impact the bit resolution performance of the PNN, i.e., \(\sigma _{\rm RIN}\) correlated with the noise contribution of the laser source, \(\sigma _{\rm MM}\) corresponding to the noise introduced by the photonic matrix multiplication circuitry, \(\sigma _{\rm shot}\) that corresponds to the random fluctuation of the photodiode’s current, \(\sigma _{\rm dark}\) correlated with the finite dark current of the photodiode, \(\sigma _{\rm ADC}\) that corresponds to the quantization noise imposed by the employed ADC and finally \(\sigma _{T}\) correlated with the thermal noise of the PNN. Utilizing the aforementioned framework in the scope of a typical neuromorphic photonic layout with an insertion loss of 30 dB, which can approximate the characteristics of a high-scale PNN deployment when referenced to a laser emitted power of 16 dBm, and considering the normalized standard deviation of the noise of the matrix equal to \(\sigma _{\rm MM}=10^{-3}\) and the remaining noise sources equal to the standardized values proposed in [45], we present the relationship between the achievable number of effective bits on a PNN axon versus the bandwidth employed in Fig. 2b.
In this work, we quantitatively evaluate the proposed method on the inference time, by extracting the number of multiply accumulate operations (MACs) needed. The inference time is calculated according to bit resolution and millions of multiply accumulate operations (MMACs) of each layer. To calculate the bandwidth per PNN’s axon according to noise equivalent bits, we fit an exponential function, \(s(\cdot ): \mathbb {R} \rightarrow \mathbb {R}\), to the experimental data obtained in [45] measuring the PNN configuration of [44] with the formula given by:
where \(a=0.82\), \(b=35.07\), \(c=1.68\) and \(d=4.40\) are the coefficients and \(x' = \min \{2.4, x\}\). We clip \(x\) to constrain the maximum available bandwidth according to hardware specifications. The final execution time of the model in seconds is calculated as:
where \(c_{i}\) and \(r_{i}\) are the number of MMACs and the equivalent bit resolution of the \(i\)-th layer, respectively.
3 Proposed framework
In this work, we propose a framework for quantization-aware training that gradually reduces the bit resolution of layers of the network, taking into account their position and training epoch. The proposed framework is oriented to PNNs and more specifically to the recently proposed dynamic precision architectures [44], but can also be used out of the box for other neuromorphic architectures, without loss of generality.
3.1 Quantization-aware training
The proposed quantization-aware training framework takes into account the quantization error that arises from the limited precision modules. The quantization-aware scheme exploits the intrinsic ability of ANNs to compensate for known noise sources when they are first trained to withstand them [24]. More specifically, the network is trained with quantized parameters by applying uniform quantization to all parameters involved during the forward pass. Consequently, the quantization error is accumulated and propagated through the network to the output, where it affects the employed loss function. In this way, the network is adjusted to lower-precision signals, making it more robust to reduced bit resolution during inference and significantly improving model performance. Under the proposed mixed-precision quantization-aware training framework, which is inspired by and extends the quantization scheme in [28], every signal that is involved in the response of the \(i\)-th layer is first quantized in a specific floating range \([h^{(i)}_{\rm min}, \dots , h^{(i)}_{\rm max}] \in \mathbb {R}\). Then, during the forward pass of the network, quantization error \(\varvec{\epsilon }\) is injected to simulate the effect of rounding during the quantization, while during the backpropagation the rounding is ignored and approximated with an identity function. In this way, the backpropagation process can be performed without any major change to the existing training pipelines, since the input, weights, model parameters and activation values are stored in floating point format during the training. Therefore, our proposed method belongs to the so-called straight-through estimator (STE) quantization family [41].
More specifically, every involved signal is first mapped to the corresponding quantization bin, based on the used bit resolution and assuming that each bin has the same length. Therefore, assuming that \(h^{(i)} \in \mathbb {R}\) represents a value of a signal, e.g., input, weight or output, the quantized version of the signal can be obtained by applying a quantization function \(Q(\cdot ): \mathbb {R} \rightarrow \mathbb {N}\) as:
where \(s_h^{(i)} \in \mathbb {R}^{+}\) is the scale factor for the specific signal, \(\zeta _h^{(i)} \in \mathbb {N}\) is the zero point, \(q_{\rm min}^{(i)} \in \mathbb {R}^+\) and \(q_{\rm max}^{(i)}\in \mathbb {R}^+\) denote the range of an \(r^{(i)}\)-bit positive integer, i.e., \(q_{\rm min}^{(i)}=0\) and \(q_{\rm max}^{(i)}=2^{r^{(i)}}-1\), while the clip function is defined as:
The scale value is computed as:
while \(h^{(i)}_{\rm max} \in \mathbb {R}\) and \(h^{(i)}_{\rm min} \in \mathbb {R}\) represent the proxy maximum and minimum of the signal \(\varvec{h}^{(i)} \in \mathbb {R}^N\). In turn, the zero point is calculated as:
Then, we can convert back the \(h_{q}^{(i)} \in [0, \ldots , 2^{r^{(i)}}-1]\) value to its floating point representation \(h_{f}^{(i)} \in [h^{(i)}_{\rm min}, \ldots , h^{(i)}_{\rm max}]\) using the dequantization function \(D(\cdot ): \mathbb {N} \rightarrow \mathbb {R}\) as:
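The quantize and dequantize steps described in this subsection can be sketched as below. The sketch assumes the common affine (scale and zero-point) formulation, \(Q(h) = \mathrm {clip}(\mathrm {round}(h/s) + \zeta , q_{\rm min}, q_{\rm max})\) and \(D(h_q) = s\,(h_q - \zeta )\); the exact expressions for the scale and zero point, as well as all function names, are assumptions made here for illustration.

```python
from typing import Tuple


def scale_and_zero_point(h_min: float, h_max: float, bits: int) -> Tuple[float, int]:
    """Assumed affine parameters for an r-bit grid; requires h_max > h_min."""
    q_min, q_max = 0, 2 ** bits - 1
    scale = (h_max - h_min) / (q_max - q_min)
    # Note: Python's round() rounds exact halves to the nearest even integer.
    zero_point = int(round(q_min - h_min / scale))
    return scale, max(q_min, min(q_max, zero_point))


def quantize(h: float, scale: float, zero_point: int, bits: int) -> int:
    """Map a real value to an integer bin, clipping to [0, 2**bits - 1]."""
    q = int(round(h / scale)) + zero_point
    return max(0, min(2 ** bits - 1, q))


def dequantize(h_q: int, scale: float, zero_point: int) -> float:
    """Map an integer bin back to its floating point representative."""
    return scale * (h_q - zero_point)


def quant(h: float, h_min: float, h_max: float, bits: int) -> float:
    """Quantization followed by dequantization (the Quant operator)."""
    scale, zero_point = scale_and_zero_point(h_min, h_max, bits)
    return dequantize(quantize(h, scale, zero_point, bits), scale, zero_point)


if __name__ == "__main__":
    for bits in (8, 4, 2):
        print(bits, quant(0.37, 0.0, 1.0, bits))
```

For in-range values, the round trip changes a value by at most half a bin width, and the bin width doubles (roughly) with every bit removed.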
Following the above notation, the quantized response of the \(i\)-th layer of a network, before applying the activation function, can be calculated as
where \(Quant(\varvec{x}) = D(Q(\varvec{x}, s_x^{(i)}, \zeta _x^{(i)}), s_x^{(i)}, \zeta _x^{(i)})\) denotes the process of quantization followed by the dequantization of a vector \(\varvec{x}\in \mathbb {R}^{N_{i-1}}\) and can be applied element-wise on vectors, while \(\varvec{w}^{(i)}_{f} \in [w^{(i)}_{\rm min}, \ldots , w^{(i)}_{\rm max}]^{N_i \times N_{i-1}}\) and \(\varvec{b}^{(i)}_{f} \in [b^{(i)}_{\rm min}, \ldots , b^{(i)}_{\rm max}]^{N_i}\) denote the quantized weights and biases of the \(i\)-th layer. Finally, the output \(\textbf{z}_{f}^{(i)}\) passes through the photonic activation function, \(g(\cdot )\), of the neuron:
Therefore, all signals involved in the neuron output are distributed in a uniform floating range between \(h_{\rm min}^{(i)}\) and \(h_{\rm max}^{(i)}\) and they can be represented by \(r^{(i)}\) bits. Thus, the quantization error is propagated through the network as a noise signal that is considered during the training process. This can be trivially implemented by injecting quantization errors during the forward pass as follows:
where \(\varvec{\epsilon _w}\), \(\varvec{\epsilon _y}\) and \(\varvec{\epsilon _g}\) are the weight, linear response and activation quantization errors that are calculated as the difference between the original value and the quantized one obtained using the \(Quant(\cdot )\) function. For example, for the weights \(\varvec{w}\), the quantization error is calculated as \(\varvec{\epsilon }_w = Quant(\textbf{w}) - \textbf{w}\). This process is illustrated in Fig. 3 for feedforward networks. It is worth noting that, without loss of generality, the same quantization framework can be applied to other architectures as well, such as recurrent ones. We should note that during the training, the quantization effect is simulated, while backpropagation happens as usual, meaning that the original parameters are updated according to the propagated loss.
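A minimal forward-pass sketch of this error injection is given below. It is an illustration under simplifying assumptions: each signal is quantized over its own minimum and maximum (rather than the EMA-based proxies), a single bit resolution is shared by all signals of the layer, and the names `quant`, `inject`, and `noisy_layer` are hypothetical. Gradients are not modeled; in an actual training pipeline the rounding would be bypassed in the backward pass.

```python
from typing import Callable, List


def quant(values: List[float], bits: int) -> List[float]:
    """Quantize-dequantize on a uniform grid spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant signal: nothing to quantize
        return list(values)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + step * round((v - lo) / step) for v in values]


def inject(values: List[float], bits: int) -> List[float]:
    """Add the quantization error eps = Quant(v) - v to every value."""
    eps = [q - v for q, v in zip(quant(values, bits), values)]
    return [v + e for v, e in zip(values, eps)]


def noisy_layer(w: List[List[float]], b: List[float], x: List[float],
                g: Callable[[float], float], bits: int) -> List[float]:
    """Forward pass of one layer with errors injected into w, b, x, z and g(z)."""
    n_in = len(x)
    flat_w = inject([v for row in w for v in row], bits)
    w_q = [flat_w[k * n_in:(k + 1) * n_in] for k in range(len(w))]
    b_q, x_q = inject(b, bits), inject(x, bits)
    z = [sum(wi * xi for wi, xi in zip(row, x_q)) + bi
         for row, bi in zip(w_q, b_q)]
    y = [g(v) for v in inject(z, bits)]  # linear response error, then activation
    return inject(y, bits)               # activation error


if __name__ == "__main__":
    clip01 = lambda v: max(0.0, min(1.0, v))  # stand-in for a photonic activation
    w, b, x = [[0.5, -0.25], [0.1, 0.9]], [0.0, 0.1], [1.0, 0.5]
    for bits in (16, 4, 1):
        print(bits, noisy_layer(w, b, x, clip01, bits))
```

Running the layer at decreasing bit resolutions shows how the injected errors accumulate in the output, which is the signal the loss function sees during quantization-aware training.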
The proxy minimum (\(h_{\rm min}\)) and maximum (\(h_{\rm max}\)) values of a signal \(\varvec{h} \in \mathbb {R}^N\), on which the scale of the uniform buckets depends, can significantly affect quantization noise, both during training and inference. A large value for \(h_{\rm max}\) (or a small value for \(h_{\rm min}\), respectively) that originates from an outlier of a signal causes a significant portion of signal values to fall within the same buckets, in turn inducing information loss and high quantization noise. To this end, we propose a method to adjust the \(h_{\rm min}\) and \(h_{\rm max}\) values during training using an exponential moving average (EMA). EMA dampens the effect of outliers in vectors and matrices, smoothing in this way the quantization-aware training process. Therefore, models become more robust to outliers during the training and, in turn, more robust to quantization noise injected into signals during the inference.
The EMA is applied iteratively in every training iteration since the distribution of every signal is transformed during the training. To this end, the proxy minimum (\(h_{\rm min}^{(i)}\)) and maximum (\(h_{\rm max}^{(i)}\)) values of a signal \(\varvec{p}^{(i)} \in \mathbb {R}^N\) are calculated in every training iteration \(t\) as:
where \(\beta\) is the weighting parameter of the EMA and the update is applied for \(t>\lceil \beta \rceil\).
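A small sketch of such a range tracker follows. It assumes the conventional EMA recursion, \(h_t = \beta \, h_{t-1} + (1-\beta )\, m_t\) with \(m_t\) the batch minimum or maximum, and assumes that the first iteration simply initializes the running values. The class name `EmaRange` and the choice \(\beta = 0.9\) are hypothetical.

```python
from typing import List, Optional, Tuple


class EmaRange:
    """Tracks proxy min/max of a signal with an exponential moving average."""

    def __init__(self, beta: float = 0.9) -> None:
        self.beta = beta  # assumed to weight the running value
        self.h_min: Optional[float] = None
        self.h_max: Optional[float] = None

    def update(self, values: List[float]) -> Tuple[float, float]:
        lo, hi = min(values), max(values)
        if self.h_min is None or self.h_max is None:
            # First iteration: initialize from the observed range.
            self.h_min, self.h_max = lo, hi
        else:
            self.h_min = self.beta * self.h_min + (1.0 - self.beta) * lo
            self.h_max = self.beta * self.h_max + (1.0 - self.beta) * hi
        return self.h_min, self.h_max


if __name__ == "__main__":
    tracker = EmaRange(beta=0.9)
    print(tracker.update([0.0, 0.4, 1.0]))
    # A single outlier of -10 moves the proxy minimum only to about -1.
    print(tracker.update([-10.0, 0.3, 1.0]))
```

The second update shows the intended effect: an isolated outlier shifts the quantization range only slightly, so most values keep falling into distinct buckets.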
3.2 Mixed-precision quantization-aware training
In addition to quantization-aware training, we propose a gradual approach to reduce the bit resolution of the layers depending on their positions and the training epoch. The proposed method takes into account the expected or observed distribution of bit requirements of every layer of the network and accordingly reduces their bit resolution in a mixed-precision manner during the training. More precisely, in this work, we claim that intermediate layers require lower bit resolutions, in contrast to the first and last layers, which are susceptible to quantization noise and negatively affect the performance of the model when lower bit resolutions are applied. In the case of the first layers, we claim that according to the data processing inequality, the information lost in one layer cannot be recovered in subsequent ones [55]. On the other end, the last layers of the network are dedicated to projecting the extracted features to the multidimensional output space that affects the predictions made by the model. During the training, especially when the cross-entropy loss function is used, similar samples (for example, digits of 2 and 5 in the MNIST dataset) are placed close together in the hyperspace, in contrast to significantly dissimilar samples, which are placed far from one another (for example, digits of 1 and 5 in the MNIST dataset). Thus, models are more susceptible to noise when it is applied to the last layer(s), especially when similar samples are included in the evaluation dataset, reducing in this way their performance.
To this end, we assign to the middle layers a higher probability of bit resolution reduction during the training, while for the outer layers, we assign a smaller probability. The probabilities are distributed to the layers based on a Gaussian distribution, and in every epoch they are increased according to the introduced hyperparameters. Conceptually, the proposed method is motivated by the roulette wheel selection approach that is widely used in genetic algorithms (GAs) [56]. In contrast to GAs, which assign static probabilities to the chromosomes based on their fitness values, the proposed method assigns probabilities to layers that are normally distributed, depending on the layers' positions, and gradually increases them during the training.
More precisely, for each layer, we define the maximum probability of reducing its bit resolution. If a layer reaches the highest probability, then the probability remains constant either until the end of training or until a bit reduction occurs. If a bit reduction is performed in the \(t\)-th epoch, in the next epoch it will start from the lowest probability. More specifically, at every epoch \(t\) we draw as many values \(\varvec{u} \in \mathbb {R}^{n}\) as there are layers, from the respective Gaussian distributions \(\varvec{u} \sim \left[ \mathcal {N}^{(0)}(\mu , \sigma ), \ldots , \mathcal {N}^{(n)}(\mu , \sigma ) \right]\), where \(n\) denotes the number of layers, and \(\mu\) and \(\sigma\) define the mean and standard deviation of the Gaussian distributions, respectively, and are hyperparameters of the proposed method. Every layer \(i\) is assigned a maximum probability of reducing its bit resolution, which depends on its position, and it can be calculated with the cumulative distribution function (CDF) of the standard normal distribution as:
giving the probability that lies in the range between \(a^{(i)}\) and \(a^{(i+1)}\), where \(\varvec{a} \in \mathbb {R}^{n+1}\) is defined as \(\varvec{a} = [-\tau , \ldots , \tau ]\), expressing the uniform slices of the Gaussian distribution along the \(z\)-axis, as depicted in the top row of Fig. 4. The parameter \(\tau\) determines the width of the slices, affecting the maximum probability of each layer. For example, increasing \(\tau\) increases the maximum probabilities of the inner layers while decreasing those of the outer layers. Conversely, decreasing \(\tau\) reduces the maximum probabilities of the inner layers and, consequently, increases those of the outer layers.
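The per-layer maximum probabilities can be sketched as below, assuming that the maximum probability of layer \(i\) is the standard normal mass between consecutive slice edges, \(\Phi (a^{(i+1)}) - \Phi (a^{(i)})\), with \(n+1\) equally spaced edges on \([-\tau , \tau ]\). The function names are hypothetical.

```python
import math
from typing import List


def normal_cdf(z: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """CDF of a normal distribution, expressed through the error function."""
    return 0.5 * (1.0 + math.erf((z - mu) / (sigma * math.sqrt(2.0))))


def slice_edges(n_layers: int, tau: float) -> List[float]:
    """n_layers + 1 equally spaced edges a^(0), ..., a^(n) on [-tau, tau]."""
    return [-tau + 2.0 * tau * k / n_layers for k in range(n_layers + 1)]


def max_probabilities(n_layers: int, tau: float = 3.0) -> List[float]:
    """Assumed maximum bit-reduction probability of every layer."""
    a = slice_edges(n_layers, tau)
    return [normal_cdf(a[i + 1]) - normal_cdf(a[i]) for i in range(n_layers)]


if __name__ == "__main__":
    for tau in (2.0, 3.0, 4.0):
        print(tau, [round(p, 4) for p in max_probabilities(4, tau)])
```

For four layers, the printed values show the inner probabilities growing and the outer ones shrinking as \(\tau\) increases, in line with the behavior described above.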
To this end, if the layer is at its maximum probability, the mixed-precision algorithm checks whether the point \(u^{(i)}_t \sim \mathcal {N}^{(i)}(\mu , \sigma )\), drawn from the respective Gaussian distribution, is within the range, \(u^{(i)}_t \in [a^{(i)},a^{(i+1)})\), which happens with probability \(p^{(i)}_{{\rm max}, t}\) at the \(t\)-th timestep. If \(a^{(i)} \le u^{(i)}_t < a^{(i+1)}\) and \(r^{(i)} > r_{\rm min}\), where \(r_{\rm min} \in \mathbb {N}_{+}\) defines the minimum allowed bit resolution, then the bit resolution of the \(i\)-th layer is reduced to \(r^{(i)}_{t+1} = r^{(i)}_t - r_{\rm step}\), where \(r_{\rm step} \in \mathbb {N}_{+}\) defines the bit reduction step. This can be expressed as:
Having introduced a stochastic approach that systematically gives a higher probability of bit reduction to layers that do not crucially affect the performance of the network, based on both theoretical and experimental observations [41], we also introduce a gradual way of applying it during the quantization-aware training. Giving the full range of \([a^{(i)},a^{(i+1)})\) to the \(i\)-th layer to reduce its bit resolution and, thus, the maximum probability \(p^{(i)}_{\rm max}\) in every epoch would result in very low precision within the very first few epochs of the training. As a result, quantization noise would be significantly high in the first epochs of training, where the network is unstable and not close to convergence, causing difficulties in training, such as vanishing gradient phenomena [57]. This can lead to a bad local minimum and significant performance degradation. To this end, we propose a gradual bit reduction method that in every epoch increases the probability of layers reducing their resolution. This is achieved by uniformly slicing the maximum possible range, \([a^{(i)},a^{(i+1)})\), of the \(i\)-th layer anew, as depicted in the second row of Fig. 4.
More specifically, we introduce a divisor value \(\delta \in \mathbb {N}_{+}\) that slices the maximum available range of \(\varvec{A}^{(i)} = [a^{(i)},a^{(i+1)})\) in \(\delta\) uniform slices as:
where the \(\varvec{\Delta }^{(i)}\) is called active range of \(i\)-th layer.
At every epoch \(t\), the active range is increased either until it reaches its maximum available range, given by \(\varvec{\Delta }^{(i)}_{\delta } = [a^{(i)},a^{(i+1)})\), which is equivalent to the maximum probability, \(p^{(i)}_{\rm max}\), of the \(i\)-th layer, or until the point drawn from the respective Gaussian distribution \(u^{(i)}_t \sim \mathcal {N}^{(i)}(\mu , \sigma )\) falls within the active range \(u^{(i)}_t \in \varvec{\Delta }^{(i)}_{j_t^{(i)} }\). The counter \(j_t^{(i)} \in [1, \ldots , \delta ]\) tracks the epochs elapsed either from the start of training or from the last bit reduction. Therefore, at every training epoch \(t\) in which the drawn value is not within the active range, \(u^{(i)}_{t} \notin \Delta ^{(i)}_{j_t^{(i)} }\), the counter \(j_t^{(i)}\) is increased by one, \(j_{t+1}^{(i)} =\min \{j_{t}^{(i)}+1, \delta \}\), if it is not already equal to \(\delta\), increasing in this way the active range \(\varvec{\Delta }^{(i)}_{j+1}\) by another slice, expressed by
The two cases define the probability increase for both sides of the normal distribution, as shown in the second row of Fig. 4. If the range of the layer is on the negative side of the distribution, then the probability increases by unlocking the next slice on its right. Otherwise, if the layer is on the positive side of the distribution, the probability is increased by decreasing the lower bound to unlock another slice. In both cases, the probability of the \(i\)-th layer performing a bit reduction is increased, \(p^{(i)}_{j+1} \ge p^{(i)}_{j}\). In the case where \(u^{(i)}_t \in \Delta _{j}^{(i)}\), the active range is reset to its minimum, \(\varvec{\Delta }^{(i)}_{j=1}\), and the method performs a bit reduction according to \(r^{(i)}_{t} = {\rm max}\{r^{(i)}_{t-1} - r_{\rm step}, r_{\rm min}\}\). This procedure enables us to smoothly increase the injected quantization noise, while ensuring that the noise will be first introduced in layers that are more robust to its effects. The probability of the \(i\)-th layer reducing its precision by \(r_{\rm step}\) bits at the \(t\)-th epoch is given by:
where \(j\) is calculated at every epoch \(t\) as:
with \(\{j_1^{(i)}=1 \mid 1< i < n \}\).
In this work, we used the default hyperparameter values for all experimental evaluation cases. More specifically, we empirically propose using a normal distribution with zero mean and variance equal to 1, \(\mathcal {N}(\mu =0,\sigma =1)\), while setting \(\tau =3\). The gradual increase of the bit reduction probability of each layer depends on \(\delta\), which we set to \(\delta =M/4\), where \(M\) is the number of epochs. Finally, the bit reduction step is set to \(r_{\rm step}=2\). It should be mentioned that the proposed method is presented for an even number of layers for simplicity, but it can be applied straightforwardly to an odd number of layers by shifting the normal distribution right or left.
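The gradual bit reduction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that the interval \([-\tau, \tau]\) is split into equal-width ranges, one per layer, that each range is split into \(\delta\) equal slices, and that layers in the first half of the network lie on the negative side of the distribution. The function names (`active_range`, `reduction_probability`, `schedule_bits`) are hypothetical.

```python
import math
import random


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def active_range(i, j, n_layers, delta, tau=3.0):
    """Active range of layer i (0-based) with j unlocked slices, 1 <= j <= delta."""
    width = 2.0 * tau / n_layers            # assumption: equal-width layer ranges
    lo, hi = -tau + i * width, -tau + (i + 1) * width
    grown = width * j / delta               # assumption: equal-width slices
    if i < n_layers // 2:                   # negative side: unlock slices to the right
        return lo, lo + grown
    return hi - grown, hi                   # positive side: decrease the lower bound


def reduction_probability(i, j, n_layers, delta, tau=3.0):
    """Probability that a draw from N(0, 1) falls in the active range of layer i."""
    lo, hi = active_range(i, j, n_layers, delta, tau)
    return normal_cdf(hi) - normal_cdf(lo)


def schedule_bits(n_layers, epochs, r_init=8, r_min=2, r_step=2, tau=3.0, seed=0):
    """Simulate only the per-layer bit schedule (no network training involved)."""
    rng = random.Random(seed)
    delta = max(1, epochs // 4)             # delta = M / 4, as in the text
    bits = [r_init] * n_layers
    j = [1] * n_layers                      # every layer starts with one slice
    for _ in range(epochs):
        for i in range(n_layers):
            u = rng.gauss(0.0, 1.0)         # one draw per layer and epoch
            lo, hi = active_range(i, j[i], n_layers, delta, tau)
            if lo <= u < hi:
                bits[i] = max(bits[i] - r_step, r_min)   # bit reduction
                j[i] = 1                                  # reset to the minimum range
            else:
                j[i] = min(j[i] + 1, delta)               # unlock one more slice
    return bits
```

Calling `schedule_bits(4, 100)` returns one bit resolution per layer. Under these assumptions, the intermediate layers cover the high-density region around the mean and are therefore reduced more often than the first and last layers.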
The proposed framework is presented algorithmically in Algorithms 1 and 2. Algorithm 1 presents the mixed-precision quantization-aware training, while Algorithm 2 presents the quantization of a parameter vector or matrix. More specifically, Algorithm 1 takes as input the number of epochs \(M\), the trainable parameters of each layer (weights and biases), the initial and minimum bit resolutions, as well as the hyperparameters of mixed-precision quantization-aware training, namely \(r_{\rm step}\), \(\tau\) and \(\delta\). First, the algorithm constructs the slices that define the maximum probability of each layer (line 2) and assigns to each layer the minimum active probability (lines 4–5) and the initial bit resolution (line 6). The network is then trained in a quantization-aware manner (lines 7–21), quantizing every parameter involved in the forward pass (lines 8–19) and applying the proposed gradual bit reduction approach, which calculates the bit resolution, \(r_{t}^{(i)}\), of each layer for the \(t\)-th epoch (lines 8–15). The proposed method first draws a value, \(u_{t}^{(i)}\), for each layer from the respective Gaussian distribution (line 9) and checks whether it is within the active range of the layer (lines 10–15). If \(u_{t}^{(i)}\) is within the active range (lines 10–12), the bit resolution is reduced if necessary (line 11) and the active probability is reset to its minimum (line 12). If the drawn value is not in the active range (lines 13–15), the active probability is increased, unless it has already reached its maximum.
After that, quantization-aware training proceeds as usual. The quantization of parameters, which is explained in Algorithm 2, is applied to \(\varvec{w}_{f}^{(i)}\) (weights of the \(i\)-th layer), \(\varvec{b}^{(i)}_{f}\) (biases of the layer) and \(\varvec{y}^{(i-1)}_{f}\) (inputs of the layer), and then to the outputs \(\varvec{z}_{f}^{(i)}\) of each layer \(i\) during the forward pass. Then, backpropagation updates the original weights and biases using the loss function \(J(\varvec{y}_f, \varvec{t})\). More specifically, Algorithm 2 presents the quantization of a matrix or vector parameter, \(\varvec{p}\). Initially, we compute \(p^{\rm min}\) and \(p^{\rm max}\) using an exponential moving average (EMA) (lines 2–3), which are then used to compute the scale and zero point (lines 4–5). In turn, \(\varvec{p}\) is quantized to a finite set of integers in \([0, \ldots , 2^{r^{(i)}_{t}}]\) (line 6) and then mapped back to the floating-point range \([\tilde{p}^{\rm min}, \ldots , \tilde{p}^{\rm max}]\) (line 7), which is the final output of the algorithm.
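The quantize-and-dequantize step described above can be illustrated with a short Python sketch. It is not a transcription of Algorithm 2: it follows the standard uniform affine scheme with a scale and a zero point, and it assumes that the top integer level is \(2^{r}-1\) and that the EMA momentum is 0.9. Both are assumptions made for illustration, as are the function names.

```python
def ema_update(previous, current, momentum=0.9):
    """Exponential moving average; the first observation initializes the statistic.

    The momentum value is an assumption, not taken from the paper.
    """
    if previous is None:
        return current
    return momentum * previous + (1.0 - momentum) * current


def quantize_dequantize(values, bits, p_min, p_max):
    """Map floats to integers in [0, 2**bits - 1] and back to floats.

    p_min and p_max are the (EMA-tracked) proxy minimum and maximum.
    """
    levels = 2 ** bits - 1                  # assumption: 2**bits distinct levels
    scale = (p_max - p_min) / levels
    if scale <= 0.0:                        # degenerate range: nothing to quantize
        return list(values)
    # Integer that represents the real value 0, clamped to the valid range.
    zero_point = min(max(round(-p_min / scale), 0), levels)
    out = []
    for v in values:
        q = min(max(round(v / scale) + zero_point, 0), levels)   # quantize
        out.append((q - zero_point) * scale)                     # dequantize
    return out
```

During training, `ema_update` would be called once per batch with the observed minimum and maximum of the tensor, and `quantize_dequantize` would be applied to weights, biases, inputs and outputs with the current bit resolution of the layer.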
4 Experimental evaluation
To demonstrate the effectiveness of the proposed method, we first conduct experiments to investigate the bit requirements of each layer based on its position, showing that intermediate layers require lower bit resolutions than the outer ones. In turn, we present experimental results showing the substantial inference time benefits of applying mixed-precision quantization techniques. After that, we evaluate the proposed method by applying it to different neural network architectures and configurations, including Multi-Layer Perceptrons (MLPs), CNNs and Recurrent Neural Networks (RNNs). For all applied architectures, we evaluate the performance of the proposed method using two different photonic configurations, based on the photonic sigmoid and sinusoidal activation functions presented in Sect. 2. More specifically, we demonstrate its capabilities in two image classification tasks, using the MNIST and CIFAR10 datasets, and in a high-frequency financial time series forecasting task, using the FI2010 dataset [58]. We report experimental results for average bit resolutions ranging from 2 to 7 bits, acquired over multiple evaluation runs. Finally, in the case of the CIFAR10 task, we demonstrate the efficiency of the proposed method in terms of inference time, applying the evaluation framework introduced in Sect. 2. For benchmarking, we use two uniform quantization baselines: (a) a post-training quantization method that uses the minimum and maximum values of a matrix to uniformly distribute the distinct values and (b) the quantization-aware training method presented in Sect. 3, which uses EMA during training to calculate the proxy minimum, \(h_{\rm min}\), and maximum, \(h_{\rm max}\), values.
4.1 Mixed-precision evaluation
We experimentally demonstrate the bit requirements of each layer by investigating different bit resolution configurations in 3 well-known convolutional architectures. More specifically, we applied LeNet5 [59], AlexNet8 [60] and ResNet9 [61] to CIFAR10, a traditionally used image classification task, and trained them using the AdamW [62] optimizer for 100 epochs with a learning rate of 0.001 and a weight decay of \(10^{-5}\). After completing 5 training runs for each architecture, we evaluate the trained models at different bit resolutions, applying the reduced resolution to one layer at a time, from the first to the last layer, using the simplest post-training quantization method. This quantization method uniformly quantizes the parameters, using the minimum and maximum values for \(h_{\rm min}\) and \(h_{\rm max}\), respectively. We present the average accuracy over 5 evaluation runs for each bit resolution between 2 and 7 bits, applied to each layer of the network. This allows us to investigate the relation between each layer's bit resolution and the overall performance of the model.
The experimental results are presented in Fig. 5. More specifically, in the upper row, we plot the average evaluation accuracy for different bit resolutions when applied to different layers. The dotted red line presents the average accuracy, over 5 evaluation runs, when we apply standard 32-bit floating-point arithmetic. The different lines represent the different layers of the models. In the bottom row, we report the minimum bit resolution required by each layer without a significant accuracy drop, \(<1\%\). In cases where it is not possible to reach the acceptable accuracy, we report the maximum available bit resolution, which is 7 bits. We should note that our analysis does not take into account the inner dependencies of bit requirements, meaning the effect on the bit resolution of one layer when the previous one also has a reduced bit resolution, since we apply bit reduction to only one layer at a time.
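The selection rule used for the bottom row of Fig. 5 can be written as a small helper. The sketch below is illustrative only: the function name is hypothetical, accuracies are assumed to be expressed in percent, and the numbers in the usage note are made up for the example and are not results from the paper.

```python
def minimum_bits(accuracy_by_bits, full_precision_accuracy, tolerance=1.0, fallback=7):
    """Smallest bit resolution whose accuracy drop stays below the tolerance.

    accuracy_by_bits maps a bit resolution to the accuracy (in percent) obtained
    when only the examined layer is quantized to that resolution. If no resolution
    is acceptable, the maximum available resolution (fallback) is returned.
    """
    acceptable = [
        bits
        for bits, accuracy in accuracy_by_bits.items()
        if full_precision_accuracy - accuracy < tolerance
    ]
    return min(acceptable) if acceptable else fallback
```

For example, with the made-up accuracies `{2: 40.0, 3: 88.5, 4: 89.6, 5: 89.8, 6: 89.9, 7: 90.0}` and a full-precision accuracy of 90.0, the helper returns 4, since the drop at 3 bits (1.5 points) exceeds the tolerance.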
In the first row of the figure, we can see that the first and last layers of the network either start from a lower performance than the intermediate layers, or their performance drops significantly even for small bit reductions. For example, the last layers of LeNet5 and AlexNet8 achieve lower accuracy even at bit resolutions greater than 3. We observe similar behavior in the ResNet9 architecture, where the performance degradation is visible for the last layer even when applying a 3-bit resolution. In the same architecture, the performance also collapses when we apply lower bit resolutions to the first layers of the model. More precisely, a degradation of nearly \(8\%\) occurs when reducing the bit resolution from 8 to 3 bits.
The equivalent behavior followed by all three architectures is clearly depicted in the second row of the figure, in which we report the minimum bit resolution required by each layer without significant performance degradation, \(<1\%\). We observe that when we apply bit reduction to the first and last layers of both the LeNet5 and AlexNet8 architectures, they are not able to compensate for the resulting noise and resist performance degradation. Thus, the minimum bit resolution for such layers is the maximum available, which is 7 bits. A similar behavior is also observed in the ResNet9 architecture. Even though the first and last layers can compensate for the quantization noise arising from bit reduction at bit resolutions larger than 4 bits, they still cannot be adjusted to lower bit resolutions, such as 2 and 3 bits, which the intermediate layers are able to tolerate. We attribute this behavior to the significance of the contribution of these layers to the final outputs of the network. On the one hand, the first layers of the network are responsible for building the representation maps used by the following layers to classify samples. Thus, injecting quantization noise into them, e.g., into the first two layers according to Fig. 5, results in significant information loss that cannot be recovered in subsequent layers, according to the Data Processing Inequality [55]. Therefore, the extracted feature maps lack significant details that could be used by the following layers of the network to correctly classify the samples. On the other hand, the last layers are intrinsically susceptible to noise, since models are traditionally trained with a cross-entropy loss, which typically leads the classification neurons, meaning the neurons of the classification layer, to output relatively larger values for samples of the correct class than for the incorrect ones, according to sample similarities [63].
In this way, without regularization or label smoothing techniques, the outputs are not distributed in equidistant clusters depending on their class, but in a more linear way, making them susceptible to noise phenomena, such as quantization noise [46]. It is worth noting that ignoring the noise during training results in significant performance degradation, which can be avoided by taking the noise into account [24, 28], as is also demonstrated in the following experimental results.
To demonstrate the effectiveness of the mixed-precision approach in terms of inference time, we plot in Fig. 6 the expected inference time depending on the NEQB of each layer, applying the analysis framework presented in Sect. 2. More specifically, for each architecture, the left bar represents the inference time needed when the mixed-precision quantization technique is applied. The bit resolution of each layer is obtained from the previous experimental evaluation, as presented in the second row of Fig. 5, defining ideal cases of the mixed-precision post-training quantization technique. The right bar reports fixed-precision quantization, where every layer has a resolution of 7 bits. The reduction in inference time is considerable even in smaller models. For example, in the LeNet5 architecture, the inference time of the model is 1/3 lower than in the fixed-precision case. As the models increase in size, and as a result the MACs per layer increase, the inference time is further reduced, as depicted for the ResNet9 architecture, where a 90% reduction is achieved. This is an expected behavior, since larger models are over-parameterized, with a large number of MACs per layer, and therefore each parameter has a smaller contribution to the final output than in smaller ones. To this end, a smaller bit resolution can be applied to the network without significant performance degradation, as shown in Fig. 5. Combined with the fact that the inference time scales exponentially with the bit resolution, as depicted in Fig. 2b in Sect. 2, this explains the large improvement in terms of time performance.
4.2 Image classification
For the image classification benchmarks, we use two traditional datasets, MNIST and CIFAR10, applying two architectures, a 4-layer MLP and an 8-layer CNN, respectively. The small architectures are used to demonstrate the capability of the method to perform on architectures that can potentially be implemented using electro-optical components, based on current capabilities and limitations. For the MNIST case, the employed architecture is composed of 4 linear layers, with the first and last layers having 10 neurons each, and the intermediate layers having 20 neurons. Photonic activation functions are applied after every hidden layer. In the case of CIFAR10, we use an 8-layer CNN with residual connections. More specifically, the CNN consists of 3 convolutional layers, 2 residual blocks, which are composed of 2 convolutional layers each, and an output classification layer. The first two convolutional layers use \(3\times 3\) kernels, without bias, applying stride and padding equal to one, with 64 and 128 channels, respectively. Both layers are followed by a batch normalization layer, while the second layer is also followed by a maxpooling layer, with kernel size and stride equal to 2. The output of the maxpooling layer is then passed to the residual block, composed of two convolutional layers, both of which are followed by a batch normalization layer, where the input of the residual block is added to its outputs. A similar residual block with 256 channels is applied after the fifth convolutional layer and before the output linear layer. The CNN architecture is similar to, but smaller than, a ResNet architecture and is named ResNet8.
For both image classification tasks, we use the RMSProp optimizer with a learning rate of 0.0001 and a batch size of 256. We train the models for 100 epochs and repeat the process 5 times to report the average and variance of the evaluation accuracy for every evaluated method in Tables 1 and 2. For the proposed method, we report the average evaluation accuracy per average bit resolution, where the latter is calculated by floor rounding the bit resolutions of the layers. We use the default hyperparameters of the proposed method, setting the initial bit resolution in a range between 8 and 3 bits. As a reference, we also report the average evaluation accuracy over five runs for 32-bit full-precision models.
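As a small worked example of the average bit resolution used for reporting, the helper below takes the per-layer resolutions of a trained model and floor rounds their mean. Reading "floor rounding the bit resolutions of the layers" as the floor of the arithmetic mean is an assumption, and the function name is hypothetical.

```python
import math


def average_bit_resolution(layer_bits):
    """Floor of the mean per-layer bit resolution (assumed interpretation)."""
    return math.floor(sum(layer_bits) / len(layer_bits))
```

For a 4-layer model that ends training with 6 bits in the first layer, 2 bits in the two intermediate layers and 4 bits in the last layer, the mean is 3.5 and the reported average bit resolution is 3.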
In Table 1, we report the evaluation accuracy over 5 evaluation runs for the MNIST cases. We first observe that quantization-aware training (columns 3–4) is crucial for the performance of the models. More specifically, when applying the post-training quantization method (column 2) in the photonic sinusoidal configuration, the performance of the models collapses. Even in the photonic sigmoid cases, where post-training quantization obtains comparable performance, it still leads to a large performance degradation, especially for resolutions lower than 4 bits. This highlights the limitations of such post-training quantization methods, which uniformly quantize parameters without taking into account possible outliers, making them especially susceptible and unstable to quantization noise. Such instabilities can be avoided by using quantization-aware training approaches that make models more robust to outliers by using EMA to calculate the proxy minimum, \(h_{\rm min}\), and maximum, \(h_{\rm max}\), values. Quantization-aware trained models (column 3) significantly resist performance degradation even at resolutions down to 3 bits, where the performance degradation against full-precision models is lower than \(9\%\) in the photonic sinusoidal case and lower than \(3\%\) in the sigmoid case. The proposed method (column 4) achieves even better performance at all evaluated resolutions, significantly increasing the accuracy at lower bit resolutions. For example, in the case of 2 bits in the photonic sinusoidal configuration, the accuracy is improved by about \(9\%\) compared with quantization-aware training, highlighting the contribution of gradually reducing the bit resolution in a mixed-precision manner. Furthermore, it allows us to decrease the bit resolution to 3 bits for both activation functions, with a performance degradation lower than \(4\%\) against the full-precision models.
In Fig. 7, we plot the probability distribution of each layer over different epochs, as well as the accuracy during training compared with quantization-aware training, giving us further insight into the advantages of the proposed method. More specifically, in the first two columns of the figure we schematically represent the probability density of every layer of the network for 4 different epochs (5, 10, 25 and 50). The network is trained for 100 epochs, using the default hyperparameters of the proposed method, with the initial and minimum resolutions set to 8 and 2 bits, respectively. As depicted in the first two columns of the figure, the intermediate layers (layers 2 and 3) have a higher probability of reducing their bit resolutions at epoch 5. At epoch 10, the probability of the second layer is reduced, since a bit reduction is performed at an epoch between 5 and 10, while the probability of the third layer is increased.
This is depicted more clearly in the subfigure in the third column of the second row of the figure, where the probabilities of every layer during training are plotted. As shown, for every layer, the probability increases until the point where the bit reduction occurs, with this point differing for each layer, depending on its position and the epoch at which the last bit reduction was performed. For example, for the second layer, two bit reductions have occurred by epoch 25, one at epoch 8 and another at epoch 19, while for the third one, only one bit reduction occurs, at epoch 17. The third layer reaches the maximum probability at epoch 70, where the last bit reduction is performed, and after that its probability is reduced to zero, since the layer has reached the lowest available bit resolution (2 bits).
The epochs of the bit reductions are reported in the last subfigure of the first row with a star marker in the reference color of the layer. In this subfigure, we also observe the benefits of the proposed method compared with quantization-aware training. Quantization-aware training reaches a plateau in the very early stages of training, and after that only minor improvements are observed. On the other hand, the proposed method converges faster to a better local minimum, with the bit reductions impacting the performance later during training but still leading to better performance than quantization-aware training. At the end of training, the models trained with the proposed method end with a resolution of 2 bits in the intermediate layers, 6 bits in the first layer and 4 bits in the last, achieving \(5\%\) better accuracy than the models trained with quantization-aware training.
Performance improvements are also obtained in convolutional architectures when we apply the proposed gradual mixed-precision quantization-aware training, as reported in Table 2, highlighting the generalization ability of the method. More specifically, in convolutional architectures the employed post-training quantization baseline (column 2) collapses even at a resolution of 7 bits. To this end, quantization-aware training (column 3) contributes substantially to model performance when lower bit resolutions are applied. More specifically, by applying quantization-aware training we are able to reduce the bit resolution down to 4 bits with less than \(10\%\) performance degradation. On top of that, the proposed method (column 4) enables us to reduce the bit resolution while achieving even better performance. Interestingly, in the photonic sigmoid cases, the improvements in evaluation accuracy over quantization-aware training are larger than \(2\%\) at all bit resolutions.
The proposed method can also improve the inference time required by a PNN. Using the formula presented in Sect. 2, we calculate the inference time for different evaluation runs. More specifically, in Fig. 8, the inference time is reported for both quantization-aware training and the proposed method. As depicted in both figures, by applying the proposed quantization method we are able to perform inference faster and more accurately than with quantization-aware training. Faster inference can be very useful in high-rate applications, where a high-frequency response is required. As presented, applying the proposed method leads to better performance per average bit precision in both MLPs and CNNs, irrespective of the applied activation function, and significantly reduces the inference time with the same or better accuracy. This highlights the effectiveness and robustness of the proposed method in different architectures, as well as its generalization ability.
4.3 Forecasting financial time series
The dataset used to evaluate the photonic recurrent architecture is a high-frequency financial time series limit order book dataset (FI-2010) [64], consisting of more than 4,000,000 limit orders from 5 Finnish companies. The data processing schema and evaluation procedure are extensively described in [58]. For the following experiments, dataset splits 1 to 5 were used. The forecasting task is to predict the movement of the future mid-price after the next 10 time steps, which can go down, go up or remain stationary. The DL network used for the experiment consists of a recurrent photonic layer with 32 neurons, as described in Sect. 2. The output of the recurrent layer is fed to two fully connected layers consisting of 512 and 3 neurons, respectively. The length of the time series that is fed to the model is 10 (the current and the past 9 time steps). The models are optimized for 10 epochs using the RMSProp optimizer, with the learning rate set to \(10^{-4}\).
The average and variance of the Cohen kappa scores for five splits over 3 evaluation runs are reported in Table 3 for both activation functions. The kappa metric is used since the dataset is highly imbalanced. Similarly to the image classification cases, the proposed method outperforms the baseline approaches and significantly improves the performance of the models when lower bit resolutions are applied. For both activation functions, we even observe performance improvements over the full-precision models when we employ the proposed method. This is an expected behavior, since quantization regularizes the network and in this way over-fitting is avoided. To this end, higher average scores are observed than for the full-precision models, even when the proposed method leads to average bit resolutions near 4 bits. In conclusion, the proposed framework can lead to higher average performance with lower bit requirements in a wide range of applications and architectures, irrespective of the employed photonic configuration, as demonstrated by the conducted experiments.
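To make the choice of metric concrete, the following self-contained sketch computes Cohen's kappa from its standard definition, \(\kappa = (p_o - p_e)/(1 - p_e)\), where \(p_o\) is the observed agreement and \(p_e\) the agreement expected by chance. It is an illustration of the metric, not the evaluation code used for Table 3. Returning 0.0 in the degenerate case \(p_e = 1\) is a convention chosen here.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa between true and predicted class labels."""
    n = len(y_true)
    # Observed agreement: fraction of samples classified correctly.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of the marginal class frequencies, summed over classes.
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:                     # degenerate case: kappa is undefined
        return 0.0
    return (observed - expected) / (1.0 - expected)
```

On an imbalanced toy set with eight samples of class 0 and two of class 1, a model that always predicts class 0 reaches 80% accuracy but a kappa of 0, which is why kappa is the more informative score for imbalanced data.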
5 Conclusions
We propose a quantization-aware training framework that is oriented to uniform mixed-precision quantization, but can also be easily extended to other quantization schemes, such as logarithmic and dynamic approaches. Additionally, evaluating the energy efficiency of such quantization schemes using a theoretical framework is a promising research direction, since it makes it possible to evaluate quantization approaches before they are deployed on the actual hardware. This allows one to estimate the energy consumption of an applied PNN beforehand.
In this paper, we focus on lowering the bit resolution within a model to increase the computational rate when novel dynamic PNNs are applied. More specifically, we proposed a quantization-aware training method that enables one to significantly reduce the bit resolution of layers, based on the observation that intermediate layers are able to operate at lower bit resolutions without negatively affecting the performance of the models. The proposed method offers advantages over traditional quantization-aware training methods, since it takes into account the required bit distribution within the network and gradually reduces the resolution of each layer according to its position. Based on the conducted experiments, we propose to normally distribute the probabilities that define the possibility of bit reduction in each layer, and to incrementally increase them during training. The effectiveness of the proposed method is demonstrated in various architectures, tasks and photonic configurations, confirming its capability to significantly decrease the average bit resolution of models compared with the evaluated baselines, as well as to reduce the inference time. In this way, it benefits the potential use of analog computing and, more specifically, the use of neuromorphic photonics, by further increasing the computational rate of the developed accelerators while keeping the energy consumption low.
Data availability
The data that support the findings of this study are available from the corresponding authors on reasonable request.
References
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444
Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal S, Herbert-Voss A, Krueger G, Henighan T, Child R, Ramesh A, Ziegler DM, Wu J, Winter C, Hesse C, Chen M, Sigler E, Litwin M, Gray S, Chess B, Clark J, Berner C, McCandlish S, Radford A, Sutskever I, Amodei D (2020) Language models are few-shot learners. https://doi.org/10.48550/ARXIV.2005.14165. arXiv:2005.14165
Ramesh A, Pavlov M, Goh G, Gray S, Voss C, Radford A, Chen M, Sutskever I (2021) Zero-shot text-to-image generation. https://doi.org/10.48550/ARXIV.2102.12092. arXiv:2102.12092
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I (2017) Attention is all you need. https://doi.org/10.48550/ARXIV.1706.03762. arXiv:1706.03762
Silver D, Hubert T, Schrittwieser J, Antonoglou I, Lai M, Guez A, Lanctot M, Sifre L, Kumaran D, Graepel T, Lillicrap T, Simonyan K, Hassabis D (2017) Mastering chess and shogi by self-play with a general reinforcement learning algorithm. https://doi.org/10.48550/ARXIV.1712.01815. arXiv:1712.01815
Totović AR, Dabos G, Passalis N, Tefas A, Pleros N (2020) Femtojoule per mac neuromorphic photonics: an energy and technology roadmap. IEEE J Sel Top Quantum Electron 26(5):1–15. https://doi.org/10.1109/JSTQE.2020.2975579
Pleros N, Moralis-Pegios M, Totovic A, Dabos G, Tsakyridis A, Giamougiannis G, Mourgias-Alexandris G, Passalis N, Kirtas M, Tefas A (2021) Compute with light: architectures, technologies and training models for neuromorphic photonic circuits. In: 2021 european conference on optical communication (ECOC), pp 1–4. https://doi.org/10.1109/ECOC52684.2021.9606046
Moralis-Pegios M, Totovic A, Tsakyridis A, Giamougiannis G, Mourgias-Alexandris G, Dabos G, Passalis N, Kirtas M, Tefas A, Pleros N (2022) Photonic neuromorphic computing: architectures, technologies, and training models. In: 2022 optical fiber communications conference and exhibition (OFC), pp 01–03
Lin X, Rivenson Y, Yardimci NT, Veli M, Luo Y, Jarrahi M, Ozcan A (2018) All-optical machine learning using diffractive deep neural networks. Science 361(6406):1004–1008
Shen Y, Harris NC, Skirlo S, Prabhu M, Baehr-Jones T, Hochberg M, Sun X, Zhao S, Larochelle H, Englund D et al (2017) Deep learning with coherent nanophotonic circuits. Nat Photonics 11(7):441
Totovic A, Pappas C, Kirtas M, Tsakyridis A, Giamougiannis G, Passalis N, Moralis-Pegios M, Tefas A, Pleros N (2022) Wdm equipped universal linear optics for programmable neuromorphic photonic processors. Neuromorphic Comput Eng 2(2):024010
Giamougiannis G, Tsakyridis A, Mourgias-Alexandris G, Moralis-Pegios M, Totovic A, Dabos G, Passalis N, Kirtas M, Bamiedakis N, Tefas A, Lazovsky D, Pleros N (2021) Silicon-integrated coherent neurons with 32gmac/sec/axon compute line-rates using eam-based input and weighting cells. In: 2021 European conference on optical communication (ECOC), pp 1–4 https://doi.org/10.1109/ECOC52684.2021.9605987
Mourgias-Alexandris G, Moralis-Pegios M, Simos S, Dabos G, Passalis N, Kirtas M, Rutirawut T, Gardes FY, Tefas A, Pleros N (2021) A silicon photonic coherent neuron with 10gmac/sec processing line-rate. In: Optical fiber communication conference (OFC) 2021, pp 5–1. Optica Publishing Group, https://doi.org/10.1364/OFC.2021.Tu5H.1. https://opg.optica.org/abstract.cfm?URI=OFC-2021-Tu5H.1
Tsakyridis A, Giamougiannis G, Mourgias-Alexandris G, Totovic A, Dabos G, Passalis N, Kirtas M, Tefas A, Moralis-Pegios M, Pleros N (2022) Silicon photonic neuromorphic computing with 16 ghz input data and weight update line rates. In: 2022 conference on lasers and electro-optics (CLEO), pp 1–2
Tsakyridis A, Giamougiannis G, Moralis-Pegios M, Mourgias-Alexandris G, Totovic AR, Kirtas M, Passalis N, Lazovsky D, Tefas A, Pleros N (2022) Universal linear optics for ultra-fast neuromorphic silicon photonics towards fj/mac and tmac/sec/mm2 engines. IEEE J Sel Top Quantum Electron 28(6: High Density Integr. Multipurpose Photon. Circ.):1–15. https://doi.org/10.1109/JSTQE.2022.3219288
Kirtas M, Passalis N, Mourgias-Alexandris G, Dabos G, Pleros N, Tefas A (2022) Learning photonic neural network initialization for noise-aware end-to-end fiber transmission. In: 2022 30th European signal processing conference (EUSIPCO), pp 1731–1735. https://doi.org/10.23919/EUSIPCO55093.2022.9909781
Kirtas M, Passalis N, Kalavrouziotis D, Syrivelis D, Bakopoulos P, Pleros N, Tefas A (2022) Early detection of ddos attacks using photonic neural networks. In: 2022 IEEE 14th image, video, and multidimensional signal processing workshop (IVMSP), pp 1–5 https://doi.org/10.1109/IVMSP54334.2022.9816178
Giamougiannis G, Tsakyridis A, Moralis-Pegios M, Mourgias-Alexandris G, Totovic AR, Dabos G, Kirtas M, Passalis N, Tefas A, Kalavrouziotis D, Syrivelis D, Bakopoulos P, Mentovich E, Lazovsky D, Pleros N (2023) Neuromorphic silicon photonics with 50 GHz tiled matrix multiplication for deep-learning applications. Adv Photonics 5(1):016004. https://doi.org/10.1117/1.AP.5.1.016004
Passalis N, Kirtas M, Mourgias-Alexandris G, Dabos G, Pleros N, Tefas A (2021) Training noise-resilient recurrent photonic networks for financial time series analysis. In: 2020 28th european signal processing conference (EUSIPCO), pp 1556–1560 https://doi.org/10.23919/Eusipco47968.2020.9287649
Moralis-Pegios M, Mourgias-Alexandris G, Tsakyridis A, Giamougiannis G, Totovic A, Dabos G, Passalis N, Kirtas M, Rutirawut T, Gardes FY, Tefas A, Pleros N (2022) Neuromorphic silicon photonics and hardware-aware deep learning for high-speed inference. J Lightwave Technol 40(10):3243–3254. https://doi.org/10.1109/JLT.2022.3171831
Mourgias-Alexandris G, Tsakyridis A, Passalis N, Kirtas M, Tefas A, Rutirawut T, Gardes FY, Pleros N, Moralis-Pegios M (2021) 25gmac/sec/axon photonic neural networks with 7ghz bandwidth optics through channel response-aware training. In: 2021 European conference on optical communication (ECOC), pp 1–4. https://doi.org/10.1109/ECOC52684.2021.9606097
Mourgias-Alexandris G, Moralis-Pegios M, Tsakyridis A, Passalis N, Kirtas M, Tefas A, Rutirawut T, Gardes FY, Pleros N (2022) Channel response-aware photonic neural network accelerators for high-speed inference through bandwidth-limited optics. Opt Express 30(7):10664–10671. https://doi.org/10.1364/OE.452803
Mourgias-Alexandris G, Moralis-Pegios M, Tsakyridis A, Simos S, Dabos G, Totovic A, Passalis N, Kirtas M, Rutirawut T, Gardes F et al (2022) Noise-resilient and high-speed deep learning with coherent silicon photonics. Nat Commun 13(1):5572
Kirtas M, Passalis N, Mourgias-Alexandris G, Dabos G, Pleros N, Tefas A (2023) Robust architecture-agnostic and noise resilient training of photonic deep learning models. IEEE Trans Emerg Top Comput Intell 7(1):140–149. https://doi.org/10.1109/TETCI.2022.3182765
Kirtas M, Passalis N, Oikonomou A, Mourgias-Alexandris G, Moralis-Pegios M, Pleros N, Tefas A (2022) Normalized post-training quantization for photonic neural networks. In: 2022 IEEE symposium series on computational intelligence (SSCI), pp 657–663. https://doi.org/10.1109/SSCI51031.2022.10022168
Oikonomou A, Kirtas M, Passalis N, Mourgias-Alexandris G, Moralis-Pegios M, Pleros N, Tefas A (2022) A robust, quantization-aware training method for photonic neural networks. In: Iliadis L, Jayne C, Tefas A, Pimenidis E (eds) Engineering applications of neural networks. Springer, Cham, pp 427–438
Paolini E, De Marinis L, Cococcioni M, Valcarenghi L, Maggiani L, Andriolli N (2022) Photonic-aware neural networks. Neural Comput Appl 34(18):15589–15601
Kirtas M, Oikonomou A, Passalis N, Mourgias-Alexandris G, Moralis-Pegios M, Pleros N, Tefas A (2022) Quantization-aware training for low precision photonic neural networks. Neural Netw 155:561–573. https://doi.org/10.1016/j.neunet.2022.09.015
Pearson C (2011) High-speed, analog-to-digital converter basics. Texas instruments application report, SLAA510
Jacob B, Kligys S, Chen B, Zhu M, Tang M, Howard A, Adam H, Kalenichenko D (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. Proc. IEEE computer society conf. on computer vision and pattern recognition, pp 2704–2713. https://doi.org/10.1109/CVPR.2018.00286. arXiv:1712.05877
Kulkarni U, Meena S, Gurlahosur SV, Bhogar G (2021) Quantization friendly MobileNet (QF-MobileNet) architecture for vision based applications on embedded platforms. Neural Netw 136:28–39
Lee D, Wang D, Yang Y, Deng L, Zhao G, Li G (2021) QTTNet: quantized tensor train neural networks for 3D object and video recognition. Neural Netw 141:420–432. https://doi.org/10.1016/j.neunet.2021.05.034
Wu J, Leng C, Wang Y, Hu Q, Cheng J (2016) Quantized convolutional neural networks for mobile devices. In: Proc. of the IEEE conf. on computer vision and pattern recognition, pp 4820–4828
Esser SK, McKinstry JL, Bablani D, Appuswamy R, Modha DS (2020) Learned step size quantization
Hubara I, Courbariaux M, Soudry D, El-Yaniv R, Bengio Y (2017) Quantized neural networks: training neural networks with low precision weights and activations. J Mach Learn Res 18(1):6869–6898
Mourgias-Alexandris G, Moralis-Pegios M, Tsakyridis A, Passalis N, Kirtas M, Tefas A, Rutirawut T, Gardes F, Pleros N (2022) Channel response-aware photonic neural network accelerators for high-speed inference through bandwidth-limited optics. Opt Express 30(7):10664–10671
Shastri BJ, Tait AN, de Lima TF, Pernice WH, Bhaskaran H, Wright CD, Prucnal PR (2021) Photonics for artificial intelligence and neuromorphic computing. Nat Photonics 15(2):102–114
Nahmias MA, de Lima TF, Tait AN, Peng H-T, Shastri BJ, Prucnal PR (2020) Photonic multiply-accumulate operations for neural networks. IEEE J Sel Top Quantum Electron 26(1):1–18. https://doi.org/10.1109/JSTQE.2019.2941485
Park E, Ahn J, Yoo S (2017) Weighted-entropy-based quantization for deep neural networks. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR), pp 7197–7205. https://doi.org/10.1109/CVPR.2017.761
Courbariaux M, Bengio Y, David J-P (2014) Training deep neural networks with low precision multiplications. https://doi.org/10.48550/ARXIV.1412.7024. arXiv:1412.7024
Gholami A, Kim S, Dong Z, Yao Z, Mahoney MW, Keutzer K (2021) A survey of quantization methods for efficient neural network inference. https://doi.org/10.48550/ARXIV.2103.13630. arXiv:2103.13630
Murmann B (2021) Mixed-signal computing for deep neural network inference. IEEE Trans Very Large Scale Integr VLSI Syst 29(1):3–13. https://doi.org/10.1109/TVLSI.2020.3020286
Sarpeshkar R (1998) Analog versus digital: extrapolating from electronics to neurobiology. Neural Comput 10(7):1601–1638
Giamougiannis G, Tsakyridis A, Moralis-Pegios M, Totovic AR, Kirtas M, Passalis N, Tefas A, Lazovsky D, Pleros N (2023) Universal linear optics revisited: new perspectives for neuromorphic computing with silicon photonics. IEEE J Sel Top Quantum Electron 29(2):1–16. https://doi.org/10.1109/JSTQE.2022.3228318
Giamougiannis G, Tsakyridis A, Moralis-Pegios M, Pappas C, Kirtas M, Passalis N, Lazovsky D, Tefas A, Pleros N (2023) Analog nanophotonic computing going practical: silicon photonic deep learning engines for tiled optical matrix multiplication with dynamic precision. Nanophotonics. https://doi.org/10.1515/nanoph-2022-0423
Giamougiannis G, Tsakyridis A, Moralis-Pegios M, Pappas C, Kirtas M, Passalis N, Lazovsky D, Tefas A, Pleros N (2022) High-speed analog photonic computing with tiled matrix multiplication and dynamic precision capabilities for DNNs. In: 2022 European conference on optical communication (ECOC), pp 1–4
Mourgias-Alexandris G, Dabos G, Passalis N, Tefas A, Totovic A, Pleros N (2020) All-optical recurrent neural network with sigmoid activation function. In: Optical fiber communication conference (OFC) 2020, pp 3–5. Optica Publishing Group, https://doi.org/10.1364/OFC.2020.W3A.5. https://opg.optica.org/abstract.cfm?URI=OFC-2020-W3A.5
Mourgias-Alexandris G, Passalis N, Dabos G, Totović A, Tefas A, Pleros N (2021) A photonic recurrent neuron for time-series classification. J Lightwave Technol 39(5):1340–1347
Rosenbluth D, Kravtsov K, Fok MP, Prucnal PR (2009) A high performance photonic pulse processing device. Opt Express 17(25):22767–22772
Mourgias-Alexandris G, Tsakyridis A, Passalis N, Tefas A, Vyrsokinos K, Pleros N (2019) An all-optical neuron with sigmoid activation function. Opt Express 27(7):9620–9630
Pitris S, Mitsolidou C, Alexoudi T, Pérez-Galacho D, Vivien L, Baudot C, De Heyn P, Van Campenhout J, Marris-Morini D, Pleros N (2018) O-band energy-efficient broadcast-friendly interconnection scheme with SiPho Mach-Zehnder modulator (MZM) and arrayed waveguide grating router (AWGR). In: 2018 optical fiber communications conference and exposition (OFC), pp 1–3
Danial L, Wainstein N, Kraus S, Kvatinsky S (2018) Breaking through the speed-power-accuracy tradeoff in ADCs using a memristive neuromorphic architecture. IEEE Trans Emerg Top Comput Intell 2(5):396–409. https://doi.org/10.1109/TETCI.2018.2849109
Garg S, Lou J, Jain A, Nahmias M (2021) Dynamic precision analog computing for neural networks. https://doi.org/10.48550/ARXIV.2102.06365. arXiv:2102.06365
Wang K, Liu Z, Lin Y, Lin J, Han S (2018) HAQ: Hardware-aware automated quantization with mixed precision. https://doi.org/10.48550/ARXIV.1811.08886. arXiv:1811.08886
Tishby N, Zaslavsky N (2015) Deep learning and the information bottleneck principle. In: Proc. IEEE information theory workshop, pp 1–5
Srinivas M, Patnaik LM (1994) Genetic algorithms: a survey. Computer 27(6):17–26. https://doi.org/10.1109/2.294849
Pascanu R, Mikolov T, Bengio Y (2012) On the difficulty of training recurrent neural networks. https://doi.org/10.48550/ARXIV.1211.5063. arXiv:1211.5063
Nousi P et al (2019) Machine learning for forecasting mid-price movements using limit order book data. IEEE Access 7:64722–64736
LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324. https://doi.org/10.1109/5.726791
Krizhevsky A, Sutskever I, Hinton GE (2017) ImageNet classification with deep convolutional neural networks. Commun ACM 60(6):84–90. https://doi.org/10.1145/3065386
He K, Zhang X, Ren S, Sun J (2015) Deep residual learning for image recognition. https://doi.org/10.48550/ARXIV.1512.03385. arXiv:1512.03385
Loshchilov I, Hutter F (2019) Decoupled weight decay regularization. In: International conference on learning representations. https://openreview.net/forum?id=Bkg6RiCqY7
Müller R, Kornblith S, Hinton GE (2019) When does label smoothing help? In: Advances in neural information processing systems 32 (NeurIPS 2019). https://proceedings.neurips.cc/paper_files/paper/2019/hash/f1748d6b0fd9d439f71450117eba2725-Abstract.html
Ntakaris A, Magris M, Kanniainen J, Gabbouj M, Iosifidis A (2018) Benchmark dataset for mid-price forecasting of limit order book data with machine learning methods. J Forecast 37(8):852–866
Acknowledgements
This project has received funding from the European Union’s Horizon 2020 research and innovation program under Grant Agreement No 871391 (PlasmoniAC). This publication reflects the authors’ views only. The European Commission is not responsible for any use that may be made of the information it contains.
Funding
Open access funding provided by HEAL-Link Greece.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Kirtas, M., Passalis, N., Oikonomou, A. et al. Mixed-precision quantization-aware training for photonic neural networks. Neural Comput & Applic 35, 21361–21379 (2023). https://doi.org/10.1007/s00521-023-08848-8
DOI: https://doi.org/10.1007/s00521-023-08848-8