Photonic-aware neural networks

Photonics-based neural networks promise to outperform electronic counterparts, accelerating neural network computations while reducing power consumption and footprint. However, these solutions suffer from physical layer constraints arising from the underlying analog photonic hardware, impacting the resolution of computations (in terms of effective number of bits), requiring the use of positive-valued inputs, and imposing limitations in the fan-in and in the size of convolutional kernels. To abstract these constraints, in this paper we introduce the concept of Photonic-Aware Neural Network (PANN) architectures, i.e., deep neural network models aware of the photonic hardware constraints. Then, we devise PANN training schemes resorting to quantization strategies aimed to obtain the required neural network parameters in the ﬁxed-point domain, compliant with the limited resolution of the underlying hardware. We ﬁnally carry out extensive simulations exploiting PANNs in image classiﬁcation tasks on well-known datasets (MNIST, Fashion-MNIST, and Cifar-10) with varying bitwidths (i.e., 2, 4, and 6 bits). We consider two kernel sizes and two pooling schemes for each PANN model, exploiting 2 (cid:2) 2 and 3 (cid:2) 3 convolutional kernels, and max and average pooling, the latter more amenable to an optical implementation. 3 (cid:2) 3 kernels perform better than 2 (cid:2) 2 counterparts, while max and average pooling provide comparable results, with the latter performing better on MNIST and Cifar-10. The accuracy degradation due to the photonic hardware constraints is quite limited, especially on MNIST and Fashion-MNIST, demonstrating the feasibility of PANN approaches on computer vision tasks.


Introduction
The recent advances in Machine Learning (ML), and specifically in Deep Learning (DL), have undoubtedly driven the current success of artificial intelligence. ML and DL models are deployed in an ever increasing number of applications, ranging from computer vision [1,2] to speech recognition [3,4] to fraud detection [5] and many others [6].
DL relies on Deep Neural Networks (DNNs), i.e., structures whose basic element is vaguely inspired by biological neurons. Neural networks have been traditionally implemented in electronic platforms (both CPUs and GPUs) based on the von Neumann architecture [7]. The high degree of programmability and the advancements in terms of processing capabilities of digital electronic architectures of the last decades enabled the great achievements of DNNs [8]. However, with the end of Moore's law and the exponential increase of DNN model complexity (at a pace of doubling every 3.4 months [9]), electronics is failing to keep up with the processing speed and energy efficiency required for large-scale deployment of complex models [10,11].
Many research efforts are, therefore, investigating alternative hardware solutions for the acceleration of DNNs [12,13]. In this context, photonics-based solutions attracted a lot of interest with the promise of outperforming electronic counterparts in speed, power consumption, and computing density [14]. Over the last few years, several Photonic Neural Network (PNN) architectures have been proposed exploiting integrated, fiber or free-space optics with the aim of accelerating DNN inference with analog computations in the photonic domain [15]. In [14], the authors highlight the advantages of performing multiplyaccumulate (MAC) operations in photonics in terms of energy ( [ 10 2 ), speed ( [ 10 3 ), and computing density ( [ 10 2 ). Photonics has also the potential to outperform high-speed GPUs in performing convolutions with a lower power consumption [16].
Furthermore, the photonic implementation of the nonlinearities needed in the activation functions of neurons has been recently investigated [17,18], in some cases exploiting an electro-optic hardware platform [19,20].
In the context of PNNs, the need of software frameworks for simulating training and inference operations of photonic neuromorphic architectures has been soon recognized. For this purpose, specialized frameworks, such as neuroptica [21] and neurophox [22], have been developed. These tools aim to emulate and train PNNs based on Mach-Zehnder interferometer (MZI) meshes. In particular, neuroptica provides several levels of abstractions, from the direct control of the MZI phase shifters, to the training of stacked structures using a Keras-like API. On the other hand, neurophox allows to train these chips using the Haar random technique developed in [23]. These simulators are, therefore, powerful tools, albeit specifically aimed to develop photonic accelerators based on MZI meshes.
In this paper, we focus on the abstraction of the physical layer constraints arising from generic analog photonic hardware, namely the limited resolution of the computations (in terms of effective number of bits), the requirement to use positive-valued inputs, and the limitations in the fanin and in the size of convolutional kernels. Based on these constraints, (i) we introduce the concept of Photonic-Aware Neural Network (PANN) architectures, developing photonic-hardware compliant DNN models exploiting the Larq library [24]; (ii) we devise PANN training schemes resorting to quantization strategies aimed to obtain suited neural network parameters. The performance of PANNs is then assessed in a case study concerning image classification on well-kwown datasets, demonstrating the limited accuracy degradation due to the constraints coming from the use of photonic hardware.
The remainder of this paper is structured as follows: in Sect. 2, we review the main photonic architectures aimed at DNN inference acceleration, with a focus on integrated approaches. Section 3 firstly highlights the constraints arising from the adoption of photonic hardware. Afterward, a training-to-inference strategy is discussed, as well as the developed photonic-aware DNN models. Section 4 presents and discusses the image classification results obtained on the different datasets. Finally, Section 5 concludes the paper.
2 Photonics for neural networks: state of the art A recurrent research theme at the border of optics and computing fields is the concept of a photonics-based computer [25,26], aimed to overcome, through optics the speed, the energy consumption and the computing density limits imposed by electronics. While some notable results have been achieved at the beginning of the century [27][28][29], the interest in developing a digital optical computer based on logic gates has then faded. Despite the lower speed of electronics, the level of integration and energy efficiency enabled by CMOS process advancements overcame Photonic Integrated Circuits (PIC) in performing logic operations. However, in recent years, following the breakthrough of deep learning [30], a significant research effort was put to explore non-conventional architectures to reduce the hardware resources and the energy necessary to run DNNs [31][32][33]. In this scenario, photonics gained a renewed interest as an alternative platform to implement analog neuromorphic functionalities [14,15,34]. The goal of optical neuromorphic processors is not to replace digital computers, but to enable analog computations with high bandwidth, low latency, and high energy efficiency [35].
In this paper, we focus on photonic devices for accelerating deep learning inference. These photonic engines rely on the inherent parallelism and speed of optics to perform DNN computations (e.g., matrix-vector multiplications and pooling). As an example of the computing capacity enabled by photonic devices, 11 Tera-operations per second have been recently demonstrated using a fiberbased approach [36]. By pushing on integrated photonics solutions, optical processors promise also to significantly lower the power consumption for DNN computations, while increasing at the same time the footprint efficiency [37]. To unleash the potential of a drastic reduction in power consumption, several photonic solutions leverage passive components to implement weights, i.e., not requiring energy besides input generation and output acquisition. These solutions have the drawback of a limited speed at which weights can vary, in the hundreds or even tens of kHz range. Nevertheless, in weight-sharing architectures such as convolutional neural networks, this is not a heavy constraint as these parameters are slowly changing. In the following, we report a few photonic implementations of deep learning inference accelerators, with a focus on integrated solutions.
The coherent matrix multiplier depicted in Fig. 1a is a milestone in this field [38]. The silicon photonics-based device exploits a mesh of MZI to perform matrix multiplications on an input vector of coherent lightwaves. This is done by physically implementing the singular value decomposition theorem. In terms of DNN, this device implements a layer of fully connected (FC) neurons and thus, stacking several devices, a FC-DNN can be obtained. In the demonstration, however, the nonlinear activation function was emulated in software. Another experimentally validated coherent approach is shown in Fig. 1b, namely the optical linear algebra unit (OLAU). The OLAU basic element is composed of four MZIs in the form of a dual inphase and quadrature (IQ) modulator [39], typically used in optical communications, composed of two MZIs with a phase shifter at one MZI output, as highlighted in Fig. 1b. In the OLAU each dual IQ modulator implements two input-weight multiplications, whose results are sent to optical 3dB combiners to perform the accumulation. In this way, the OLAU carries out matrix-vector multiplications in a distributed manner and implements again an FC layer. Despite a great potential, the actual scalability of these devices is still limited due to impairments and losses [40].
Other solutions rely on Wavelength Division Multiplexing (WDM), i.e., the use of multiple channels routed on the same waveguides at different wavelengths. The strength of these architectures is that multiple wavelengths, and thus multiple inputs, can be broadcast to multiple neurons. In this context, the architecture exploiting a microring resonator (MRR) bank recently gained a lot of attention as several proof-of-concept PICs have been fabricated with this approach. These solutions exploit resonant structures, the MRRs, to selectively weigh the different wavelengths within the same waveguide, as exemplified in Fig. 1c. The result of several multiply-and-accumulate operations is encoded in the photocurrent of a balanced photodetector placed after the MRR bank. Thanks to the hybrid photonic-electronic approach, a nonlinearity can be applied to the photocurrent, modulating the photonic neuron output. An MRR-bank-based PIC has been recently used for compensating fiber nonlinearities in a long-haul transmission experiment [41]. On the other hand, in [42] the use of a Semiconductor Optical Amplifier (SOA)-based cross-connect to perform weighing in a WDM architectures is proposed, as sketched in Fig. 1d. The experimental demonstration has been carried out with an Indium Phosphide (InP)-based photonic integrated device working on the iris dataset and still exploiting digital electronic hardware for the nonlinear activation fuction. Nonetheless, the strength of this approach relies on the fact that SOAs can inherently compensate for optical losses and exhibit alloptical nonlinearities.
Focusing on serial architectures, Fig. 1e reports a photonic electronic multiply-accumulate neuron [43]. This architecture relies on an hybrid opto-electronic approach where multiplications are performed at high speed in the optical domain, while the accumulation is carried out by an analog electrical frontend. Two high-speed MZIs are employed to impress inputs and weights on an incoming lightwave. The modulated signals are received in a balanced photodetector whose photocurrent encodes the multiplication result. Successive results are accumulated as a charge onto a capacitance and ultimately read by an analog-to-digital converter with a nonlinear input-output characteristic, so that the nonlinear activation function is inherently applied.

Photonic-aware neural networks
In this section, we outline the process adopted to develop and train PANNs. In the first part, we describe the main constraints derived from the use of analog optical technologies. In the second part, we report the solutions adopted to translate hardware limits in software with a focus on the DNN training-to-inference strategy. Finally, we present the developed DNN models and the computer vision datasets used to assess their performance. We call the devised DNNs PANN Architectures and the method for obtaining the DNN parameters PANN Training.

Limitations due to the photonic hardware
As surveyed in Sect. 2, photonic engines are analog processors that perform matrix-vector multiplications at high speed and low power consumption, while exploiting integrated photonic solutions to reach unprecedented computing densities [14].  [14,42,43,45]).
An additional limitation imposed by photonic hardware concerns the inputs of the PANN architecture. Indeed, in most configurations inputs are required to be positive-valued, as they are encoded in the intensity of optical signals. This impacts the normalization function of the input data, as well as the activation function used in each neuron that should have positive-only outputs, such as the ReLU [46] or the Sigmoid [47]. Another constraint on the input side of photonic neurons is related to the maximum number of e Hybrid photonic-electronic multiply-accumulate neuron inputs (i.e., fan-in) to each neuron. Different photonic architectures are characterized by different constraints, as for instance MZI meshes can perform block operations at the expense of several electro-optical conversions, while in [43] the constraint arises from the electronic accumulation phase: the maximum number of inputs is about 200 for these implementations.
A final limitation due to the underlying hardware concerns the maximum kernel size in convolutional layers. Given the current experimentally validated photonic convolutional kernel implementations [40][41][42], the maximum kernel size is equal to 3 Â 3 elements.
The main constraints imposed by the photonic hardware are summarized in Table 1.

Photonic-aware neural network computations
As pointed out in Sect. 3.1, noise and distortions limit the resolution of analog computations, allowing to distinguish a limited set of equally spaced-levels, leading to a very coarse bit resolution (i.e., 6 bits). For this reason, the floating point type, typically used for classical neural network computation, cannot be exploited to emulate photonic hardware.
To overcome this issue, in this paper, we propose to exploit reduced-precision fixed-point type for PANN inference and present a suited training approach. Indeed, the direct quantization of the weights computed using floats typically significantly reduces the accuracy of the obtained DNN [48,49]. Dealing with this aspect, many approaches in the literature allow to use equally spaced types in neural networks [50][51][52][53]. All these strategies perform the training phase using the floating point type, however they take into account that the inference phase will be carried out using low bitwidth equally spaced types, consequently adjusting the weights. In particular, Rastegari et al. [51], Courbariaux and Bengio [52] implement binary operations in neural networks to achieve better performance at the expense of a very coarse granularity of inputs and weights, while [53] extends those works to equally spaced types with arbitrary precision by introducing a bitwidth-dependent quantization function.
We now focus on training DNNs taking into account the resolution constraints from the photonic hardware. The approach presented in this paper aims to emulate the underlying photonic architecture, satisfying the bitwidth constraint (i.e., the ENOB) by exploiting quantized weights in the fixed-point domain.
The behavior of the quantization process carried out during the training phase is reported in Fig. 2. The input quantizer and the kernel quantizer describe the way of quantizing the incoming inputs and weights, respectively. A quantized layer computes the activation y as: y ¼ rðf ðq kernel ðwÞ; q input ðxÞÞ þ bÞ with full precision weights w, arbitrary precision input x, layer operation f, output r, and bias b.
A very important aspect is the operation of the kernel quantizer q kernel ðwÞ during the training phase. The concept of latent weights is introduced aside quantized weights to be used by the photonic hardware [54]. Basically, latent weights are a full precision copy of the weights used throughout the training process. Their purpose is to accumulate the small changes from the gradients without loss of precision. During the forward pass, a quantized version of these weights is used. When the vector of updates for the weights is obtained, the latent weights are updated instead of the quantized weights. Once the model is trained, only the quantized weights are used for inference.
Besides weights, we need also to consider the derivation of the quantization function during back-propagation. Being discrete-valued, its derivative is 0 almost everywhere: thus, its gradient would stop the network from learning. To overcome this problem, the pseudo-gradient method is used for optimizing weights during back-propagation [55]. One of the most commonly used pseudogradients is the Straight-Through Estimator (STE), defined as: Essentially, it ignores the derivative of the quantization function, and the incoming gradient is evaluated as if the function was a clipped identity function. Regarding the inputs, they are quantized on the same number of bits used for weights. We implement the proposed PANN training-to-inference strategy by relying on the DoReFa quantizers implemented in Larq [24,53], since it provides a convenient way to flexibly define the bitwdiths on both inputs and weights.

Datasets and models
After introducing the photonic constraints and the quantization strategies, we focus on the datasets and models used in the experiments. In particular, three common benchmark datasets are used: • MNIST [56]: it is a dataset composed of 28 Â 28 8-bit grayscale images containing handwritten digits from 0 to 9. The training set is made of 60, 000 images, while the test set is composed of 10, 000 samples. All the input images for the three datasets, already satisfying the photonic constraint regarding positive-valued inputs, have been pre-processed to normalize all samples to the interval [0, 1]. This, however, leads to input values in floating-point domain (i.e., 32-bit numbers), not satisfying the photonic hardware constraints. For this reason, the developed PANN architectures include an input quantizer in the first layer. This differs from traditional quantization approaches, that typically keep the inputs in full-precision [53]. Regarding the models, two versions for each neural network model have been developed, exploiting 2 Â 2 and 3 Â 3 convolutional kernels, respectively. Specifically, the models used for MNIST and Fashion-MNIST are reported in Fig. 3. Instead the models used in Cifar-10 are slightly deeper, reported in Fig. 4. In all models, the ReLU has been exploited as the activation function, which ensures that the next layer inputs are positive. Moreover, a SoftMax layer has been employed as a final layer to carry out the classification.
In detail, the MNIST/Fashion-MNIST models have been developed based on the LeNet-5 architecture [59], with an increased depth to counteract the lower precision [60]. Four convolutional layers have been devised, each of them followed by a batch normalization (BN) layer. These layers are used to accelerate the training by reducing internal covariate shift [61]. Furthermore, a pooling layer is added after the first and second convolutional layers. Both an average and a max pooling have been tested. It is worth noting that the average pooling is of particular interest in the context of PNNs, since it is a linear operation, and hence, it can be easily implemented in photonics [62]. At the end of the structures, there are two fully connected (FC) layers, composed of 100 and 10 neurons, respectively.
The developed models are compliant with the fan-in limitation of 200: this is particularly critical in the flattening operation, i.e., when passing from a convolutional layer to an FC one. In the 2 Â 2 kernel model, the last convolutional layer has 12 feature maps of size 4 Â 4. This results in 192 features, compliant with the fan-in for the subsequent FC layer. Similarly, with 3 Â 3 kernel, 32 oneelement feature maps are obtained after the last convolutional layer. In this way, these two models are photoniccompliant and can be used on the MNIST and Fashion-MNIST datasets.
Concerning the Cifar-10 dataset, the same strategy has been used: two models have been defined, one for each kernel size. The developed architectures are derived from the Binary Nets [63], with some modifications to satisfy photonic constraints. The 2 Â 2 model exploits 5 convolutional layers, each one followed by a ReLU activation function and a BN layer. After the second, third, and fourth convolutional layer, pooling layers are employed. The third pooling layer allows the developed model to further reduce the number of features aimed to comply with the fan-in constraint. After flattening, three FC layers have been used, composed of 200, 100, and 10 neurons, respectively. The 3 Â 3 model is very similar to the 2 Â 2 model with one except for the absence of the third pooling layer: in this case, the fan-in constraint is satisfied due to the larger kernel size that shrinks down more the number of features.
Regarding the training phase, all the models have been trained using Adam as optimizer [64]. In details, the MNIST/Fashion-MNIST models have been trained for 30

PANN results
In this section, we discuss the results obtained by the models suited for photonic implementation on the three benchmark datasets i.e., MNIST, Fashion-MNIST, and Cifar-10. The experiments have been conducted using DoReFa with a varying number of bit resolution (i.e., 2, 4, and 6 bits) on two model versions, with 2 Â 2 and 3 Â 3 kernels. Additionally, for each model we have considered two pooling variants, exploiting either max pooling or average pooling.
To make a fair comparison, PANN architectures (blue line in the following figures) have been compared with two baselines: (i) float, exploiting a 32-bit floating-point architecture (black line), and (ii) PANN w/ float input where just the input of the first layer is not quantized (orange line).

MNIST
The accuracy of the models with max pooling on MNIST dataset is shown in Fig. 5 as a function of the bit resolution of the quantized parameters.
The figure shows a limited accuracy degradation when exploiting PANN models. Indeed, in the worst case (i.e., 2 Â 2 kernel with 2 bits) an accuracy drop of about 1.3% can be observed with respect to the float baseline. Moreover, it can be noticed that the input quantization itself has a certain impact on the accuracy, especially at low bitwidths.
The results obtained on MNIST dataset using average pooling are reported in Fig. 6. Both 2 Â 2 and 3 Â 3 kernel sizes show very good performance. In the case of 3 Â 3 kernel using 6 bits, the accuracy degradation with respect to the baseline model is very low (i.e., 0.26%), showing that the photonic hardware can approach the performance of the electronic counterpart. Comparing the pooling options, slightly better results are obtained with the average pooling, even if the difference are very small (0.15% on average).

Fashion-MNIST
The results obtained by PANN models on the Fashion-MNIST dataset using max pooling are reported in Fig. 7. The performance degradation with respect to the float baseline is slightly increased due to the more complex structure of the dataset. When using 2 bits, the accuracy drop is 5.8% (5.4%) with the 2 Â 2 (3 Â 3) kernel. However, by exploiting a higher number of bits, the difference between the photonic models and the floating-point architectures can be significantly reduced, up to 1.1% using six bits in both kernel configurations. Considering the PANN with floating point input, the input quantization reduces the accuracy up to 1.3%, however this impact can be made negligible by increasing the bitwidth, as shown in the 6-bit configurations.
The results with average pooling are shown in Fig. 8. The overall behavior is similar to max pooling: indeed even in this case the 3 Â 3 kernel models perform better than 2 Â 2 models. The PANN accuracy decreases with respect to max pooling except for the 3 Â 3 on two bits. The performance drop with respect to the float baseline is 2.1% in the best case, i.e., 3 Â 3 kernel on six bits.

Cifar-10
The accuracy of PANN models on Cifar-10 dataset using max pooling is shown in Fig. 9. In this case, the gap between the PANN models and the floating point baseline is higher, about 8% in the best case (i.e., 3 Â 3 kernel on six bits). When using just two bits, the accuracy drop reaches 20% (18.2%) for the 2 Â 2 (3 Â 3) kernel models, with an 8% drop caused by the input quantization. This is due to the complexity of the Cifar-10 dataset, which is composed of RGB images (i.e., 3-channel images) with more features compared to the MNIST/Fashion-MNIST datasets. Again, the input quantization issue can be solved by using higher bitwidths, indeed no impact can be observed when operating at a bitwidth of 6.
The performance achieved by the average pooling on Cifar-10 is reported in Fig. 10.
In this scenario, the accuracy is slightly higher compared to the max pooling, reaching a drop of 6.4% in the best case (3 Â 3 kernel size on 6 bits). Thus, the average pooling allows to decrease the gap with both the baselines (i.e., also reducing the impact of the input quantization).

Conclusion
The breakthrough of deep learning has recently led to a renewed interest on photonics for computing. Several optical architectures for deep learning have been investigated, focusing on integrated approaches. Such photonic accelerators promise to bring substantial improvements in terms of speed up and power consumption and footprint reduction for DNN inference. Exploiting these analog processors comes at the cost of some limitations: computations are carried out on analog signals with limited resolution ( 6 bits with equally spaced levels) with positive-valued inputs, the fan-in to photonic neurons is limited to a couple of hundreds inputs, and convolutional kernels are demonstrated up to a 3 Â 3 size. These limitations mainly derive from the lack of mature photonic technologies, leading to undesired losses and impairments. While in the foreseeable future some of these aspects are likely to be improved, the exploitation of photonic hardware for DNN acceleration in the shortmedium term requires AI models that are compliant with current technology.
In this paper, we, therefore, introduced the concept of PANN architectures, i.e., DNN models compliant with the constraints imposed by the photonic hardware. Moreover, we devised a quantization-based PANN training-to-inference scheme to obtain neural network weights in the fixedpoint domain suited for the underlying photonic architecture.
The performance of these DNN models has been then assessed in computer vision tasks. During our experiments, we considered two kernel sizes, namely 2 Â 2 and 3 Â 3, and two pooling schemes, i.e., max pooling model and average pooling. The impact of the different bitwidths (2, 4, and 6 bits) on the accuracy has been reported and discussed. Thanks to the higher number of parameters and the higher computational precision, models with larger kernel size and higher bitwidths achieve higher accuracy. Indeed in all three datasets, the highest accuracy is reached by the 3 Â 3 kernel-sized model with six bits. Specifically, in the best case we are able to reach an accuracy of 99.2% on MNIST, 90.2% on Fashion-MNIST, and 75.4% on Cifar-10. When compared to the floating point baseline the PANN architectures suffer a very limited drop in accuracy on MNIST and Fashion-MNIST (i.e., 0.3% and 1.1% in the best case, respectively) and slightly higher on Cifar-10 (i.e., 6.5%). The impact of input quantization is negligible for all tested configurations, unless just two bits are exploited on Cifar-10 dataset. Moreover, max and average pooling schemes achieve similar results, with the latter Fig. 9 Results on Cifar-10 dataset with max pooling Fig. 10 Results on Cifar-10 dataset with average pooling even outperforming the former in all configurations of both MNIST and Cifar-10, thus enabling the use of all-optical average pooling, which can be easily realized with passive devices.
These results show the feasibility of DNN operations using photonic hardware. However, further development and investigations are required to improve the scalability of the underlying photonic hardware in terms of bit resolution and trade-off with weight update frequency, neuron fan-in and kernel size [65].
Funding Open access funding provided by CRUI-CARE.

Declarations
Conflict of interests All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visithttp://creativecommons. org/licenses/by/4.0/.