Operational Neural Networks

Serkan Kiranyaz, Turker Ince, Alexandros Iosifidis and Moncef Gabbouj

Abstract—Feed-forward, fully connected artificial neural networks, or the so-called multi-layer perceptrons, are well-known universal approximators. However, their learning performance varies significantly depending on the function or the solution space that they attempt to approximate. This is mainly because of their homogeneous configuration based solely on the linear neuron model. Therefore, while they can learn problems with a monotonous, relatively simple and linearly separable solution space very well, they may entirely fail to do so when the solution space is highly nonlinear and complex. The same is true for conventional convolutional neural networks (CNNs), which share the same linear neuron model with two additional constraints (local connections and weight sharing); it is therefore not surprising that in many challenging problems only deep CNNs with massive complexity and depth can achieve the required diversity and learning performance. In order to address this drawback and also to accomplish a more generalized model than the convolutional neurons, this study proposes a novel network model, called Operational Neural Networks (ONNs), which can be heterogeneous and encapsulate neurons with any set of operators to boost diversity and to learn highly complex and multi-modal functions or spaces with minimal network complexity and training data. Finally, the training method to back-propagate the error through the operational layers of ONNs is formulated. Experimental results over highly challenging problems demonstrate the superior learning capabilities of ONNs even with few neurons and hidden layers.


I. INTRODUCTION
The conventional fully-connected and feed-forward neural networks, such as Multi-Layer Perceptrons (MLPs) and Radial Basis Functions (RBFs), are universal approximators. Such networks, optimized by iterative processes [1], [2], or even formed by random architectures and solving a closed-form optimization problem for the output weights [3], can approximate any continuous function, provided that the employed neural units (i.e., the neurons) are capable of performing nonlinear, piecewise-continuous mappings of the received signals and that the capacity of the network (i.e., the number of layers and neurons) is sufficiently high. The standard approach in using such traditional neural networks is to manually define the network's architecture (i.e., the number of neural layers and the size of each layer) and to use the same activation function for all neurons of the network.
While there has recently been a lot of activity in searching for good network architectures based on the data at hand, either progressively [4], [5] or by following extremely laborious search strategies [6]-[10], the resulting network architectures may still exhibit varying or entirely unsatisfactory performance levels, especially when faced with highly complex and nonlinear problems. This is mainly due to the fact that all such traditional neural networks employ a homogeneous network structure consisting of only a crude model of the biological neuron. This neuron model is capable of performing only a linear transformation (i.e., a linear weighted sum) [12], while biological neurons or neural systems in general are built from a large diversity of neuron types with heterogeneous, varying structural, biochemical and electrophysiological properties [13]-[18]. For instance, in the mammalian retina there are roughly 55 different types of neurons performing low-level visual sensing [16]. Therefore, while these homogeneous neural networks are able to approximate the responses of the training samples, they may not learn the actual underlying functional form of the mapping between the inputs and the outputs of the problem.
There have been some attempts in the literature to modify MLPs by changing the neuron model and/or the conventional BP algorithm [19]-[21], or the parameter updates [22], [23]; however, their performance improvements were not significant in general, since such approaches still inherit the main drawback of MLPs, i.e., a homogeneous network configuration with the same (linear) neuron model. Extensions of MLP networks, particularly for end-to-end learning of 2D (visual) signals, i.e., Convolutional Neural Networks (CNNs), and of time-series data, i.e., Recurrent Neural Networks (RNNs) and Long Short-Term Memories (LSTMs), naturally inherit the same limitations originating from the traditional neuron model. In biological learning systems, the limitations mentioned above are addressed at the neuron cell level [24]. In the mammalian brain and nervous system, each neuron (Figure 1) conducts the electrical signal over three distinct operations: 1) synaptic connections in the Dendrites: an individual operation over each input signal coming from the synapse connection of the input neuron's axon terminals; 2) a pooling operation of the operated input signals via spatial and temporal signal integration in the Soma; and finally, 3) an activation in the initial section of the Axon, the so-called Axon hillock: if the pooled potentials exceed a certain limit, it "activates" a series of pulses (called action potentials). As shown on the right side of Figure 1, each terminal button is connected to other neurons across a small gap called a synapse. The physical and neurochemical characteristics of each synapse determine the signal operation, which is nonlinear in general [25], [26], along with the signal strength and polarity of the new input signal. Information storage or processing is concentrated in the cells' synaptic connections, or more precisely, in certain operations of these connections together with the connection strengths (weights) [25]. Accordingly, in neurological systems, several distinct operations with proper weights (parameters) are created to accomplish such diversity and are trained in time to perform, or "to learn", many neural functions. Biological neural networks with a higher diversity of computational operators have more computational power [28], and it is a fact that adding more neural diversity allows the network size and total connections to be reduced [24]. Motivated by these biological foundations, a novel feed-forward and fully-connected neural network model, called Generalized Operational Perceptrons (GOPs) [32], [33], has recently been proposed to accurately model the actual biological neuron with varying synaptic connections. With this heterogeneous configuration, a superior diversity resembling that of biological neurons and neural networks has been accomplished. More specifically, the diverse set of neurochemical operations in biological neurons (the non-linear synaptic connections plus the integration process occurring in the soma of a biological neuron) has been modelled by the corresponding "Nodal" (synaptic connection) and "Pool" (integration in soma) operators, whilst the "Activation" operator has been adopted directly. A comparison between the traditional perceptron neuron of MLPs and the GOP neuron model is illustrated in Figure 2.
Based on the fact that actual learning occurs in the synaptic connections, with non-linear operators in general, the all-time-fixed linear model of MLPs can now be generalized by GOP neurons, which allow any (blend of) non-linear transformations to be used for defining the input signal transformations at the neuron level. Since the GOP neuron is naturally a superset of the linear perceptron (MLP neuron), GOPs provide an opportunity to better encode the input signal using linear and non-linear fusion schemes and thus lead to more compact neural network architectures achieving highly superior performance levels; e.g., the studies [32] and [33] have shown that GOPs can achieve elegant performance levels on many challenging problems where MLPs entirely fail to learn, such as "Two-Spirals", "N-bit Parity" for N>10, "White Noise Regression", etc. Being the superset, a GOP network falls back to a conventional MLP only when the learning process that defines the neurons' operators indicates that the native MLP operators should be used for the learning problem at hand. In this study, a novel neuron model is presented for the purpose of generalizing the linear neuron model of conventional CNNs with any non-linear operator. As an extension of the perceptron, the neurons of a CNN perform the same linear transformation (i.e., the linear convolution, or equivalently, the linear weighted sum) as perceptrons do, and it is, therefore, not surprising that in many challenging problems only deep CNNs with a massive complexity and depth can achieve the required diversity and learning performance. The main objective of this study is to propose a novel network model, the "Operational Neural Network (ONN)", based on this new neuron model. A novel training method is then formulated to back-propagate the error through the operational layers of ONNs. We shall show that, with the right operator set, ONNs even with a shallow and compact configuration and under severe restrictions (i.e., scarce and low-resolution training data, shallow training, a limited operator library, etc.) can achieve an elegant learning performance on challenging visual problems (e.g., image denoising, synthesis, transformation and segmentation) that defy conventional CNNs with the same or even higher network complexities. In order to perform an unbiased evaluation and a direct comparison between the convolutional and operational neurons/layers, we avoid using fully connected layers in both network types. This is a standard practice used by many state-of-the-art CNN topologies today.

Figure 2: The linear (MLP) neuron model (left) vs. the operational (GOP) neuron model (right).
The rest of the paper is organized as follows: In Section II, a brief review of GOPs is presented in order to highlight the motivation for a heterogeneous and nonlinear network. The formulations and BP training of GOPs are detailed in Appendix A. Based on the philosophy and foundations revealed in this section, Section III then presents the proposed ONNs, where the formulations and the implementation details of the novel BP methodology are presented in Appendices B and C. Section IV presents a rich set of experiments to perform comparative evaluations between the learning performances of ONNs and CNNs over four challenging problems. A detailed computational complexity analysis between the two network types is then presented in Appendix D. Finally, Section V concludes the paper and suggests topics for future research.

A. Generalized Operational Perceptrons (GOPs)
GOPs are the reference point for the proposed ONNs as they share the main philosophy: generalizing the conventional homogeneous network, which uses a fixed linear neuron model, into a heterogeneous network with an "operational" neuron model that can encapsulate any set of (linear or non-linear) operators. As illustrated in Figure 2, the conventional feed-forward and fully-connected ANNs, or the so-called Multi-Layer Perceptrons (MLPs), have the following linear neuron model:

$$x_k^{l+1} = b_k^{l+1} + \sum_{i=1}^{N_l} w_{ik}^{l+1}\, y_i^{l} \quad (1)$$

This means that the output of each previous-layer neuron, $y_i^{l}$, contributes to the inputs of all neurons in the next layer, l+1. Then a nonlinear (or piece-wise linear) activation function is applied to all the neurons of layer l+1 in an element-wise manner. In a GOP neuron, the linear model of Eq. (1) and the activation function are replaced by a set of three operators: a nodal operator, $\Psi_k^{l+1}$, a pool operator, $P_k^{l+1}$, and finally the activation operator, $f_k^{l+1}$. The nodal operator models a synaptic connection with a certain neurochemical operation. The pool operator models the integration (or fusion) operation performed in the Soma and, finally, the activation operator encapsulates any activation function. Therefore, the output of a previous-layer neuron, $y_i^{l}$, still contributes to the inputs of all neurons in the next layer, but now through the individual operator set of each neuron, i.e.,

$$x_k^{l+1} = b_k^{l+1} + P_k^{l+1}\!\left(\Psi_k^{l+1}\!\left(w_{1k}^{l+1}, y_1^{l}\right), \ldots, \Psi_k^{l+1}\!\left(w_{N_l k}^{l+1}, y_{N_l}^{l}\right)\right), \quad \forall k \in \{1, \ldots, N_{l+1}\} \quad (2)$$

A comparison of Eq. (2) with Eq. (1) reveals that when $\Psi_k^{l+1}(w, y) = w\,y$ and $P_k^{l+1} = \Sigma$, the GOP neuron becomes identical to an MLP neuron. However, in this relaxed model the neurons can take any proper nodal, pool and activation operator so as to maximize the learning capability. For instance, the nodal operator library can be composed of: multiplication, exponential, harmonic (sinusoid), quadratic function, Gaussian, Derivative of Gaussian (DoG), Laplacian of Gaussian (LoG), Hermitian, etc. Similarly, the pool operator library can include: summation, n-correlation, maximum, median, etc. Typical activation functions suited to classification problems can be combined within the activation operator library, composed of, e.g., tanh, linear, lin-cut, binary, etc. As in a conventional MLP neuron, the i-th GOP neuron at layer l+1 has connection weights to each neuron in the previous layer, l; however, each weight is now an internal parameter of its nodal operator, $\Psi$, not necessarily a scalar multiplier of the output.
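To make the operator abstraction concrete, the following minimal NumPy sketch implements the forward pass of Eq. (2) for a single neuron; the operator choices and all function names here are illustrative assumptions, not the authors' implementation. With multiplication as the nodal operator and summation as the pool operator it reduces to the ordinary perceptron of Eq. (1).

```python
import numpy as np

def gop_neuron_forward(y_prev, w, b, nodal, pool, act):
    """Toy forward pass of a single GOP neuron (Eq. 2).

    y_prev : (N_l,) outputs of the previous layer
    w      : (N_l,) one nodal-operator parameter per incoming connection
    nodal  : element-wise synaptic operator Psi(w_i, y_i)
    pool   : reduction over all nodal outputs (models the soma)
    act    : activation operator f(.)
    """
    z = nodal(w, y_prev)          # one value per incoming connection
    x = pool(z) + b               # integrate and add the bias
    return act(x)

# Example operator choices (illustrative):
mlp_like = dict(nodal=lambda w, y: w * y,         pool=np.sum,    act=np.tanh)
sinusoid = dict(nodal=lambda w, y: np.sin(w * y), pool=np.median, act=np.tanh)

y_prev = np.array([0.2, -0.5, 0.9])
w      = np.array([0.7,  1.3, -0.4])
print(gop_neuron_forward(y_prev, w, 0.1, **mlp_like))   # identical to a perceptron
print(gop_neuron_forward(y_prev, w, 0.1, **sinusoid))   # a non-linear GOP variant
```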

B. Training with Back-Propagation
The conventional Back-Propagation (BP) training consists of one forward-propagation (FP) pass to compute the error at the output layer, followed by an error back-propagation pass starting from the output layer back to the 1st hidden layer, in order to calculate the individual weight and bias sensitivities in each neuron. The most common error metric is the Mean Square Error (MSE) at the output layer, which can be expressed as:

$$E = \mathrm{MSE}\!\left(y_1^{L}, \ldots, y_{N_L}^{L}\right) = \frac{1}{N_L}\sum_{i=1}^{N_L}\left(y_i^{L} - t_i\right)^2 \quad (3)$$
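As a quick illustration, the output-layer MSE of Eq. (3) can be computed as follows (variable names are illustrative):

```python
import numpy as np

def output_mse(y_out, target):
    """Mean square error at the output layer, E = (1/N_L) * sum (y_i - t_i)^2."""
    y_out, target = np.asarray(y_out), np.asarray(target)
    return np.mean((y_out - target) ** 2)

print(output_mse([0.2, -0.1, 0.7], [0.0, 0.0, 1.0]))  # 0.0467 (approx.)
```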
Due to space limitations, the formulations of BP for GOPs vs. MLPs are detailed in Appendix A.

III. OPERATIONAL NEURAL NETWORKS
The convolutional layers of conventional 2D CNNs share the same neuron model as MLPs with two additional restrictions: limited connections and weight sharing. Without these restrictions, every data point (or pixel) of a feature map in layer l would be connected to every pixel of a feature map in layer l-1, and this would create an unfeasibly large number of connections and weights that cannot be optimized efficiently. Instead, with these two constraints, a pixel in the current layer is connected only to the corresponding neighboring pixels in the previous layer (limited connections), where the number of connections is determined by the size of the kernel (filter). Moreover, the connection weights of the kernel are shared for each pixel-to-pixel connection (weight sharing). With these restrictions, the linear weighted sum expressed in Eq. (1) for MLPs turns into the convolution formula used in CNNs. This is also evident in the illustration in Figure 3 (left), where three consecutive convolutional layers without the sub-sampling (pooling) layers are shown. So, the input map of the next-layer neuron, $x_k^{l+1}$, is obtained by cumulating the final output maps, $y_i^{l}$, of the previous-layer neurons convolved with their individual kernels, $w_{ik}^{l+1}$, as follows:

$$x_k^{l+1} = b_k^{l+1} + \sum_{i=1}^{N_l} \mathrm{conv2D}\!\left(w_{ik}^{l+1},\, y_i^{l}\right) \quad (4)$$

ONNs share the essential idea of GOPs and extend the linear convolutions in the convolutional layers of CNNs with the nodal and pool operators. In this way, the operational layers and neurons constitute the backbone of an ONN, while other properties such as weight sharing and limited (kernel-wise) connectivity are common with a CNN. Three consecutive operational layers and the k-th neuron of a sample ONN with 3x3 kernels and M=N=22 input map sizes in the previous layer are shown in Figure 3 (right). The input map of the k-th neuron at the current layer, $x_k^{l+1}$, is obtained by pooling the final output maps, $y_i^{l}$, of the previous-layer neurons operated with the corresponding kernels, $w_{ik}^{l+1}$, as follows:

$$x_k^{l+1}(m,n) = b_k^{l+1} + \sum_{i=1}^{N_l} P_k^{l+1}\!\left[\Psi_k^{l+1}\!\left(w_{ik}^{l+1}(r,t),\, y_i^{l}(m+r,\, n+t)\right)\right]_{(r,t)=(0,0)}^{(K-1,K-1)} \quad (5)$$

A direct comparison between Eqs. (4) and (5) reveals that when the pool operator is the summation, $P_k^{l+1} = \Sigma$, and the nodal operator is the multiplication, $\Psi_k^{l+1}\!\left(w_{ik}^{l+1}(r,t), y_i^{l}(m+r,n+t)\right) = w_{ik}^{l+1}(r,t)\, y_i^{l}(m+r,n+t)$, for all neurons of an ONN, then the resulting homogeneous ONN becomes identical to a CNN. Therefore, just as GOPs are a superset of MLPs, ONNs are a superset of CNNs.
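The pixel-wise operation of Eq. (5) can be sketched in NumPy as below for a single input map and kernel; `oper2d` and the operator lambdas are illustrative names, and padding/stride handling is simplified. Choosing multiplication and summation recovers the 2D correlation underlying Eq. (4).

```python
import numpy as np

def oper2d(y_prev, kernel, nodal, pool):
    """Pixel-wise operational 'convolution' of one input map with one kernel.

    y_prev : (M, N) output map of a previous-layer neuron
    kernel : (K, K) operator parameters (the shared 'weights')
    nodal  : Psi(w, y) applied element-wise over the K x K neighbourhood
    pool   : reduction of the K*K nodal outputs to a single pixel value
    """
    K = kernel.shape[0]
    M, N = y_prev.shape
    pad = K // 2
    y_pad = np.pad(y_prev, pad, mode="constant")
    out = np.empty((M, N))
    for m in range(M):
        for n in range(N):
            patch = y_pad[m:m + K, n:n + K]
            out[m, n] = pool(nodal(kernel, patch))
    return out

# The full layer input x_k would sum oper2d over all previous-layer maps plus a bias.
y = np.random.randn(8, 8)
w = np.random.randn(3, 3)
conv_like = oper2d(y, w, nodal=lambda w, y: w * y,         pool=np.sum)  # CNN special case
onn_like  = oper2d(y, w, nodal=lambda w, y: np.sin(w * y), pool=np.sum)  # a non-linear variant
```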


A. Training with Back-Propagation
The formulation of BP training consists of four distinct phases: 1) computation of the delta error, Δ, at the output layer; 2) inter-BP between two operational layers; 3) intra-BP within an operational neuron; and 4) computation of the weight (operator kernel) and bias sensitivities in order to update them at each BP iteration. Phase 3 also takes care of up- or down-sampling (pooling) operations whenever they are applied in the neuron. The detailed derivation of the BP is given in Appendix B.

B. Implementation
To bring an ONN to run-time functionality, both FP and BP operations should be properly implemented based on the four phases detailed earlier. Then the optimal operator set per neuron in the network can be searched for by short BP training sessions with potential operator set assignments. Finally, the ONN with the best operators can be trained over the train dataset of the problem. All implementation details are provided in Appendix C.
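A rough sketch of such a greedy, layer-wise search is given below, under several assumptions: the 28-set library is enumerated in one ordering that is consistent with the operator indices quoted later in the text (e.g., 3:{0,0,3}, 9:{0,1,2}, 13:{0,1,6}); only mul, sin, exp and chirp are nodal operators explicitly named in this paper, so the remaining names are placeholders; and `short_bp_mse` is a stub standing in for a short BP training session.

```python
import random

# One indexing of the 28-set library consistent with the indices quoted in the text.
NODAL = ["mul", "nodal1", "sin", "exp", "nodal4", "nodal5", "chirp"]  # placeholders except mul/sin/exp/chirp
POOL = ["sum", "median"]
ACT = ["tanh", "lin-cut"]
LIBRARY = [(p, a, n) for p in POOL for a in ACT for n in NODAL]       # 2*2*7 = 28 operator sets

def short_bp_mse(layer_ops):
    """Stub for a short BP training session returning its best MSE.

    A real implementation would build an ONN whose l-th hidden layer uses the
    operator set layer_ops[l], train it briefly (e.g., 80 iterations), and
    return the minimum MSE over a few random initializations.
    """
    random.seed(str(layer_ops))
    return random.random()

def greedy_layerwise_search(n_hidden_layers=2, n_passes=2):
    """GIS-style search: optimize one layer at a time while the others stay fixed."""
    assignment = [LIBRARY[0]] * n_hidden_layers   # start from the CNN-like set 0:{0,0,0}
    for _ in range(n_passes):
        for layer in range(n_hidden_layers):
            best = min(LIBRARY, key=lambda ops: short_bp_mse(
                assignment[:layer] + [ops] + assignment[layer + 1:]))
            assignment[layer] = best
    return assignment

print(greedy_layerwise_search())
```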

IV. EXPERIMENTAL RESULTS
In this section we perform comparative evaluations between conventional CNNs and ONNs over four challenging problems: 1) Image Synthesis, 2) Denoising, 3) Face Segmentation, and 4) Image Transformation. In order to better demonstrate the learning capabilities of the ONNs, we further impose the following restrictions:
i) Low Resolution: we keep the image resolution very low, e.g., thumbnail size (60x60 pixels), which makes especially the pattern recognition tasks (e.g., face segmentation) even harder.
ii) Compact Model: we keep the ONN configuration compact, e.g., only two hidden layers with fewer than 50 hidden neurons, i.e., Inx16x32xOut. Moreover, we keep the output layer as a convolutional layer whilst optimizing only the two hidden layers by GIS.
iii) Scarce Train Data: for the two problems with train and test datasets (image denoising and segmentation), we train the network over limited data (only 10% of the dataset) while testing over the rest with 10-fold cross validation.
iv) Multiple Regressions: for the two regression problems (image synthesis and transformation), a single network is trained to regress multiple (e.g., 4-8) images.
v) Shallow Training: the maximum number of iterations (iterMax) for BP training is kept low (e.g., max. 80 and 240 iterations for GIS and regular BP sessions, respectively).
For a fair evaluation, we shall first apply the same restrictions to the CNNs; however, we shall then relax them to find out whether CNNs can achieve the same learning performance level with, e.g., a more complex configuration and deeper training over a simplified problem.

A. Experimental Setup
In any BP training session, for each iteration t, with the MSE obtained at the output layer, E(t), a global adaptation of the learning rate, ε, is performed within the range [5·10^-5, 5·10^-1] as follows:

$$\varepsilon(t) = \begin{cases} \min\!\left(\alpha\,\varepsilon(t-1),\ 5\cdot 10^{-1}\right) & \text{if } E(t) < E(t-1) \\ \max\!\left(\beta\,\varepsilon(t-1),\ 5\cdot 10^{-5}\right) & \text{otherwise} \end{cases}$$

where α=1.05 and β=0.7. Since BP training is a stochastic gradient descent method, for each problem we shall perform 10 BP runs, each with random parameter initialization.
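A minimal sketch of this adaptation rule, assuming the rate grows by α while the error decreases and shrinks by β otherwise, clipped to the stated range:

```python
def adapt_learning_rate(eps, E_t, E_prev, alpha=1.05, beta=0.7,
                        eps_max=5e-1, eps_min=5e-5):
    """Global learning-rate adaptation applied at each BP iteration (sketch)."""
    if E_t < E_prev:
        return min(alpha * eps, eps_max)   # error decreased: speed up, capped at 5e-1
    return max(beta * eps, eps_min)        # error increased: slow down, floored at 5e-5
```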
The operator set library used to form the ONNs for the challenging learning problems in this study is composed of a few essential nodal, pool and activation operators. Table 1 presents the 7 nodal operators along with their derivatives, ∂Ψ/∂w and ∂Ψ/∂y, with respect to the weight, w, and the output, y, of the previous-layer neuron. Similarly, Table 2 presents the two common pool operators and their derivatives with respect to the nodal term, Ψ. Finally, Table 3 presents the two common activation functions (operators) and their derivatives. Using these lookup tables, the error at the output layer can be back-propagated and the weight sensitivities can be computed. The top section of Table 4 enumerates each potential operator set, and the bottom section presents the index of each individual operator set in the operator library, Θ, which will be used in all experiments. There is a total of N=7x2x2=28 sets constituting the operator set library. Let θ_i: {pool, act, nodal} be the i-th operator set in the library. Note that the first (default) operator set, θ_0: {0,0,0}, represents the native operators of a CNN, which performs linear convolutions with the traditional activation function, tanh. In accordance with the activation operators used, the dynamic range of the input/output images in all problems is normalized into the range [-1, 1] as follows:

$$p_i \leftarrow 2\,\frac{p_i - \min(I)}{\max(I) - \min(I)} - 1$$

where $p_i$ is the i-th pixel value in an image, I.
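A small sketch of this normalization into [-1, 1], under the assumption that standard per-image min-max scaling is meant:

```python
import numpy as np

def normalize_to_unit_range(img):
    """Min-max normalize an image into [-1, 1] (sketch of the pre-processing step)."""
    img = img.astype(np.float64)
    lo, hi = img.min(), img.max()
    return 2.0 * (img - lo) / (hi - lo) - 1.0

img = np.random.randint(0, 256, size=(60, 60))   # a thumbnail-sized 8-bit image
x = normalize_to_unit_range(img)
assert x.min() == -1.0 and x.max() == 1.0
```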
As mentioned earlier, the same compact network configuration with only two hidden layers and a total of 48 hidden neurons, Inx16x32xOut, is used in all the experiments. The first hidden layer applies sub-sampling by 2, and the second one applies up-sampling by 2.
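For reference, a rough parameter count of this Inx16x32xOut configuration can be sketched as below, assuming single input and output maps and the 3x3 kernels used throughout; the operator choices add no parameters, so the count matches that of a CNN with the same configuration.

```python
def onn_param_count(channels=(1, 16, 32, 1), k=3):
    """Rough parameter count of the compact Inx16x32xOut configuration with KxK kernels."""
    total = 0
    for c_in, c_out in zip(channels[:-1], channels[1:]):
        total += c_out * (c_in * k * k + 1)   # one KxK kernel per input map plus one bias
    return total

print(onn_param_count())   # 16 + 512 + 32 kernels of 3x3, plus 49 biases -> 5089
```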

B. Evaluation of the Learning Performance
In order to evaluate the learning performance of the ONNs on the regression problems (image denoising, synthesis and transformation), we use the Signal-to-Noise Ratio (SNR), defined as the ratio of the signal power to the noise power, i.e., $\mathrm{SNR} = 10\log_{10}\!\left(\sigma^2_{\mathrm{signal}} / \sigma^2_{\mathrm{noise}}\right)$. The ground-truth image is the original signal, and its difference from the actual output yields the "noise" image. For the (face) segmentation problem, with train and test partitions, we use conventional evaluation metrics such as the classification error (CE) and F1. Given the ground-truth segmentation mask, the final segmentation mask is obtained from the actual output of the network by SoftMax thresholding. With a pixel-wise comparison, the Accuracy (Acc), which is the ratio of the number of correctly classified pixels to the total number of pixels, the Precision (P), which is the rate of correctly classified object (face) pixels among all pixels classified as "face", and the Recall (R), which is the rate of correctly classified "face" pixels among all true "face" pixels, can be computed directly. Then CE = 1 - Acc and F1 = 2PR/(P+R). The following sub-sections present the results and comparative evaluations of each problem tackled by the proposed ONNs and conventional CNNs.
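The evaluation metrics above can be written compactly as follows (a sketch; the thresholding/SoftMax step is assumed to have already produced a binary mask):

```python
import numpy as np

def snr_db(target, output):
    """SNR = 10*log10(signal power / noise power), with noise = target - output."""
    noise = target - output
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(noise ** 2))

def segmentation_scores(pred_mask, gt_mask):
    """Pixel-wise Accuracy, Precision, Recall, F1 and CE for a binary (face) mask."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    tp = np.sum(pred & gt)
    acc = np.mean(pred == gt)
    precision = tp / max(np.sum(pred), 1)
    recall = tp / max(np.sum(gt), 1)
    f1 = 2.0 * precision * recall / max(precision + recall, 1e-12)
    return acc, precision, recall, f1, 1.0 - acc
```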

1) Image Denoising
Image denoising is a popular field where deep CNNs have recently been applied and achieved state-of-the-art performance [36]-[39]. This was an expected outcome since "convolution" is the basis of linear filtering, and a deep CNN with thousands of sub-band filters that can be tuned to suppress the noise in a near-optimal way is a natural tool for image denoising. Therefore, in this particular application we are in fact investigating whether the stacked non-linear filters of an ONN can also be tuned for this task and, if so, whether they can outperform their linear counterparts.
In order to perform comparative evaluations, we use 1500 images from the Pascal VOC database. The gray-scaled and down-sampled original images are the target outputs while the images corrupted by Gaussian White Noise (GWN) are the input. The noise level is kept very high on purpose, i.e., all noisy images have an SNR of 0dB. The dataset is then partitioned into train (10%) and test (90%) sets with 10-fold cross validation. So, for each fold, both network types are trained 10 times by BP over the train partition (150 images) and tested over the rest (1350 images). To evaluate their best learning performances for each fold, we select the best performing networks (among the 10 BP training runs with random initialization). Then the average (of the best) performances over both train and test partitions of the 10-fold cross validation are compared for the final evaluation.
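For illustration, one way to corrupt a normalized image with GWN at a target SNR of about 0dB is sketched below; the exact noise-generation procedure used in the experiments is an assumption here.

```python
import numpy as np

def add_gwn_at_snr(img, snr_db=0.0, rng=None):
    """Corrupt an image with zero-mean Gaussian white noise scaled to a target SNR (dB)."""
    rng = np.random.default_rng(0) if rng is None else rng
    signal_power = np.mean(img ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=img.shape)
    return img + noise

clean = np.random.default_rng(1).uniform(-1.0, 1.0, size=(60, 60))  # a normalized thumbnail
noisy = add_gwn_at_snr(clean, snr_db=0.0)                           # SNR is approximately 0dB
```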
For ONNs, the layer-wise GIS for the best operator set is performed only once (only for the 1st fold) and then the same operator set is used for all the remaining folds. Had it been performed for all the folds, it is likely that different operator sets achieving even higher learning performance levels could have been found for the ONNs. To further speed up the GIS, as mentioned earlier, we keep the output layer as a convolutional layer whilst optimizing only the two hidden layers by GIS. For this problem (over the 1st fold), GIS results in operator set index 9 for both layers, which corresponds to 9:{0, 1, 2} for the pool (summation=0), activation (linear-cut=1) and nodal (sin=2) operators, respectively.
Figure 4 shows the SNR plots of the best CNNs and ONNs at each fold over both partitions. Obviously, in both train and test partitions the ONNs achieve a significant gap of around 1.5dB. It is especially interesting to see that although the ONNs are trained over a minority of the dataset (10%), they can still achieve a similar denoising performance on the test set (between 5 and 5.5dB SNR), while the SNR level of the majority of the (best) CNNs is below 4dB. The average SNR levels of the ONN vs. CNN denoising for the train and test partitions are 5.59dB vs. 4.1dB, and 5.32dB vs. 3.96dB, respectively. For a visual evaluation, Figure 5 shows randomly selected original (target) and noisy (input) images and the corresponding outputs of the best CNNs and ONNs from the test partition. The blurring effect of the linear filtering can be observed in the CNN outputs.

2) Image Synthesis
In this problem we aim to test whether a single network can (learn to) synthesize one or many images from WGN images. This is harder than the denoising problem since the idea is to use the noise samples for creating a certain pattern rather than suppressing them. To make the problem even more challenging, we train a single network to (learn to) synthesize 8 (target) images from 8 WGN (input) images, as illustrated in Figure 6. We repeat the experiment 10 times (folds), so 8x10=80 images in total are randomly selected from the Pascal VOC dataset. The gray-scaled and down-sampled original images are the target outputs while the WGN images are the input. For each trial, we perform 10 BP runs, each with random initialization, and we select the best performing network for each run for comparative evaluations. As in the earlier application, the layer-wise GIS for the best operator set is performed only once (only for the 1st fold) for the two hidden operational layers of the ONNs, and then the same operator set is used for all remaining folds. Over the 1st fold, GIS yields the top-ranked operator sets with indices 3 and 13 for the 1st and 2nd hidden layers, which correspond to: 1) 3:{0, 0, 3} for the pool (summation=0), activation (tanh=0) and nodal (exp=3), respectively, and 2) 13:{0, 1, 6} for the pool (summation=0), activation (linear-cut=1) and nodal (chirp=6), respectively.

Figure 7 shows the SNR plots of the best CNNs and ONNs among the 10 BP runs for each synthesis experiment (fold). Several interesting observations can be made from these results. First, the best SNR level that the CNNs ever achieve is below 8dB, while it is above 11dB for the ONNs. A critical issue is that, in the 4th synthesis fold, none of the BP runs is able to train the CNN to synthesize that batch of 8 images (SNR < -1.6dB). Obviously, it either requires more than 10 BP runs or, more likely, a more complex/deeper network configuration. On the other hand, the ONNs never fail to achieve a reasonable synthesis performance, as the worst SNR level (from fold 3) is still higher than 8dB. The average SNR levels of the CNN and ONN synthesis are 5.02dB and 9.91dB, respectively. Compared to the denoising problem, the performance gap widens significantly since this is now a much harder learning problem. For a visual comparative evaluation, Figure 8 shows a random set of 14 synthesis outputs of the best CNNs and ONNs together with the target images. The performance gap is also clear here, especially as some of the CNN synthesis outputs suffer from severe blurring and/or textural artefacts.

3) Face Segmentation
Face or object segmentation (commonly referred to as "semantic segmentation") is a common application domain, especially for deep CNNs [41]-[50]. In this case the input is the original image and the output is the segmentation mask, which can be obtained by simply thresholding the output of the network. In this section we perform comparative evaluations between CNNs and ONNs for face segmentation. In [41], an ensemble of compact CNNs was tested against a deep CNN, and that study showed that a compact CNN with a few convolutional layers and dozens of neurons is capable of learning certain face patterns but may fail for other patterns. This is why an ensemble of compact CNNs was used in a "Divide and Conquer" paradigm.
In order to perform comparative evaluations, we use the FDDB face detection dataset [51], which contains 1000 images with one or more human faces in each image. We keep the same experimental setup as in the image denoising application: the dataset is partitioned into train (10%) and test (90%) sets with 10-fold cross validation. So, for each fold, both network types are trained 10 times by BP over the train partition (100 images) and tested over the rest (900 images). To evaluate their best learning performances for each fold, we select the best performing networks, and their average performances (over both train and test partitions) of the 10-fold cross validation are compared for the final evaluation.
Figure 9 shows the F1 plots of the best CNNs and ONNs at each fold over both partitions. The average F1 scores of the CNN vs. ONN-1 and ONN-3 (the ONNs formed with the top-ranked and 3rd-ranked operator sets found by GIS) for the train and test partitions are 58.58% vs. (87.4% and 79.86%), and 56.74% vs. (47.96% and 59.61%), respectively. As expected, ONN-1 achieves the highest average F1 in all folds on the train partition, around 29% higher than the segmentation performance of the CNNs. Despite the compact configuration, this indicates over-fitting, since its average generalization performance over the test partition is around 8% lower than the average F1 score of the CNN. Nevertheless, ONN-3 shows a superior performance level in both train and test partitions, by around 21% and 3%, respectively. Since GIS is performed over the train partition, ONN-3 may also suffer from over-fitting, as there is a significant performance gap between the train and test partitions. This can be addressed, for instance, by performing GIS over a validation set to find the (near-)optimal operator set that generalizes best.

4) Image Transformation
Image transformation (sometimes called image translation) is the process of converting one (set of) image(s) to another. Deep CNNs have recently been used for certain image translation tasks [52], [53] such as edge-to-image, gray-scale-to-color, and day-to-night (or vice versa) photo translation. In all these applications, the input and output (target) images are closely related. In this study we tackle a more challenging image transformation: transforming an image to an entirely different image. This is also much harder than the image synthesis problem because the task is now the creation of a (set of) image(s) from another with a distinct pattern and texture. To make the problem even more challenging, we train a single network to (learn to) transform 4 input images into 4 (target) images, as illustrated in Figure 10 (left). In the first fold, we further test whether the networks are capable of learning "inverse" problems, meaning that the same network has to transform a pair of input images to another pair of output images and also do the opposite (the output images become the input images and vice versa). The images used in the first fold are shown in Figure 10 (left). We repeat the experiment 10 times using close-up "face" images, most of which are obtained from the FDDB face detection dataset [51]. The gray-scaled and down-sampled images are used as both input and output. For CNNs we perform 10 BP runs, each with random initialization, and for comparative evaluations we select the best performing network for each run.

For ONNs, we perform a 2-pass GIS for each fold, and each BP run within the GIS is repeated 10 times to evaluate the next operator set assigned.
In the first fold, the outputs of both networks are shown in Figure 10 (right). The GIS results in the optimal operator set with indices 0 and 13 for the 1st and 2nd hidden layers, corresponding to: 1) 0:{0, 0, 0} for the pool (summation=0), activation (tanh=0) and nodal (mul=0), respectively, and 2) 13:{0, 1, 6} for the pool (summation=0), activation (lin-cut=1) and nodal (chirp=6), respectively. The average SNR level achieved is 10.99dB, which is one of the highest among all 10 folds despite the fact that in this fold the ONNs are trained for the transformation of two inverse problems. On the other hand, we had to use three distinct configurations for the CNNs, because the CNN with the default configuration and the populated configuration, CNNx4 (a CNN with twice the default number of hidden neurons, 2x48=96), both failed to perform a reasonable transformation. Even though CNNx4 has twice as many hidden neurons (i.e., 1x32x64x1) and around 4 times more parameters, the best BP training among 10 runs yields an average SNR of 0.172dB, which is only slightly higher than the average SNR of 0.032dB obtained by the CNN with the default configuration. Even when we then simplified the problem significantly by training a single CNN to transform only one image (rather than 4), whilst still using the CNNx4 configuration, the average SNR only improved to 2.45dB, which is still far below an acceptable performance level since the output images are totally unrecognizable.
Figure 11 shows the results for the image transformations of the 3rd and 4th folds. A noteworthy difference with respect to the 1st fold is that in both folds the 2-pass GIS results in a different operator set for the 1st hidden layer, with operator index 3:{0, 0, 3} for the pool (summation=0), activation (tanh=0) and nodal (exp=3), respectively. The average SNR levels achieved are 10.09dB and 13.01dB, respectively. In this figure we skip the outputs of the CNN with the default configuration since, as in the 1st fold, it has entirely failed (i.e., the average SNRs are -0.19dB and 0.73dB, respectively). This is also true for the CNNx4 configuration, even though a significant improvement is observed (average SNR levels of 1.86dB and 2.37dB, respectively). An important observation is that these levels are significantly higher than the corresponding SNR level for the 1st fold, since both folds (transformations) are relatively easier than the transformation of the two inverse problems in the 1st fold. However, the transformation quality is still far from satisfactory. Finally, when the problem is significantly simplified as before, that is, a single CNN is trained to learn the transformation of only one image pair (1→1), CNNx4 can achieve an average SNR level of 2.54dB, which is still far from satisfactory. This is true for the remaining folds, and over the 10 folds the average SNR levels for the ONNs, the CNNs, and the two CNNx4 configurations are 10.42dB, -0.083dB, 0.24dB (4→4) and 2.77dB (1→1), respectively. This indicates that a significantly more complex and deeper configuration is needed for CNNs to achieve a reasonable transformation performance. In order to further investigate the role of the operators on the learning performance, we keep a log of the operator sets evaluated during the 2-pass GIS. For the 1st fold of the image transformation problem, Figure 12 shows the average MSE obtained during the 2-pass GIS. Note that the output layer's (layer-3) operator set is fixed in advance as 0:{0, 0, 0} and excluded from the GIS. This plot clearly indicates which operator sets are best suited for this problem and which are not. Obviously, the operator sets with indices 6:{0, 0, 6} and 13:{0, 1, 6} in layer-2 get the top ranks, both of which use the pool operator summation and the nodal operator chirp. For layer-1, both of them favor the operator set with index 0:{0, 0, 0}.
Interestingly, the 3rd-ranked assignment uses the operator set 13:{0, 1, 6} in layer-2 and 16:{1, 0, 2} in layer-1, the latter corresponding to the pool (median=1), activation (tanh=0) and nodal (sin=2), respectively. The pool operator median is also used in the 5th-ranked operator set for layer-1. For all the problems tackled in this study, although it never reached the top-ranked operator set for any layer, it obtained 2nd or 3rd ranks in some of the problems. Finally, an important observation worth mentioning is the ranking of the native operator set of a CNN, with operator index 0:{0, 0, 0}, which was evaluated twice during the 2-pass GIS. In both evaluations, among the 10 BP runs performed, the minimum MSE obtained was close to 0.1, which makes it only the 17th and 22nd best operator set among all the sets evaluated. This means that there are at least 16 operator sets (or equivalently, 16 distinct ONN models, each with a different operator set but the same network configuration) that yield a better transformation performance than the CNN's. This is, in fact, a "best-case" scenario for CNNs because: 1) GIS cannot evaluate all possible operator set assignments to the two hidden layers (1 and 2), so there are possibly more than 16 operator sets that yield a better performance than the CNN's; and 2) had we not fixed the operator set of the output layer to 0:{0, 0, 0}, it would likely be possible to find many more operator assignments to all three layers (2 hidden + 1 output) that may even surpass the performance levels achieved by the top-ranked operator sets, (0, 13, 0).

V. CONCLUSIONS
The proposed Operational Neural Networks (ONNs) are inspired by two basic facts: 1) bio-neurological systems, including the mammalian visual system, are based on heterogeneous, nonlinear neurons with varying synaptic connections, and 2) the corresponding heterogeneous ANN models encapsulating nonlinear neurons (aka GOPs) have recently demonstrated a learning performance that cannot be achieved by their conventional linear counterparts (e.g., MLPs) unless significantly deeper and more complex configurations are used [32]-[35]. Empirically speaking, these studies have shown that only heterogeneous networks with the right operator set and a proper training can truly provide the required kernel transformation to discriminate the classes of a problem, or to approximate the underlying complex function. In neuro-biology this fact has been revealed as "neuro-diversity" or, more precisely, "the bio-chemical diversity of the synaptic connections". Accordingly, this study has begun from the point where the GOPs left off and has extended them to design the ONNs in the same way MLPs were extended to realize conventional CNNs. With the same two restrictions, i.e., "limited connections" and "weight sharing", heterogeneous ONNs can now perform any (linear or non-linear) operation. Our intention is to evaluate convolutional vs. operational layers/neurons; hence we exclude the fully-connected layers to focus solely on this objective. Moreover, we have selected very challenging problems while keeping the network configurations compact and shallow, and the BP training brief. Further restrictions are applied to the ONNs, such as a limited operator set library with only 7 nodal and 2 pool operators, and the 2-pass GIS is performed to search for the best operators only for the two hidden layers while keeping the output layer as a convolutional layer. As a result, such a restricted and layer-wise homogeneous (network-wise heterogeneous) ONN implementation allows us to evaluate its "baseline" performance against equivalent and much more complex CNNs.
In all problems tackled in this study, ONNs exhibit a superior learning capability over CNNs, and the performance gap widens as the severity of the problem increases. For instance, in image denoising, the gap between the average SNR levels on the train partition was around 1.5dB (5.59dB vs. 4.1dB). On a harder problem, image synthesis, the gap widens to nearly 5dB (9.91dB vs. 5.02dB), and on a few folds the CNN failed to synthesize the images with a reasonable quality. Finally, on the hardest problem of all, image transformation, the gap exceeds 10dB (10.94dB vs. -0.08dB); in fact, the CNN with the default configuration failed to transform in all folds. This remains true even when a 4-times more complex CNN model is used and the problem is significantly simplified. This is actually not surprising, since a detailed analysis performed during the 2-pass GIS showed that there are at least 16 other potential ONN models with different operator sets that perform better than the CNN. So, for some relatively easier problems, "linear convolution" for all layers can indeed be a reasonable or even a sub-optimal choice (e.g., object segmentation or even image denoising), whereas for harder problems CNNs may entirely fail (e.g., image synthesis and transformation) unless significantly deeper and more complex configurations are used. The problem therein lies mainly in the "homogeneity" of the network when the same operator set is used for all neurons/layers. This observation was verified in the 1st fold of the image transformation problem, where it sufficed to use a different nonlinear operator set only for a single layer (layer-2, operator set 13:{0, 1, 6}) while all other layers were convolutional. This also shows how crucial it is to find the right operator set for each layer.
The aforementioned analysis further revealed that there can be more than one "right" operator set per layer yielding a similar performance level. For example, on another problem, face segmentation, we observed that the top-ranked operator sets found by GIS for layers 2 and 3 achieved an around 29% higher average F1 score than the best CNN on the train set; however, this caused a more than 8% lower F1 score on the test partition. This is clearly an over-fitting problem, despite the fact that the network configuration is compact. On the other hand, the ONN formed with the 3rd-ranked operator set achieved around 21% and 3% higher average F1 scores on the train and test sets, respectively, even though GIS was still performed over the train set. Obviously, this operator set should be preferred over the top-ranked one if the generalization performance is the main objective. In order to maximize the generalization performance for problems with a validation set, GIS can instead search for the operators that maximize the performance on the validation set. This would, for instance, result in an even better generalization performance than the one achieved by the 3rd-ranked operator set of the ONN in the face segmentation problem. How to maximize the generalization performance of ANNs is a well-investigated topic with many approaches proposed to date; however, this is beyond the scope of this study, since we focus on the evaluation of the "learning" performance.
Our future studies to improve this "baseline" ONN implementation will focus on: 1) enriching the operator set library by accommodating other major pool and nodal operators; 2) forming "layer-wise heterogeneous" ONNs for a superior diversity; and 3) instead of a greedy-search method such as GIS over a limited set of operators, using a global search methodology that can incrementally design the optimal non-linear operator during the BP iterations.

Figure 1 :
Figure 1: A biological neuron (left) with the direction of the signal flow and a synapse (right).

Figure 3 :
Figure 3: Three consecutive convolutional (left) and operational (right) layers with the k-th neuron of a CNN (left) and an ONN (right).

Figure 4 :
Figure 4: Best denoising SNR levels for each fold achieved in train (top) and test (bottom) partitions.

Figure 5 :
Figure 5: Randomly selected original (target) and noisy (input) images from the test partition and the corresponding outputs of the best CNNs and ONNs.

Figure 6 :
Figure 6: The outputs of the BP-trained ONN with the corresponding input (WGN) and target (original) images from the 2nd synthesis fold.

Figure 10 :
Figure 10: Image transformation of the 1st fold including two inverse problems (left) and the outputs of the ONN and the CNN with the default configuration, and the two CNNs (CNNx4) with 4 times more parameters. At the bottom, the numbers of input → target images are shown.

Figure 11 :
Figure 11: Image transformations of the 3rd (top) and 4th (bottom) folds and the outputs of the ONN and the two CNNs (CNNx4) with 4 times more parameters. At the bottom, the numbers of input → target images are shown.

Figure 12 :
Figure 12: MSE plot during the 2-pass GIS operation for the 1st fold. The top-5 ranked operator sets found for the three layers, (3rd, 2nd, 1st), are shown in parentheses. The native operator set of CNNs, (0, 0, 0), with operator set index 0, only gets the 17th and 22nd ranks among the operator sets searched.