“SPOCU”: scaled polynomial constant unit activation function

We address the following problem: given a set of complex images or a large database, the numerical and computational complexity and the quality of approximation of a neural network may differ drastically from one activation function to another. A novel general methodology, the scaled polynomial constant unit activation function "SPOCU," is introduced and shown to work satisfactorily on a variety of problems. Moreover, we show that SPOCU can outperform established activation functions with good properties, e.g., SELU and ReLU, on generic problems. To explain the good properties of SPOCU, we provide several theoretical and practical motivations, including a tissue growth model and memristive cellular nonlinear networks. We also provide an estimation strategy for the SPOCU parameters and relate it to the generation of a random type of Sierpiński carpet, connected to the [pppq] model. One of the attractive properties of SPOCU is its genuine normalization of the output of layers. We illustrate the SPOCU methodology on cancer discrimination, including mammary and prostate cancer and data from the Wisconsin Diagnostic Breast Cancer dataset. Moreover, we compare SPOCU with SELU and ReLU on the large MNIST dataset, where its very good performance further justifies its usefulness.


Introduction
Neural computing and activations in neural networks are multidisciplinary topics, ranging from neuroscience to theoretical statistical physics. One of the most important theoretical and practical questions in neural networks and neural computing is the choice of an appropriate activation function. The activation function and its nonlinearity form the core of a neural network, both deep and shallow, across architectures. In standard problems with a reasonable nonlinearity, the sigmoidal function is well justified; however, one then faces the inverse problem of specifying the underlying parameters, otherwise the network may perform slowly. Recently, several papers on constructing more adequate activation functions have appeared, mainly comparing activation functions on large datasets. To illustrate some important contributions: the influence of the activation function in the convolutional neural network (CNN) model is studied in [24], improving the ReLU activation by construction of a novel surrogate. A theoretical analysis of gradient instability, a fundamental explanation of exploding/vanishing gradients and the performance of different activation functions are given in [13].
The main contribution of this paper is the construction and testing of the novel scaled polynomial constant unit (SPOCU) activation function. This activation function relates to complexity patterns through the phenomenon of percolation, and thus it can outperform established activation functions such as SELU and ReLU. In statistical physics and mathematics, percolation theory describes the behavior of a network when nodes or links are removed or when complex patterns are embedded into the learning process. SPOCU thus helps to fill a gap in the theory: it ''picks up'' the appropriate properties of an activation function directly from training on the classification of complex patterns, e.g., cancer images. Such efforts can contribute to many applications. One example is the generation of artificial networks applied in the field of Turing pattern generator networks, since they require a two-layer coupling (two ''diffusion'' mechanisms having different diffusion coefficients) of resistor couplings. Turing patterns are nonlinear phenomena which appear in reaction-diffusion systems based on the memristive cell. For Turing patterns in the simplest memristive cellular nonlinear networks (MCNNs), see [2], where a new MCNN model is proposed, consisting of a two-dimensional array of cells made of only two components: a linear passive capacitor and a nonlinear active memristor. In fact, cellular nonlinear networks are perfectly suited to the structure of reaction-diffusion systems, since they can be used to map partial differential equations [8]. MCNNs are intrinsically related to percolation; this relation is outlined in Sect. 3; for more, see, e.g., Chapter 18 in [6].
The paper is organized as follows. In Sect. 2.1, we introduce a random fractal construction useful for our purposes. We give an overview of the problem of generating a random type of Sierpiński carpet, related to the [pppq] model introduced in [9]. We emphasize the importance of the Kronecker product, which offers a simple and quick way to generate random fractals. We also present several useful results concerning fractal geometry and the ''average'' fractal dimension. In Sect. 3, we deal with the percolation threshold, which is an important instrument for SPOCU. Using a direct approach and logistic regression, we derive estimates of this critical value for specific random models. In Sect. 4, we use the basic generator of the random Sierpiński carpet as a new activation function, SPOCU, for a self-normalizing neural network. We study general activation functions, with the main focus on SPOCU. As normalizing the output of layers is known to be a very efficient way to improve the performance of neural networks, we can conclude that the SPOCU activation function behaves very well. Furthermore, we provide theoretical justification that SPOCU outperforms both SELU and ReLU in several desirable properties necessary for the correct classification of complex images. SPOCU also achieves uniformly better performance with respect to generic properties on the large MNIST database. This is illustrated in the final Sect. 6 on image-based discrimination for cancer diagnostics. There we apply the developed methodology to cancer discrimination problems; namely, we consider image-based discrimination of mammary cancer versus mastopathy and of benign prostatic hyperplasia or normal prostate versus prostate cancer. A comparison of SPOCU, ReLU and SELU on the Wisconsin Diagnostic Breast Cancer (WDBC) dataset confirms the qualities of SPOCU.
Technicalities and proofs are given in ''Appendix A.''

Random fractals, Kronecker product and fractal dimension
Many fractal constructs have random analogues; see Chapter 15 in [6]. Here, we start with a natural motivation of the random Sierpiński carpet (SC). We introduce a simple matrix computation expressing the form of fractals, including the random SC.

Random Sierpiński carpet
The Sierpiński carpet (SC) is the fractal which can be constructed by taking the square [0, 1]², dividing it into nine equal squares of side length 1/3 and removing the central one. This procedure is then repeated for each of the eight remaining squares and iterated infinitely many times. The carpet is the resulting fractal and has Hausdorff dimension d_f = ln 8/ln 3. We can equivalently construct this fractal by string rewriting, beginning with a cell 1 and iterating the rewriting rules. The random SC is obtained by beginning again with a cell 1, where B_{i,p_n}, i = 1, ..., 8, are mutually independent random variables generated in the nth iteration from uniformly distributed random variables U_{i,n} and a prescribed value p_n ∈ [0, 1] as B_{i,p_n} = 1 if U_{i,n} ≤ p_n and 0 otherwise. Thus, the random entries follow a Bernoulli distribution obtained by the inversion method. Notice that the [pppq] model of [9] is included in this construction. The notation naturally generalizes to the [p...pq] model and also to the [p_1...p_{n-1}p_n] model.
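As a small illustration, the inversion step above can be sketched as follows (our own helper, not code from the paper; the function name is ours):

```python
import numpy as np

def bernoulli_cells(p, size, rng=None):
    """Generate Bernoulli(p) cells by the inversion method:
    B = 1 if U <= p, else 0, with U ~ Uniform(0, 1)."""
    rng = np.random.default_rng(rng)
    u = rng.uniform(size=size)
    return (u <= p).astype(int)
```

For p = 1 every cell is retained and for p = 0 every cell is removed, matching the deterministic limits of the construction.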

Remark 2.1
Notice also that in [5, Example 1.1] the authors consider a generalization of Mandelbrot's percolation process, but their approach is different from ours (it depends on the set of indices generating zeros and ones): with probability p they generate the zero 3 × 3 matrix, and with probability 1 − p a matrix with a zero element only in the middle.

Fractals induced by Kronecker product
The Kronecker product (also called the direct matrix product), denoted ⊗, is a special case of the tensor product. It is an operation on two matrices of arbitrary size resulting in a block matrix; see ''Appendix A.'' The multiple Kronecker product is defined recursively. Some fractals (see the examples in ''Appendix A.2'') can be represented by multiple Kronecker products [20, chap. 9]. In [19], two methods for generating images of (approximations to) fractals and fractal-like sets are given: iterated Kronecker products and iterated matrix-valued homomorphisms. For the random alternative, we use the notation ⊗̂. From a dynamical point of view, a random fractal can be understood as an iterated Kronecker product of independent random generator matrices; since we assume mutual independence, the expectations factorize, see [7]. We also define the matrix M_r^p := M_r^{p,...,p}, i.e., M_r^p = ⊗̂_{i=1}^r X_i^{p_i} with p_i = p for all i ∈ {1, ..., r}. The Lemma in ''Appendix A.5'' gives us the mean number of retained elements; recall that in the deterministic case the SC has exactly 2^{3n} elements in the nth iteration. Therefore, the average number of retained elements of Y_n is N_n = 2^{3n} ∏_{j=1}^n p_j, which reduces to (8p)^n in the case p_j = p. From this, we see the balance threshold p_b = 1/8: for p < p_b, convergence to zero in average is obtained. (A similar argument applies to the modified SC with 9 random elements, where 2^{3n} has to be replaced by 3^{2n}.)
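A random SC approximation can then be generated by iterated Kronecker products; the sketch below is our own illustrative code for the [p...p] model, using the 3 × 3 SC generator with Bernoulli masking of its cells:

```python
import numpy as np

# 3x3 Sierpinski-carpet generator: keep all cells except the center.
GENERATOR = np.array([[1, 1, 1],
                      [1, 0, 1],
                      [1, 1, 1]])

def random_sc(n, p, rng=None):
    """n-th iteration of the random Sierpinski carpet in the [p...p] model.

    Each iteration takes the Kronecker product with a randomized copy of
    the generator, in which every retained cell survives with probability p.
    Returns a 3^n x 3^n 0/1 matrix."""
    rng = np.random.default_rng(rng)
    m = np.array([[1]])
    for _ in range(n):
        mask = (rng.uniform(size=GENERATOR.shape) <= p).astype(int)
        m = np.kron(m, GENERATOR * mask)
    return m
```

For p = 1 this reproduces the deterministic SC with 8^n ones in the nth iteration; for p < 1/8 the expected number of retained cells, (8p)^n, tends to zero, as noted above.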

Geometry and dimension
Now we describe the model geometrically, similarly as in [3]. Let A_0 = [0, 1]² and let A_1 be the union of the retained closed subsquares of side 1/3 after the first iteration. To define A_2, we repeat the last construction inside each retained square. More generally, we let B_{ij}^n = [(i−1)/3^n, i/3^n] × [(j−1)/3^n, j/3^n] and let A_n be the union of the retained squares B_{ij}^n. Clearly {A_k}, k ∈ N, is a decreasing sequence of compact sets, so the limit A_∞ = ∩_{n∈N} A_n exists. The next theorem follows directly from the average number of retained elements N_n and from the fact that we obtain a branching process in which each particle has on average 8p offspring. This is closely related to the estimation of the fractal dimension; see [27]. Here, we deal with the ''average'' fractal dimension, or fractal dimension in mean. We already have N_n, the average count of retained elements in A_n, and we denote by s_n the scale in the nth step; in our case s_n = 3^n. Now, using least-squares fitting on the data {(ln s_i, ln N_i)}, we obtain the slope sl_n of the fitted line, whereas max{0, sl_n} is the nth approximation of the fractal dimension. Naturally, for p = 1 this coincides with the standard (deterministic) Sierpiński carpet. See Fig. 1, where the results of the [p_1p_2p_3p_4] model are plotted. Now denote by p the sequence (vector) of probabilities {p_j}, j ∈ N. Notice that it need not be a probability vector, since we do not require its entries to sum to one. The condition for A_∞ to have positive measure determines a hyperbola; this means that if P is small, Q can ''save'' the situation.
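The ''average'' fractal dimension above is just the least-squares slope of ln N_i against ln s_i; a minimal sketch (our own helper, not code from the paper):

```python
import numpy as np

def avg_fractal_dimension(n, p):
    """n-th approximation of the average fractal dimension of the random SC
    in the [p...p] model: least-squares slope of ln N_i vs. ln s_i, with
    N_i = (8p)^i and s_i = 3^i, floored at 0."""
    i = np.arange(1, n + 1)
    log_s = i * np.log(3.0)        # ln s_i = i ln 3
    log_n = i * np.log(8.0 * p)    # ln N_i = i ln(8p)
    slope = np.polyfit(log_s, log_n, 1)[0]
    return max(0.0, slope)
```

Since the points are exactly collinear here, the slope equals ln(8p)/ln 3; for p = 1 this gives the deterministic SC dimension ln 8/ln 3 ≈ 1.8928, and for p < 1/8 the flooring at 0 applies.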

Theorem 2.4 For any vector of probabilities p, the average fractal dimension lies in the interval [0, ln 8/ln 3].
Moreover, for every value from this interval there exists a vector of probabilities p attaining it.
The examples in ''Appendix A.3'' show that a constant vector and a nonconstant vector may lead to a similar fractal dimension. Now we introduce an assertion related to the expected matrix.
For the specific case of B_i^{j,p} and under the assumption of independence, the number of retained elements in each factor follows Bin(8, p_i); in particular, the expected number of retained elements of M_r^p is (8p)^r for p_i = p (which means that already for moderately small p_i the probability of a nonzero Z is close to zero). We have estimated the probability P_all of the event that the matrix M_r^p is the zero matrix (empty) and the probability P_col of the event that at least one column of M_r^p is zero. The probability P_all, i.e., that Z is zero, is given by the inclusion-exclusion principle. Both are specific percolation events, especially the latter, since then the bottom can be reached. The results are summarized in Table 6.

Percolation threshold
Percolation itself is a physically well-motivated phenomenon and is intrinsically related to memristor models (see [22]); thus, it can be well involved in neural modeling of memristive cellular nonlinear networks (MCNNs). The combination of percolation theory and Monte Carlo simulation provides a possible way to model ion migration and electron transport in an amorphous system even from the hardware perspective. The best-known mathematical percolation model takes a regular lattice, e.g., a square lattice N × N, and makes it into a random network by randomly ''occupying'' sites (vertices) or bonds (edges) with statistically independent random variables. For example, each of the lattice sites is either occupied (with probability p) or vacant (with probability 1 − p). This is commonly known as the site percolation problem. There is also a related bond percolation problem, which can be posed in terms of whether or not the edges between neighboring sites are open or closed. It is well known that there is a well-defined range of p for which the probability of percolation decreases rapidly from one to zero. It is centered on a critical value p_c called the percolation threshold. Percolation clusters become self-similar precisely at the threshold density p_c for sufficiently large length scales, entailing the asymptotic power law M(L) ~ L^{d_f} at p = p_c for large probe sizes L → ∞; i.e., the fractal dimension d_f (e.g., the Hausdorff dimension dim_H) describes the scaling of the mass M of a critical cluster within a distance L (it characterizes how the mass M(L) changes with the linear size L of the system); see [21]. (Fig. 1 panels: probability vectors (a) (0.9, 0.7, 0.8, 0.7), (b) (0.6, 0.6, 0.6, 0.6), (c) (0.5, 0.7, 0.8, 0.7).)
This follows from the following idea: if we consider a smaller part of a system of linear size bL (b < 1), then M(bL) is decreased by a factor b^{d_f}, i.e., M(bL) = b^{d_f} M(L). Depending on the method of obtaining the random network, one usually distinguishes between the site percolation threshold and the bond percolation threshold. More general systems may have several probabilities p_1, p_2, etc., and the transition is characterized by a critical surface or manifold. In the classical systems, it is assumed that the occupation of a site or bond is completely random; this is the so-called Bernoulli percolation. Here, we want to emphasize that the random SC does not fall under these models. In 1974, Mandelbrot [14] introduced a process in [0, 1]² which he called ''canonical curdling.'' It is nothing else than the model from Remark ''Appendix A.1'' with B_{ij,n}^p = B^p. In the paper [3], the authors study the connectivity or ''percolation'' properties of such sets. They showed that there is a probability p_c ∈ (0, 1) such that if p < p_c then the set is ''dustlike,'' whereas if p ≥ p_c opposing sides are connected with positive probability. To be precise, we introduce here the exact definition of p_c. Let B_n = {x ∈ A_n : x can be connected to [0, 1] × {0} and [0, 1] × {1} by paths in A_n}, and let X denote the corresponding crossing event; notice that when X occurs there is an up-to-down crossing of [0, 1]². Finally, p_c = inf{p : P(X) > 0}. For example, from [3] we have that for Mandelbrot percolation p_c < 0.9999. Here, we have to emphasize that for an approximation of A_∞ we cannot use this kind of definition. We simply set p_c as the value of p at which P(B_n = ∅) = 1/2.

Estimation of a single parameter p
Here, we consider the [p...p] model. We have used the flow through the generated lattice (represented by the matrix) and a recursive depth-first search (checking whether or not the flow makes it to the bottom of the grid). The graphs of the simulated data possess a sigmoidal shape, which signals the presence of a threshold. Moreover, we have a categorical dependent variable represented by the outcomes pass/fail. These facts indicate that, to determine the threshold, we should fit a logistic (log-odds) model to the simulated data; the threshold value for p (or 1 − p) is then estimated as the point where the fitted percolation probability equals 1/2. Related sigmoidal threshold fits appear in other fields as well, e.g., in a study of strategies on the spatial structure of a population considering a spatially explicit birth-death model. A discussion of how different sigmoidal models can be applied to predict the percolation threshold of electrical conductivity for ethylene vinyl acetate (EVA) copolymer and acrylonitrile butadiene (NBR) copolymer conducting composite systems filled with different carbon fillers is given in [17]. On the other hand, an experiment using the phenomenon of percolation has been conducted to demonstrate the implementation of neural functionality in [16], where the curve was found to be almost exactly described by the sigmoid form.
In Table 1, we present the results for specific r; see also Figs. 3 and 4, where the simulations of the percolation probability for given p are shown. The logistic model fits the curve very well. However, for r = 1 (the important generator case) we can explicitly find the analytic form of the model function. By going through all the possibilities, we can directly find the polynomial, which can be simplified to P(p) = p³(p⁵ − 2p⁴ + 2). The threshold value, where P(p) = 1/2, is given approximately as 0.341 (in terms of 1 − p). See Fig. 3 for the excellent fit, and notice that the logit model's estimate, 0.3498, is very close to this value.
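The flow check and the threshold computation can be sketched as follows (our own code; the crossing convention of the depth-first search is illustrative, and the polynomial is the simplified form given above, with the threshold reported in terms of 1 − p):

```python
def percolates(grid):
    """Depth-first search (iterative): can the flow reach the bottom row
    of a 0/1 grid moving through 1-cells with 4-neighbour steps?"""
    rows, cols = len(grid), len(grid[0])
    seen, stack = set(), [(0, c) for c in range(cols) if grid[0][c]]
    while stack:
        r, c = stack.pop()
        if (r, c) in seen:
            continue
        seen.add((r, c))
        if r == rows - 1:
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols and grid[rr][cc]:
                stack.append((rr, cc))
    return False

def P(p):
    """Simplified percolation polynomial for the r = 1 generator."""
    return p**3 * (p**5 - 2 * p**4 + 2)

# P is increasing on [0, 1]; bisect for the point where P(p) = 1/2.
lo, hi = 0.0, 1.0
for _ in range(60):
    mid = (lo + hi) / 2.0
    lo, hi = (mid, hi) if P(mid) < 0.5 else (lo, mid)
p_half = (lo + hi) / 2.0   # the text reports the threshold as 1 - p_half
```

Here 1 − p_half ≈ 0.341, matching the value reported above.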

Estimation of two parameters p and q
Here, we assume two parameters p and q, i.e., we consider the [pq] or [pppq] model. A binomial logistic model is again fitted to the simulated data, and the threshold values for p (or 1 − p) and q are estimated from the condition that the fitted probability equals 1/2, which implies the relationship 0 = β₀ + β₁p_c + β₂q_c; the estimates of p_c and q_c therefore lie on this (part of a) line. This is obviously a different situation, since we have infinitely many solutions unless some suitable constraint p = f(q) (such that the intersection of the constraint curve and this line is nonempty) is given. Notice that even for a given constraint, more than one solution may be obtained. If we, for example, assume f = id, i.e., p = q, we recover the one-parameter estimation model of the previous section. Notice, however, that if we add a mixed term into the regression, the critical set 0 = β₀ + β₁p_c + β₂q_c + β₃p_c q_c becomes a hyperbola rather than a line.

Self-normalizing neural network
Self-normalizing neural networks (SNNs) are expected to be robust to perturbations and not to have high variance in their training errors. SNNs push neuron activations toward zero mean and unit variance, thereby producing the same effect as batch normalization, which makes it possible to train many layers robustly. Here, we introduce our SNNs based on the percolation function (5). For a neural network with activation function Ac, we consider two consecutive layers connected by a weight matrix W. We assume that all activations x_i of the lower layer have the same mean E[x_i] = μ and variance V[x_i] = σ² and are mutually independent. A single activation y = Ac(z) in the higher layer has network input z = wᵀx, mean E[y] = μ̃ and variance V[y] = σ̃². From this, we obtain E[z] = Σ_{i=1}^n w_i E[x_i] = μ Σ_{i=1}^n w_i =: μω and V[z] = Σ_{i=1}^n w_i² V[x_i] = σ² Σ_{i=1}^n w_i² =: σ²τ². The central limit theorem implies, under regularity conditions, that z ~ N(μω, σ²τ²), i.e., the pdf of z has the form f(z) = (1/√(2πσ²τ²)) exp(−(z − μω)²/(2σ²τ²)). Consider now the vector mapping g that maps the mean and variance of the activations of one layer to the mean and variance of the activations of the next layer, i.e., g₁(μ, σ²) = μ̃ and g₂(μ, σ²) = σ̃². The following definition recalls the notion of a self-normalizing neural network.
Definition 4.1 (Self-normalizing neural net) ([11]) We say that a neural network is self-normalizing if it possesses a mapping g : Ω → Ω for each activation y that maps the mean and variance from one layer to the next and has a stable and attracting fixed point depending on ω and τ² in Ω := [μ_min, μ_max] × [σ²_min, σ²_max]. Furthermore, g(Ω) ⊆ Ω.
When iteratively applying the mapping g, each point within Ω converges to this fixed point. For an arbitrary activation function Ac(z), the mapping g is given by the relations μ̃(μ, ω, σ², τ², a) = E_f[Ac(z; a)] (8) and σ̃²(μ, ω, σ², τ², a) = E_f[Ac²(z; a)] − μ̃². Obviously, the moments are given by the integrals E_f[Ac(z; a)] = ∫_R f(z) Ac(z; a) dz and E_f[Ac²(z; a)] = ∫_R f(z) Ac²(z; a) dz. [11] proposed ω = 0 and τ² = 1 for all units in the higher layer for the weight initialization. If the Jacobian of g has a norm smaller than 1 at the fixed point, then g is a contraction mapping and the fixed point is stable. The goal is to find parameters a ∈ A such that the fixed point (0, 1) exists and ||J_g(0, 1)|| < 1. This problem is formulated as: find a ∈ A such that 0 = E_{f(z;0,1)}[Ac(z; a)], 1 = E_{f(z;0,1)}[Ac²(z; a)] (10) and ||J|| < 1, where the Jacobian (here we omit the parameters for the sake of short notation) can be simplified accordingly.
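To make the fixed-point condition (10) concrete, the following sketch (our own illustration; it uses SELU from [11], whose published parameters λ, α are known to yield the fixed point (0, 1), rather than SPOCU itself, whose exact form is Eq. (11) of the full text) checks numerically that E_f[Ac(z)] ≈ 0 and E_f[Ac²(z)] ≈ 1 at ω = 0, τ² = 1:

```python
import numpy as np

# Published SELU parameters (lambda, alpha) from Klambauer et al. [11].
LAM, ALPHA = 1.0507009873554805, 1.6732632423543772

def selu(z):
    return LAM * np.where(z > 0, z, ALPHA * np.expm1(z))

def moments(ac, mu_omega=0.0, var=1.0, n=200001, span=12.0):
    """E_f[Ac(z)] and E_f[Ac^2(z)] for z ~ N(mu_omega, var),
    via trapezoidal quadrature of the Gaussian integrals."""
    sd = var ** 0.5
    z = np.linspace(mu_omega - span * sd, mu_omega + span * sd, n)
    pdf = np.exp(-((z - mu_omega) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
    dz = z[1] - z[0]
    trap = lambda y: dz * (y.sum() - 0.5 * (y[0] + y[-1]))
    return trap(ac(z) * pdf), trap(ac(z) ** 2 * pdf)

m1, m2 = moments(selu)   # at the fixed point: mean ~ 0, second moment ~ 1
```

The same quadrature applied to a candidate activation with trial parameters is one simple way to search for solutions of (10).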

Theoretical comparison
Let us first have a look at the SNN based on SPOCU. We selected γ = 1 and c = 1, and we computed numerically α = 2.1959, β = 0.6641 from Eq. (10); the Jacobi matrix is J = (0.8603 −0.0098; −0.0269 0.1001) with ||J|| < 1. See Fig. 8 for the sigmoidal function s with this choice of parameters. Notice that for these parameters the conditions of Theorem 4.2 are not satisfied, which confirms that the theorem does not present a necessary condition. Nevertheless, we have been able to find one triple of parameters which yields an SNN derived from SPOCU. Notice that SNNs cannot be derived from, e.g., ReLU, sigmoid units, tanh units or leaky ReLU. This gives SPOCU an advantage over ReLU, but not necessarily over SELU. The same is true for another desired property: if the activation function is nonlinear, then a two-layer neural network can be proved to be a universal function approximator; see [4]. Gradient-based training methods tend to be more stable if the activation function has a finite range, which among the compared functions is true only for SPOCU. Further properties that give SPOCU an advantage over ReLU and SELU are continuous differentiability, which enables gradient-based optimization methods, and the fact that it can approximate the identity near the origin, i.e., s′(0) = (α/γ) r′(β) = 1; then the neural network learns efficiently when its weights are initialized with small random values, otherwise special care must be taken when initializing the weights; see [23]. In the SPOCU case, this is possible thanks to the additional free parameter. Monotonicity is the only property common to all three activation functions; thus, the error surface associated with a single-layer model is guaranteed to be convex; see [26]. Table 3 summarizes the comparison. As we can see, SPOCU has six out of seven desirable properties. In contrast, ReLU and SELU possess only two and three of the seven good properties, respectively.
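For concreteness, a minimal sketch of the activation s with these parameters (assuming, as in the paper's construction, s(x) = α h(x/γ + β) − α h(β), where h clips its argument to [0, c] before applying the percolation polynomial r(x) = x³(x⁵ − 2x⁴ + 2); the function names are ours):

```python
import numpy as np

ALPHA, BETA, GAMMA, C = 2.1959, 0.6641, 1.0, 1.0

def r(x):
    """Percolation polynomial from the r = 1 generator."""
    return x**3 * (x**5 - 2 * x**4 + 2)

def spocu(x):
    """SPOCU with finite c: s(x) = alpha*h(x/gamma + beta) - alpha*h(beta),
    where h(u) = r(clip(u, 0, C))."""
    u = np.clip(x / GAMMA + BETA, 0.0, C)
    return ALPHA * r(u) - ALPHA * r(BETA)
```

With these parameters s(0) = 0, s is nondecreasing, and it saturates for x ≥ (1 − β)γ and for x ≤ −βγ, giving the finite range noted above.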

Experimental comparison
Here, we show that SPOCU significantly outperforms the other two activation functions on a simple two-layer DNN model. We have used the source code from iris_dnn.R, [28]. For illustration, we use a small dataset, Edgar Anderson's Iris Data (iris), the well-known built-in dataset in stock R for machine learning. We built a two-layer DNN model and subsequently tested it. First, we transformed the data into the interval [−βγ, (1 − β)γ] by the mapping x ↦ γ(x − min x)/(max x − min x) − βγ, with the parameters given above, in order to capture the polynomial (sigmoidal) influence. Then the dataset was split into two parts, for training and testing, and the training set was used to train the model. See Fig. 9 for the results on the data loss in the training set and the accuracy in the test set compared to SELU and ReLU. For both criteria, SPOCU outperformed both SELU and ReLU.
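The rescaling step can be sketched as follows (our own helper; β and γ as in the previous subsection):

```python
import numpy as np

BETA, GAMMA = 0.6641, 1.0

def to_spocu_range(x):
    """Min-max rescale data into [-beta*gamma, (1 - beta)*gamma]."""
    x = np.asarray(x, dtype=float)
    scaled = (x - x.min()) / (x.max() - x.min())
    return GAMMA * scaled - BETA * GAMMA
```

The minimum of the data is mapped to −βγ and the maximum to (1 − β)γ, so the inputs land exactly in the polynomial (non-saturated) part of s.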

SPOCU with c = ∞
Here, we illustrate SPOCU with c = ∞; thus, we consider the activation function (11) for c = ∞. For such SPOCU the range is infinite, so the training is generally more efficient, because pattern presentations significantly affect most of the weights. The only property we lose here is monotonicity. We computed numerically α = 3.0937, β = 0.6653, γ = 4.437 from Eq. (10); moreover, the Jacobi matrix is (0.8331 −0.1169; 0.0874 0.5334). See Fig. 8 for the graph of the function S with these parameters, and Fig. 10 for the results on the data loss in the training set and the accuracy in the test set compared to SELU and ReLU. Here, the loss for SPOCU is uniformly better (moreover, it falls much faster) than the losses of both SELU and ReLU. The SPOCU accuracy is also uniformly better until 950 steps; afterward, it may be slightly worse. We also validated SPOCU on the MNIST database (Modified National Institute of Standards and Technology database, [12]), a large database of handwritten digits that is commonly used for training various image processing systems and is also widely used for training and testing in the field of machine learning. Each image in the dataset has dimensions of 28 × 28 pixels and contains a centered, grayscale digit. The model takes the image as input and outputs one of the ten possible digits (0 through 9). There are 70000 images in the data: 60000 training images and 10000 testing images. We normalized the inputs in order to better facilitate training of the network. We worked with the keras and tensorflow libraries (free and open-source software libraries for dataflow and differentiable programming). A sequential model is used, where each layer has exactly one input tensor and one output tensor. A 2D convolution layer (i.e., spatial convolution over images) was used; this layer creates a convolution kernel that is convolved with the layer input to produce a tensor of outputs.
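As an implementation note, the unbounded variant S described above can be sketched analogously to s, simply dropping the cap at c, so that h(u) = r(u) for u ≥ 0 (again our own illustrative code under the same assumed form of the activation):

```python
import numpy as np

ALPHA, BETA, GAMMA = 3.0937, 0.6653, 4.437

def r(x):
    """Percolation polynomial from the r = 1 generator."""
    return x**3 * (x**5 - 2 * x**4 + 2)

def spocu_inf(x):
    """SPOCU with c = infinity: S(x) = alpha*h(x/gamma + beta) - alpha*h(beta),
    with h(u) = r(u) for u >= 0 and h(u) = 0 otherwise."""
    u = np.maximum(x / GAMMA + BETA, 0.0)
    return ALPHA * r(u) - ALPHA * r(BETA)
```

As discussed above, S(0) = 0, the left tail is constant, the range is unbounded above, and S is not monotone: r decreases on part of (1, 2), so S dips there before growing without bound.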
A pooling layer MaxPooling2D followed by a regularization layer Dropout was also used. (Fig. 8: (a) activation function s; (b) activation function S.)

Application: cancer tissue discrimination
Developing algorithmic methods which can assist in cancer risk assessment is an important topic nowadays. Discrimination between mammary cancer and mastopathy tissues plays a crucial role in clinical practice; see [9]. Noninvasive techniques generally lead to inverse problems, e.g., estimating the Hausdorff fractal dimension from the boundary of the examined tissue; see [10]. The main problem here can be formulated as follows: ''How can cancer tissue be discriminated from healthy tissue?'' Here, we study benign prostatic hyperplasia (BPH) and normal prostate vs. prostate cancer (PC). We consider the standard coloring of images, by hematoxylin and eosin, and two magnifications of the images, namely 100× and 200×; see Fig. 13. Moreover, carcinoma of the breast and mastopathy are shown in Fig. 14.
Here, we have used simple reduced estimators expressing the average number of retained elements from Sect. 2.2: N̄ = (8p)^n for the standard SC and N̄ = (9p)^n for the modified one (9 elements instead of 8). This implies p̂ = N^{1/n}/8 and p̂ = N^{1/n}/9, respectively, where N is the number of 1's in the measured data matrix. Since we set the resolution of the figures to 729 × 729 and 729 = 3⁶, we have n = 6. In our case, the modified SC is suitable, since otherwise the data yield an estimate of p above 1. We had to use binarization, i.e., each dark pixel was converted to 1 and each light pixel to 0; if the picture was not black and white, it was converted according to the formula (R + G + B)/3 > 0.5. The results are given in Table 4. One can see a statistically significant difference between cancer and noncancer images, whether we take the probability parameter p or the fractal dimension dim_H. Moreover, we can also see that the package fractaldim cannot capture these differences. This makes the result very valuable.
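The estimation step can be sketched as follows (our own code, following the reconstruction above: p̂ = N^{1/n}/9 for the modified 9-element generator on a 3⁶ × 3⁶ binary image):

```python
import numpy as np

def estimate_p(binary_image, k_cells=9):
    """Estimate p from a 0/1 matrix of side 3^n via N_bar = (k_cells * p)^n,
    i.e., p_hat = N^(1/n) / k_cells, where N is the count of 1's."""
    side = binary_image.shape[0]
    n = int(round(np.log(side) / np.log(3)))   # side = 3^n
    N = binary_image.sum()
    return N ** (1.0 / n) / k_cells

# Fully dark 729x729 image: N = 3^12, so p_hat = 3^2 / 9 = 1.
p_hat = estimate_p(np.ones((729, 729), dtype=int))
```

For sparser images the estimate decreases, which is the quantity compared between cancer and noncancer tissue in Table 4.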
Notice that we cannot directly use more parameters without obtaining more information than the resulting matrix provides. This is quite interesting, since it appears that one parameter is not enough, i.e., the dimension of the problem is more than one. We have also estimated the values for the [pq] model based on the theory of [9] (notice that 0 and 1 have the opposite meaning in their work); the obtained results confirmed the same conclusion. We illustrate this for both CA and MA: we obtained the estimates p̂ = 0.24447 and q̂ = 0.11515 for CA, and p̂ = 0.1934 and q̂ = 0.09675 for MA. An alternative from the perspective of inter-patient variability can be multifractality (see [15]). The development of such techniques for the analysis of several slices from a 3D tissue body will be of interest for complicated cases from the National Institutes of Health (NIH) databases, like [18]; this will be a valuable future research direction. We deliberately focus only on the variables of the measured data related to fractal properties (the mean and the ''worst,'' i.e., largest (mean of the three largest values), of the fractal dimensions computed for each image: fdm and fdw), since cancer tissue discrimination is our ultimate goal. In Fig. 15, one can see the relation between fdm and fdw; this confirms that the clustering is by no means unambiguous. We built DNNs with keras with 3 hidden layers. The number of instances is 569, and we use 80% of the samples for training. In Fig. 16 and Table 5, we can see the results. SPOCU achieved the best results, almost 80% absolute accuracy, which corresponds to 96-99% performance with respect to the benchmark developed in [9].

Conclusion
We introduced the novel percolation-based activation function SPOCU, which is flexible and, in several important setups, outperformed the classical SELU and ReLU approaches. We successfully validated SPOCU on both large and small datasets.

Appendix A
... which yields the result. □

Proof of Theorem 2.4 The case when at least one p_j = 0 is trivial. Otherwise, it is sufficient to show that there exists n ∈ N such that Σ_{i=1}^n (2i − n − 1)P(i) ≥ 0; indeed, 2i − n − 1 ≥ 0 for every i ≥ ⌈(n+1)/2⌉. The second part follows directly from the relationship p ↦ ln(8p)/ln 3, which maps [1/8, 1] bijectively onto [0, ln 8/ln 3]. □

The next lemma is a generalization of the following mixed-product property: (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD), the proof of which is obvious.
Lemma ''Appendix A.6'' If A_i, B_i, i = 1, ..., r, are matrices of such sizes that one can form the matrix products A_iB_i, then (⊗_{i=1}^r A_i)(⊗_{i=1}^r B_i) = ⊗_{i=1}^r (A_iB_i) holds.
Proof of Theorem 2.5 The Frobenius norm of an m × n matrix A can be defined in various equivalent ways: ||A||_F² = Σ_{i,j} |a_{ij}|² = Σ_i σ_i² = tr(A*A), where σ_i are the singular values of A and A* denotes the conjugate transpose of A. The claim follows directly from the distributivity of this norm over the Kronecker product, the generalization (A.1) of the mixed-product property and the fact that the trace of a Kronecker product of square matrices is given by tr(A ⊗ B) = tr A · tr B. □

Table 6 Estimation of the probabilities P_all, P_col for given 1 − p and r, with 10⁵ repetitions (10⁴ for r > 5).