1 Introduction

Neural computing and activations in neural networks are multidisciplinary topics, ranging from neuroscience to theoretical statistical physics. One of the most important theoretical and practical questions in neural networks and neural computing is the choice of an appropriate activation function. The activation function and its nonlinearity form the core of neural networks, both deep and shallow, across various architectures. In standard problems with reasonable nonlinearity, the sigmoidal function is well justified; however, one faces the inverse problem of specifying the underlying parameters, and otherwise the network may perform slowly. Recently, several papers on constructing more adequate activation functions have appeared, mainly comparing activation functions on large datasets. To illustrate some important contributions, the influence of the activation function in the convolutional neural network (CNN) model is studied in [24], which improves ReLU activation by constructing a novel surrogate. A theoretical analysis of gradient instability, together with a fundamental explanation of exploding/vanishing gradients and the performance of different activation functions, is given in [13].

The main contribution of this paper is the construction and testing of the novel scaled polynomial constant unit (SPOCU) activation function. This activation function relates to complexity patterns through the phenomenon of percolation, and thus it can outperform previously introduced activation functions, e.g., SELU and ReLU. In statistical physics and mathematics, percolation theory describes the behavior of a network when nodes or links are removed, or when complex patterns are embedded in the learning process. SPOCU thus helps to fill a gap in the theory: it “picks up” the appropriate properties of an activation function directly from training classifiers on complex patterns, e.g., cancer images. Such efforts can contribute to many applications. One example is the generation of artificial networks applied in the field of Turing pattern generator networks, since they require a two-layer coupling (two “diffusion” mechanisms having different diffusion coefficients) of resistor couplings. Turing patterns are nonlinear phenomena which appear in reaction–diffusion systems based on the memristive cell. For Turing patterns in the simplest memristive cellular nonlinear networks (MCNNs), see [2], which proposes a new MCNN model consisting of a two-dimensional array of cells made of only two components: a linear passive capacitor and a nonlinear active memristor. In fact, cellular nonlinear networks are perfectly suited to the structure of reaction–diffusion systems, since they can be used to map partial differential equations [8]. MCNNs are intrinsically related to percolation; this relation is outlined in Sect. 3; for more, see, e.g., Chapter 18 in [6].

The paper is organized as follows. In Sect. 2.1, we introduce a random fractal construction useful for our purposes. We give an overview of the problem of generating a random variant of the Sierpiński carpet, related to the [pppq] model introduced by [9]. We emphasize the importance of the Kronecker product, which gives a simple and quick way to generate random fractals. We also present several useful results concerning fractal geometry and “average” fractal dimension. In Sect. 3, we deal with the percolation threshold, an important instrument for SPOCU. Using a direct approach and logistic regression, we derive estimates of this critical value for specific random models. In Sect. 4, we use the basic generator of the random Sierpiński carpet as a new activation function, SPOCU, for self-normalizing neural networks. We study general activation functions, with the main focus on SPOCU. As normalizing the output of layers is known to be a very efficient way to improve the performance of neural networks, we can conclude that the SPOCU activation function behaves very well. Further, we provide theoretical justification of the fact that SPOCU outperforms both SELU and ReLU in several desirable properties necessary for the correct classification of complex images. SPOCU also achieves uniformly better performance with respect to generic properties on the large MNIST database. This is well illustrated in the last Sect. 6 on image-based discrimination for cancer diagnostics. There we apply the developed methodology to cancer discrimination problems; namely, we consider image-based discrimination for mammary cancer versus mastopathy and benign prostatic hyperplasia or normal prostate versus prostate cancer. A comparison of SPOCU, ReLU and SELU on the Wisconsin Diagnostic Breast Cancer (WDBC) dataset justifies SPOCU's qualities. Technicalities and proofs are given in "Appendix A."

2 Random fractals, Kronecker product and fractal dimension

Many fractal constructions have random analogues; see Chapter 15 in [6]. Here, we start with a natural motivation of the random Sierpiński carpet (SC). We introduce a simple matrix computation expressing the form of fractals, including the random SC.

2.1 Random Sierpiński carpet

The Sierpiński carpet (SC) is the fractal which can be constructed by taking the square \([0,1]^2\), dividing it into nine equal squares of side length 1/3 and removing the central square. This procedure is then repeated for each of the eight remaining squares and iterated infinitely many times. The carpet is the resulting fractal and has Hausdorff dimension \(d_f=\ln 8/\ln 3\). We can equivalently construct this fractal by string rewriting, beginning with a cell 1 and iterating the rules

$$\begin{aligned} \left\{ 0\rightarrow \begin{bmatrix}0&{}0&{}0\\ 0&{}0&{}0\\ 0&{}0&{}0\end{bmatrix}, ~1\rightarrow \begin{bmatrix}1&{}1&{}1\\ 1&{}0&{}1\\ 1&{}1&{}1\end{bmatrix} \right\} . \end{aligned}$$

Here, cell 0 indicates the absence of the generated square and cell 1 indicates its presence. Its random variation (random SC) can be defined by iterating the rules

$$\begin{aligned} \left\{ 0\rightarrow \begin{bmatrix}0&{}0&{}0\\ 0&{}0&{}0\\ 0&{}0&{}0\end{bmatrix}, ~1\rightarrow \begin{bmatrix}\mathcal {B}_{1,p_n}&{}\mathcal {B}_{2,p_n}&{}\mathcal {B}_{3,p_n}\\ \mathcal {B}_{4,p_n}&{}0&{}\mathcal {B}_{5,p_n}\\ \mathcal {B}_{6,p_n}&{}\mathcal {B}_{7,p_n}&{}\mathcal {B}_{8,p_n}\end{bmatrix} \right\} \end{aligned}$$

beginning again with a cell 1, where \(\mathcal {B}_{i,p_n}, i=1,\dots ,8\), are mutually independent random variables generated in the nth iteration from the uniformly distributed random variable \(U_{i,n}\) and a prescribed value \(p_n\in [0,1]\) as

$$\begin{aligned} \mathcal {B}_{i,p_n}:={\left\{ \begin{array}{ll} 1, &{}\text {if}\,~U_{i,n}\le p_n,\\ 0, &{}\text {otherwise}, \end{array}\right. } \end{aligned}$$

so that \(\mathbb {P}(\mathcal {B}_{i,p_n}=1)=p_n\). Thus, the random numbers result from a Bernoulli distribution via the inversion method. Notice that the [pppq] model of [9] is included in this construction. The notation naturally generalizes to the \([p\dots pq]\) model and also the \([p_1\dots p_{n-1}p_n]\) model.

Remark 2.1

Notice also that in [5, Example 1.1] the authors consider a generalization of Mandelbrot's percolation process, but their approach differs from ours (depending on the set of indices generating zeros and ones): with probability p they generate the zero \(3\times 3\) matrix, and with probability \(1-p\) the matrix with a zero element only in the middle.

2.2 Fractals induced by Kronecker product

The Kronecker product (also called the direct matrix product) is a special case of the tensor product. It is denoted by \(\otimes\) and is an operation on two matrices of arbitrary sizes resulting in a block matrix; see "Appendix A." The multiple Kronecker product is defined recursively as \(\bigotimes _{i=1}^rA_i:=\left( \bigotimes _{i=1}^{r-1}A_i\right) \otimes A_r\), where \(\bigotimes _{i=1}^1A_i=A_1\). Some fractals, see examples in "Appendix A.2," can be represented by multiple Kronecker products [20, chap. 9]. In [19], two methods for generating images of (approximations to) fractals and fractal-like sets are given: iterated Kronecker products and iterated matrix-valued homomorphisms. For the random alternative, we use the notation

$$\begin{aligned} \mathbf{M}_r^{p_1,\dots ,p_r}:=\bigotimes\limits_{i=1}^r \mathbf{X}_{p_i}, \end{aligned}$$

where r is a natural number, and

$$\begin{aligned} \mathbf{M}_r^{p_1,\dots ,p_\infty }:=\bigotimes\limits_{i=1}^\infty \mathbf{X}_{p_i}, \end{aligned}$$

for the infinite case, where \(\mathbf{X}_{p_i}\) denotes the matrix with entries generated by \(\mathcal {B}_{\cdot ,p_i}\). (This can be naturally generalized to a suitable random variable.) Obviously, \(\mathbf{M}_r^{p_1,\dots ,p_r}\) can be understood as an approximation of \(\mathbf{M}_r^{p_1,\dots ,p_\infty }\). From a dynamical point of view, the random fractal can be understood as follows:

$$\begin{aligned} \mathbf{Y}_0&:= 1 \end{aligned}$$
$$\begin{aligned} \mathbf{Y}_{n}&:= {} \mathbf{Y}_{n-1}\otimes \mathbf{X}_{p_{n}}, ~n\ge 1. \end{aligned}$$

Since we assume mutual independence, we have

$$\begin{aligned} \mathbb {E}\left[ \mathbf{Y}_{n}\right] =\mathbb {E}\left[ \mathbf{Y}_{n-1}\otimes \mathbf{X}_{p_{n}}\right] = \mathbb {E}\left[ \mathbf{Y}_{n-1}\right] \otimes \mathbb {E}\left[ \mathbf{X}_{p_{n}}\right] =\bigotimes _{j=1}^n\mathbb {E}\left[ \mathbf{X}_{p_{j}}\right] ; \end{aligned}$$

see [7]. We also define matrix \(\mathbf{M}_r^p:=\mathbf{M}_r^{p,\dots ,p}\), i.e., \(\mathbf{M}_r^p=\bigotimes _{i=1}^r \mathbf{X}_{p_i}, ~p_i=p\) for all \(i\in \{1,\dots ,r\}\).

The Lemma in "Appendix A.5" gives us the mean number of retained elements; recall that in the deterministic case the SC has exactly \(2^{3n}\) elements in the nth iteration. Therefore, with \(\tilde{p}_n=\prod _{j=1}^n p_j\), the average number of retained elements is \(N_n=2^{3n} \tilde{p}_n\), which reduces to \((8p)^n\) for the case \(p_j=p\). From this, we see the balance threshold \(p_b=\frac{1}{8}\): for \(p<p_b\), the average number of retained elements converges to zero. (A similar argument can be made for the modified SC with 9 random elements, where in the results \(2^{3n}\) has to be replaced by \(3^{3n}\).)
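The recursive Kronecker construction above takes only a few lines of code. The sketch below (our NumPy illustration with hypothetical helper names, not the paper's implementation) generates \(\mathbf{Y}_n\) by iterated Kronecker products of random generator matrices, keeping each off-centre cell with probability p, and checks that \(p=1\) recovers the deterministic SC with \(8^n\) retained cells.

```python
import numpy as np

def random_sc_step(rng, p):
    """One generator matrix X_p: the 3x3 SC template whose eight
    off-centre cells are kept independently with probability p."""
    X = (rng.random((3, 3)) <= p).astype(int)
    X[1, 1] = 0  # the centre cell is always removed
    return X

def random_sc(rng, probs):
    """Y_n = Y_{n-1} (x) X_{p_n}, starting from Y_0 = 1."""
    Y = np.array([[1]])
    for p in probs:
        Y = np.kron(Y, random_sc_step(rng, p))
    return Y

rng = np.random.default_rng(0)
Y = random_sc(rng, [0.9, 0.9, 0.9])   # [ppp] model, n = 3
print(Y.shape)                        # (27, 27)

# For p = 1 the construction is deterministic and yields the usual SC
# with 8^n retained cells.
S = random_sc(rng, [1.0, 1.0, 1.0])
print(int(S.sum()))                   # 512 == 8**3
```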

Fig. 1 Random SC for \([p_1p_2p_3p_4]\) model

2.3 Geometry and dimension

Now we describe the model geometrically, similarly to [3]. Let \(A_0=[0,1]^2\), and for \(1\le i,j\le 3\), let \(B_{ij}=\left[ \frac{i-1}{3},\frac{i}{3}\right] \times \left[ \frac{j-1}{3},\frac{j}{3}\right]\). Moreover, we let

$$\begin{aligned} A_1=\bigcup _{\begin{array}{c} i,j\\ Y^{ij}_{1}=1 \end{array}} B_{ij}. \end{aligned}$$

To define \(A_2\), we repeat the last construction. More generally, we let \(B^n_{ij}=\left[ \frac{i-1}{3^n}, \frac{i}{3^n}\right] \times \left[ \frac{j-1}{3^n}, \frac{j}{3^n}\right]\) for \(1\le i,j\le 3^n\) and

$$\begin{aligned} A_n=\bigcup _{\begin{array}{c} i,j\\ Y^{ij}_{n}=1 \end{array}} B^n_{ij}. \end{aligned}$$

Clearly, \(\{A_k\}_{k\in \mathbb {N}}\) is a decreasing sequence of compact sets, so the limit \(A_\infty =\bigcap _{n\in \mathbb {N}} A_n\) exists. The next theorem follows directly from the average number of retained elements \(N_n\) and the fact that we obtain a branching process in which each particle has on average 8p offspring.

Theorem 2.2

If \(p_j\) does not change in time (\(p_j=p\)), then \(A_\infty \ne \emptyset\) with positive probability if and only if \(p>\frac{1}{8}\).

This is closely related to the estimation of fractal dimension; see [27]. Here, we deal with the “average” fractal dimension, or fractal dimension in mean. We already have \(N_n\), the average count of retained elements in \(A_n\), and we denote by \(s_n\) the scale in the nth step, i.e., in our case \(s_n=3^n\). Using least squares fitting on the data \(\{\ln s_i,\ln N_i\}\), we obtain the slope

$$\begin{aligned} \mathrm{sl}_{n} = \frac{n\sum _{i=1}^{n} \ln s_{i}\,\ln N_{i} - \sum _{i=1}^{n}\ln s_{i}\sum _{i=1}^{n}\ln N_{i}}{n\sum _{i=1}^{n}\ln ^{2} s_{i} - \left( \sum _{i=1}^{n}\ln s_{i}\right) ^{2}}, \end{aligned}$$

where \(\max \{0,\mathrm{sl}_n\}\) is the nth approximation of the fractal dimension.
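As a quick sanity check (our sketch, not from the paper), the slope can be obtained by an ordinary least-squares fit; for the constant case \(N_i=(8p)^i\) and \(s_i=3^i\) the data \(\{\ln s_i,\ln N_i\}\) are exactly linear, so the slope equals \(\ln (8p)/\ln 3\).

```python
import numpy as np

p, n = 0.5, 10
s = 3.0 ** np.arange(1, n + 1)        # scales s_i = 3^i
N = (8 * p) ** np.arange(1, n + 1)    # average retained counts N_i = (8p)^i

# slope of the least-squares line through (ln s_i, ln N_i)
sl, _ = np.polyfit(np.log(s), np.log(N), 1)
dim_estimate = max(0.0, sl)

print(dim_estimate)                   # ≈ ln(4)/ln(3) ≈ 1.2619
```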

Theorem 2.3

If \(A_\infty \ne \emptyset\), then

$$\begin{aligned} \mathrm {dim}_B(A_\infty )=\mathrm {dim}_H(A_\infty )=\max \left\{ 0,\frac{\ln 8}{\ln 3}+\frac{6}{\ln 3}\lim _{n\rightarrow \infty }\frac{\displaystyle \sum _{i=1}^n \left( (2i-n-1)\sum _{j=1}^i\ln p_j\right) }{n(n-1)(n+1)}\right\} . \end{aligned}$$

For the constant case \(p_j=p\), we have \(\sum _{j=1}^i\ln p_j=i\ln p\), and thus \(\mathrm {dim}_B(A_\infty )=\mathrm {dim}_H(A_\infty )=\max \left\{ \frac{\ln (8p)}{\ln 3},0\right\}\); see also Fig. 2. Naturally, for \(p=1\) this coincides with the standard (deterministic) Sierpiński carpet. See Fig. 1, where the results of the \([p_1p_2p_3p_4]\) model are plotted. Now denote by \(\mathbf{p}\) the sequence (vector) of probabilities \(\{p_j\}_{j\in \mathbb {N}}\). Notice that it need not be a probability vector, since we do not require \(\sum _{j=1}^\infty p_j=1\). Consider now the case of "Appendix A.4"; then

$$\begin{aligned} \mathrm {dim}_H(A_\infty )=\max \left\{ \frac{\ln (8\sqrt{PQ})}{\ln 3},0\right\} . \end{aligned}$$

Fig. 2 plots the condition \(PQ>\frac{1}{64}\), i.e., a hyperbola, which determines whether \(A_\infty\) has positive fractal dimension. This means that if P is small, Q can “save” the situation.
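The dimension formula for this two-parameter case is a one-liner; the sketch below (our illustration, with a hypothetical function name) confirms that \(PQ=\frac{1}{64}\) is exactly the boundary of positive dimension.

```python
import math

def dim_alternating(P, Q):
    """Hausdorff dimension of A_inf for the alternating-probability model:
    max(ln(8*sqrt(P*Q))/ln 3, 0)."""
    return max(math.log(8 * math.sqrt(P * Q)) / math.log(3), 0.0)

print(dim_alternating(1 / 64, 1.0))   # PQ = 1/64: dimension exactly 0
print(dim_alternating(0.5, 0.5))      # = ln(4)/ln(3) ≈ 1.2619
```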

Fig. 2 Fractal dimension of \(A_\infty\)

Theorem 2.4

For any vector of probabilities \(\mathbf{p}\)

$$\begin{aligned} 0\le \mathrm {dim}_B(A_\infty )\le d_f=\frac{\ln 8}{\ln 3}. \end{aligned}$$

Moreover, every value from this interval is attained by some vector of probabilities \(\mathbf{p}\).

Examples in "Appendix A.3" show that a constant vector and a nonconstant vector may lead to a similar fractal dimension. Now we introduce an assertion related to the expected matrix.

Theorem 2.5

$$\begin{aligned} \left\| \mathbf{M}_r^{p_1,\dots ,p_r}\right\| _F=\sqrt{\prod _{i=1}^r{\text {tr}}\left( \mathbf{X}^*_{p_i}{} \mathbf{X}_{p_i}\right) }. \end{aligned}$$

For the specific case of \(\mathcal {B}_{j,p_i}\) and under the assumption of independence, we have (using \(\mathcal {B}_{j,p_i}^2=\mathcal {B}_{j,p_i}\) for Bernoulli variables)

$$\begin{aligned} Z=\left\| \mathbf{M}_r^{p_1,\dots ,p_r}\right\| _F^2= \prod _{i=1}^r\sum _{j=1}^8\mathcal {B}_{j,p_i}^2= \prod _{i=1}^r\sum _{j=1}^8\mathcal {B}_{j,p_i} \sim \prod _{i=1}^r\mathrm {Bin}(8,p_i). \end{aligned}$$

Therefore, \(\mathbb {E}[Z]=\prod _{i=1}^r8p_i\); in particular, it is \((8p)^r\) for \(p_i=p\) (which means that for small \(p_i\) the value of Z is close to zero with high probability). We have estimated the probability \(\mathcal {P}_{all}\) of the event that the matrix \(\mathbf{M}^p_r\) is the zero matrix (empty) and the probability \(\mathcal {P}_{col}\) of the event that at least one column of the matrix \(\mathbf{M}^p_r\) is zero. The probability \(\mathcal {P}_{all}\), i.e., that Z is zero, is given by the inclusion–exclusion principle. Both are specific percolations, especially the latter, since we can reach the bottom. The results are summarized in Table 6.
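Theorem 2.5 is easy to verify numerically. The sketch below (our illustration, not the paper's code) checks the multiplicativity of the Frobenius norm under the Kronecker product for two fixed 0–1 generator-like matrices.

```python
import numpy as np

# two deterministic 3x3 generator-like 0/1 matrices (centre removed)
A = np.array([[1, 1, 1], [1, 0, 1], [1, 1, 1]])
B = np.array([[1, 0, 1], [1, 0, 1], [0, 1, 1]])

lhs = np.linalg.norm(np.kron(A, B))                   # ||A (x) B||_F
rhs = np.sqrt(np.trace(A.T @ A) * np.trace(B.T @ B))  # sqrt(tr(A*A) tr(B*B))
print(lhs, rhs)   # both equal sqrt(8 * 6) ≈ 6.9282
```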

3 Percolation threshold

Percolation itself is a physically well-motivated phenomenon, and it is intrinsically related to memristor models (see [22]); thus, it can be well involved in neural modeling of memristive cellular nonlinear networks (MCNNs). The combination of percolation theory and Monte Carlo simulation provides a possible way to model ion migration and electron transport in an amorphous system, even from the hardware perspective. The best-known mathematical percolation model takes some regular lattice, e.g., a square lattice \(N\times N\), and makes it into a random network by randomly “occupying” sites (vertices) or bonds (edges) with statistically independent random variables. For example, each of the lattice sites is either occupied (with probability p) or vacant (with probability \(1-p\)). This is commonly known as the site percolation problem. There is also a related bond percolation problem, which can be posed in terms of whether or not the edges between neighboring sites are open or closed. It is well known that there is a narrow range of p over which the probability of percolation changes rapidly from zero to one. It is centered on a critical value \(p_c\) called the percolation threshold. Percolation clusters become self-similar precisely at the threshold density \(p_c\) for sufficiently large length scales, entailing the asymptotic power law \(M(L) \sim L^{d_\text {f}}\) at \(p=p_{c}\) for large probe sizes \(L\rightarrow \infty\); i.e., the fractal dimension \(d_\text {f}\) (e.g., the Hausdorff dimension \(\dim _H\)) describes the scaling of the mass M of a critical cluster within a distance L (it characterizes how the mass M(L) changes with the linear size L of the system); see [21]. This follows from the following idea: if we consider a smaller part of a system of linear size \(bL ~(b < 1)\), then M(bL) is decreased by a factor of \(b^{d_\text {f}}\), i.e., \(M(bL) = b^{d_\text {f}} M(L)\).
For example, for the Sierpiński gasket we have the functional equation \(M(L/2)=M(L)/3\), yielding the solution \(M(L)=AL^{d_\text {f}}\) with \(d_\text {f}=\ln 3/\ln 2.\)
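As a one-line check (ours, not the paper's), \(M(L)=AL^{d_\text {f}}\) with \(d_\text {f}=\ln 3/\ln 2\) indeed satisfies the functional equation \(M(L/2)=M(L)/3\), since \(2^{d_\text {f}}=3\):

```python
import math

df = math.log(3) / math.log(2)   # fractal dimension of the Sierpinski gasket
M = lambda L, A=1.0: A * L ** df

for L in (1.0, 2.0, 10.0):
    # M(L/2) equals M(L)/3 because (1/2)**df = 1/3 exactly
    print(abs(M(L / 2) - M(L) / 3) < 1e-12)   # True
```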

Depending on the method for obtaining the random network, one usually distinguishes between the site percolation threshold and the bond percolation threshold. More general systems may have several probabilities \(p_1\), \(p_2\), etc., and the transition is characterized by a critical surface or manifold. In the classical systems, it is assumed that the occupation of a site or bond is completely random; this is the so-called Bernoulli percolation. Here, we want to emphasize that the random SC does not fall under these models. In 1974, Mandelbrot [14] introduced a process in \([0, 1]^2\) which he called “canonical curdling.” It is nothing else than the model from Remark "Appendix A.1" with \(\mathcal {B}_{p_{ij,n}}=\mathcal {B}_{p}.\) In [3], the authors study the connectivity or “percolation” properties of such sets. They showed that there is a probability \(p_c\in (0, 1)\) such that if \(p <p_c\) then the set is “dustlike,” whereas if \(p\ge p_c\) opposing sides are connected with positive probability. To be precise, we introduce here the exact definition of \(p_c\). Let

$$\begin{aligned} B_n=\{x\in A_n: x ~\text {can be connected to} ~[0,1]\times \{0\} ~\text {and} ~[0,1]\times \{1\} ~\text {by paths in} ~A_n\}, \end{aligned}$$

\(B_\infty =\bigcap _{n\in \mathbb {N}} B_n\), and let \(\Omega =\{B_\infty \ne \emptyset \}\). Notice that when \(\Omega\) occurs, there is an up-to-down crossing of \([0, 1]^2\). Finally, \(p_c=\inf \{p:\mathbb {P}(\Omega )>0\}\). For example, from [3] we have \(p_c<0.9999\) for Mandelbrot percolation. Here, we have to emphasize that for an approximation of \(A_\infty\) we cannot use this kind of definition. We simply set \(p_c\) as the value of p at which the probability \(\mathbb {P}(B_n=\emptyset )\) crosses \(\frac{1}{2}\).

3.1 Estimation of a single parameter p

Here, we consider the \([p\dots p]\) model. We have used the flow through the generated lattice (represented by the matrix) and a recursive depth-first search (checking whether or not the flow makes it to the bottom of the grid). The graphs of the simulated data possess a sigmoidal shape, which signals the presence of a threshold. Moreover, we have a categorical dependent variable represented by the outcomes pass/fail. These facts indicate that, in order to determine the threshold, we should fit a logistic (log-odds) model

$$\begin{aligned} {\displaystyle \ln \left( {\frac{p}{1-p}}\right) =\beta _{0}+\beta _{1}x,} \end{aligned}$$

to the simulated data; the threshold value for p (or \(1-p\)) is estimated from the fact that \(p=\frac{1}{2}\) implies \(0=\beta _{0}+\beta _{1}x\), and therefore the estimate of \(p_c\) equals \(-\frac{\beta _{0}}{\beta _{1}}\), i.e., minus the quotient of the intercept and the regression coefficient. Notice that the authors in [1] similarly used a logistic function to find a threshold when examining the effects of mixed dispersal strategies on the spatial structure of a population in a spatially explicit birth–death model. A discussion of how different sigmoidal models can be applied to predict the percolation threshold of electrical conductivity for ethylene vinyl acetate (EVA) copolymer and acrylonitrile butadiene (NBR) copolymer conducting composite systems filled with different carbon fillers is given in [17]. On the other hand, an experiment using the phenomenon of percolation has been conducted to demonstrate the implementation of neural functionality in [16], where the curve was found to be almost exactly described by the sigmoid form.

In Table 1, we present the results for specific r; see also Figs. 3 and 4, where simulations of the percolation probability for given p are shown. The logistic model seems to fit the curve very well. However, for \(r=1\) (the important generator case) we can explicitly find the analytic form of the model function. By going through all the possibilities, we can directly find the polynomial

$$\begin{aligned} \left( 1-p \right) ^{8}+8\, \left( 1-p \right) ^{7}p+20\, \left( 1-p \right) ^{6}{p}^{2}+20\,{p}^{3} \left( 1-p \right) ^{5}+10\,{p}^{4} \left( 1-p \right) ^{4}+2\,{p}^{5} \left( 1-p \right) ^{3}, \end{aligned}$$

which can be simplified to

$$\begin{aligned} \mathcal {P}(p)=(1-p)^3\left[ (1-p)^5-2(1-p)^4+2\right]. \end{aligned}$$

The threshold value, where \(\mathcal {P}(p)=\frac{1}{2}\), is approximately 0.341. See Fig. 3 for the excellent fit, and notice that the logit model yields the very close value 0.3498.
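The threshold can be recovered numerically from the simplified polynomial; a sketch using plain bisection (no external solver assumed):

```python
def P_fail(p):
    """Probability that the r = 1 generator does not percolate."""
    q = 1.0 - p
    return q ** 3 * (q ** 5 - 2.0 * q ** 4 + 2.0)

# bisection for P_fail(p) = 1/2 on [0, 1]; P_fail decreases from 1 to 0
lo, hi = 0.0, 1.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if P_fail(mid) > 0.5:
        lo = mid
    else:
        hi = mid

root = 0.5 * (lo + hi)
print(round(root, 3))   # 0.341
```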

Table 1 Estimation of the percolation threshold \(p_c\) for given r

Fig. 3 Percolation probability fitted to simulated data, \(r=1\)

Fig. 4 Percolation probability as a function of the parameter p

3.2 Estimation of two parameters p and q

Fig. 5 Percolation probability for the [pq] model

Fig. 6 Percolation probability for the [ppq] model

Fig. 7 Relationship between p and q for logit models (6) and (7), respectively

Here, we assume two parameters p and q; e.g., we consider the [pq] or [pppq] model. A binomial logistic model

$$\begin{aligned} {\displaystyle \ln \left( {\frac{p}{1-p}}\right) =\beta _{0}+\beta _{1}x+\beta _{2}y} \end{aligned}$$

is again fitted to the simulated data, and the threshold values for p (or \(1-p\)) and q are estimated from the fact that \(p=\frac{1}{2}\) implies the relationship \(0=\beta _{0}+\beta _{1}p_c+\beta _{2}q_c\); therefore, the estimates of \(p_c\) and \(q_c\) are given by (part of) this line. This is obviously a different kind of result, since we have infinitely many solutions unless some suitable constraint \(p=f(q)\) (such that the intersection of the constraint curve and this line is nonempty) is given. Notice that even for a given constraint it can happen that more than one solution is obtained. If we, for example, assume that \(f=\mathrm {id}\), i.e., \(p=q\), we recover the one-parameter estimation model from the previous section. Table 2 presents the results for the constraint \(p=2q\); see also Figs. 5 and 6. For example, for the [pq] model we obtained \(b_0=6.621, ~b_1=-5.725, ~b_2=-5.455\); for the [ppq] model we obtained \(b_0=9.053 , ~b_1=-8.188, ~b_2=-5.492\); and for the [pqq] model we obtained \(b_0= 9.159, ~b_1=-5.798, ~b_2= -7.970\). Here, we have to emphasize the difference between [ppq] and [pqq]: for the latter, the constraint \(p=2q\) is not suitable, since no solution exists.
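Given the fitted coefficients, the constrained threshold is a one-line computation; e.g., for the [pq] model with the constraint \(p=2q\) (our arithmetic from the coefficients quoted above):

```python
# fitted logit coefficients for the [pq] model (values from the text)
b0, b1, b2 = 6.621, -5.725, -5.455

# threshold line b0 + b1*p + b2*q = 0 intersected with p = 2q:
# b0 + (2*b1 + b2)*q = 0
q_c = -b0 / (2 * b1 + b2)
p_c = 2 * q_c
print(round(p_c, 3), round(q_c, 3))   # 0.783 0.392
```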

Notice, however, that if we add a mixed term into regression

$$\begin{aligned} {\displaystyle \ln \left( {\frac{p}{1-p}}\right) =\beta _{0}+\beta _{1}x+\beta _{2}y+\beta _{3}xy,} \end{aligned}$$

then, e.g., for the [pqq] model we obtained \(b_0= 4.827, ~b_1=1.241, ~b_2= -1.734\) and \(b_3=-10.745\), which yields a hyperbolic curve. For a better graphical illustration of the difference (which is not obvious from the standard point of view), see Fig. 7. From a qualitative point of view, there is a difference, e.g., nonexistence of a solution in the mixed case even if a solution exists in the linear case.

Table 2 Estimation of the percolation thresholds \(p_c\) and \(q_c\) for a given model with the constraint \(p=2q\)

4 Self-normalizing neural network

Self-normalizing neural networks (SNNs) are expected to be robust to perturbations and not to have high variance in their training errors. SNNs push neuron activations toward zero mean and unit variance, leading to the same effect as batch normalization, which makes it possible to learn many layers robustly. Here, we introduce our SNNs based on the percolation function (5).

For a neural network with activation function Ac, we consider two consecutive layers connected by a weight matrix \(\mathbf{W}\). We assume that all activations \(x_i\) of the lower layer have the same mean \(\mathbb {E}[x_i]=\mu\) and variance \(\mathbb {V}[x_i]=\sigma ^2\) and are mutually independent. A single activation \(y=f(z)\) in the higher layer has network input \(z=\mathbf{w}^T\mathbf{x}\), mean \(\mathbb {E}[y]=\tilde{\mu }\) and variance \(\mathbb {V}[y]=\tilde{\sigma }^2\). From this, we obtain \(\mathbb {E}[z]=\sum _{i=1}^n w_i\mathbb {E}[x_i]=\mu \,\sum _{i=1}^n w_i:=\mu \,\omega\) and \(\mathbb {V}[z]=\sum _{i=1}^n w_i^2\mathbb {V}[x_i]=\sigma ^2\,\sum _{i=1}^n w_i^2:=\sigma ^2\,\tau ^2.\) The central limit theorem implies, under regularity conditions, that \(z\sim \mathcal {N}(\mu \,\omega ,\sigma ^2\tau ^2)\), i.e., the pdf of z has the form \({\displaystyle f(z)={\frac{1}{\sqrt{2\pi \sigma ^{2}\tau ^2}}}\mathrm {e}^{-{\frac{(z-\mu \,\omega )^{2}}{2\sigma ^{2}\tau ^2}}}}\). Consider now the vector mapping \(\mathbf{g}\) that maps the mean and variance of the activations in one layer to the mean and variance of the activations in the next layer, i.e., \(g_1(\mu ,\sigma ^2)=\tilde{\mu }\) and \(g_2(\mu ,\sigma ^2)=\tilde{\sigma }^2\). The following definition recalls the notion of a self-normalizing neural network.

Definition 4.1

(Self-normalizing neural net) ([11]) We say that a neural network is self-normalizing if it possesses a mapping \(\mathbf{g} : \Omega \rightarrow \Omega\) for each activation y that maps the mean and variance from one layer to the next and has a stable and attracting fixed point, depending on \(\omega\) and \(\tau ^2\), in \(\Omega :=[\mu _{min},\mu _{max}]\times [\sigma ^2_{min},\sigma ^2_{max}]\). Furthermore, \(\mathbf{g}(\Omega )\subseteq \Omega\). When the mapping \(\mathbf{g}\) is applied iteratively, each point within \(\Omega\) converges to this fixed point.

For arbitrary activation function Ac(z), the mapping \(\mathbf{g}\) is given by the relations

$$\begin{aligned} \tilde{\mu }(\mu ,\omega ,\sigma ^2,\tau ^2,\mathbf{a})=\mathbb {E}_f[Ac(z;\mathbf{a})] \end{aligned}$$


$$\begin{aligned} \tilde{\sigma }^2(\mu ,\omega ,\sigma ^2,\tau ^2,\mathbf{a})=\mathbb {E}_f[Ac^2(z;\mathbf{a})]-\tilde{\mu }^2. \end{aligned}$$

Obviously, the moments are given by the integrals \(\mathbb {E}_f[Ac(z;\mathbf{a})]=\int _\mathbb {R}f(z)\,Ac(z;\mathbf{a})\,\mathrm {d}z\) and \(\mathbb {E}_f[Ac^2(z;\mathbf{a})]=\int _\mathbb {R}f(z)\,Ac^2(z;\mathbf{a})\,\mathrm {d}z\). [11] proposed \(\omega = 0\) and \(\tau ^2 = 1\) for all units in the higher layer for the weight initialization. If the Jacobian of \(\mathbf{g}\) has a norm smaller than 1 at the fixed point, then \(\mathbf{g}\) is a contraction mapping and the fixed point is stable. The goal is to find parameters \(\mathbf{a}\in A\) such that the fixed point [0, 1] exists and \(||\mathbf {J} _{\mathbf {g}}[0,1]||<1.\) This problem is formulated as: find \(\mathbf{a}\in A\) such that

$$\begin{aligned} 0 & = {} \mathbb {E}_{f(z;0,1)}[Ac(z;\mathbf{a})],\nonumber \\ 1 & = {} \mathbb {E}_{f(z;0,1)}[Ac^2(z;\mathbf{a})], \end{aligned}$$

and \(||\mathbf {J}||<1\), where (we omit the parameters for the sake of short notation)

$$\begin{aligned} J_{11} & = {} \mathbb {E}_f[z\,Ac(z)], \\ J_{12} & = {} \mathbb {E}_f\left[ \frac{z^2-1}{2}\,Ac(z)\right] , \\ J_{21} & = {} \mathbb {E}_f[z\,Ac^2(z)]-2\mathbb {E}_{f}[Ac(z)]\mathbb {E}_{f}[z\,Ac(z)], \\ J_{22} & = {} \mathbb {E}_f\left[ \frac{z^2-1}{2}\,Ac^2(z)\right] -2\mathbb {E}_{f}[Ac(z)]\mathbb {E}_{f}\left[ \frac{z^2-1}{2}\,Ac(z)\right] , \end{aligned}$$

which can be simplified, using \(\mathbb {E}_{f}[Ac(z)]=0\) at the fixed point, to

$$\begin{aligned} \mathcal {J} = \begin{bmatrix} \mathbb {E}_f[z\,Ac(z)] &{} \mathbb {E}_f\left[ \frac{z^2-1}{2}\,Ac(z)\right] \\ \mathbb {E}_f[z\,Ac^2(z)]&{} \mathbb {E}_f\left[ \frac{z^2-1}{2}\,Ac^2(z)\right] \\ \end{bmatrix}. \end{aligned}$$
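Eq. (10) can be verified numerically for a known case; e.g., for SELU with its standard constants (\(\lambda \approx 1.0507\), \(\alpha \approx 1.6733\), the values from the self-normalizing networks literature, not derived here), a sketch using a fine midpoint rule for the Gaussian integrals:

```python
import numpy as np

# standard SELU constants (from the self-normalizing networks literature)
lam, alp = 1.0507009873554805, 1.6732632423543772

def selu(z):
    return lam * np.where(z > 0, z, alp * (np.exp(z) - 1.0))

# midpoint rule for E_f[g(z)] with f the standard normal pdf
h = 1e-4
z = np.arange(-12.0, 12.0, h) + h / 2
phi = np.exp(-z ** 2 / 2) / np.sqrt(2 * np.pi)

def E(g):
    return np.sum(g * phi) * h

m1 = E(selu(z))          # fixed-point condition: mean 0
m2 = E(selu(z) ** 2)     # fixed-point condition: second moment 1
print(abs(m1) < 1e-4, abs(m2 - 1.0) < 1e-4)   # True True
```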

Thus, we can formulate the following problem: for a given activation function Ac, find \(\mathbf{a}\in A\) such that Eq. (10) and \(||\mathcal {J}||<1\) hold. The next theorem follows directly from the Hölder inequality for \(p=1\) and \(q=\infty .\) It gives a sufficient condition for a stable and attracting fixed point [0, 1].

Theorem 4.2

If the parameters \(\mathbf{a}\) of the activation function Ac(x) are such that Eq. (10) is satisfied and

$$\begin{aligned} \sup _{x\in \mathbb {R}}\{|Ac(x)|,Ac^2(x)\}<\sqrt{\frac{\pi \mathrm {e}}{2}}\frac{1}{1+\sqrt{\mathrm {e}}}\approx 0.7801, \end{aligned}$$

and

$$\begin{aligned} \sup _{x\in \mathbb {R}}\{|Ac(x)|\}+\sup _{x\in \mathbb {R}}\{Ac^2(x)\}<\sqrt{\frac{\pi }{2}}\approx 1.2533, \end{aligned}$$

then the mapping \(\mathbf{g}\) has a stable and attracting fixed point [0, 1].

5 Scaled polynomial constant unit (SPOCU) activation function

Here, we define the “scaled polynomial constant unit” (SPOCU), a novel, well-motivated activation function, and study its properties. The SPOCU activation function is given by

$$\begin{aligned} s(x)=\alpha \,h\left( \frac{x}{\gamma }+\beta \right) -\alpha \,h(\beta ) \end{aligned}$$

where \(\beta \in (0,1), ~\alpha , ~\gamma >0\) and

$$\begin{aligned} h(x)={\left\{ \begin{array}{ll} r(c),&{} x\ge c,\\ r(x),&{}x\in [0,c),\\ 0,&{}x<0, \end{array}\right. } \end{aligned}$$

with \(r(x)=x^3(x^5-2x^4+2)\) and \(1\le c<\infty\). (We admit that c goes to infinity, with \(r(c)\rightarrow \infty\).) Clearly, s is continuous, \(s(0)=0\) and \(s'(x)=\frac{\alpha }{\gamma }\,h'\left( \frac{x}{\gamma }+\beta \right)\). Notice that for \(c=1\) one has \(h'(1^+)=h'(1^-)=0\) and \(h'(0^+)=h'(0^-)=0\), which implies that \(s'\) is continuous too. (This is not true for the second derivative.) For \(c=1\), the range of the function s is \(H_s=[s(-\beta \,\gamma ),s((1-\beta )\,\gamma )]=[-\alpha \,r(\beta ),\alpha (1-r(\beta ))], ~r(\beta )\in [0,1]\).
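A direct NumPy implementation of the definition above (our sketch, for finite c), checked at \(s(0)=0\) and at the range endpoints for \(c=1\):

```python
import numpy as np

def spocu(x, alpha, beta, gamma, c=1.0):
    """SPOCU: s(x) = alpha*h(x/gamma + beta) - alpha*h(beta)."""
    r = lambda t: t ** 3 * (t ** 5 - 2.0 * t ** 4 + 2.0)
    def h(t):
        t = np.asarray(t, dtype=float)
        # h(t) = 0 for t < 0, r(t) on [0, c), r(c) for t >= c
        return np.where(t < 0.0, 0.0, r(np.minimum(t, c)))
    return alpha * h(x / gamma + beta) - alpha * r(beta)

a, b, g = 2.1959, 0.6641, 1.0       # parameters computed in Sect. 5.1
print(float(spocu(0.0, a, b, g)))   # 0.0, since s(0) = 0

# range endpoints for c = 1: [-alpha*r(beta), alpha*(1 - r(beta))]
lo = float(spocu(-b * g, a, b, g))
hi = float(spocu((1 - b) * g, a, b, g))
print(round(lo, 4), round(hi, 4))   # ≈ -1.1192 and ≈ 1.0767
```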

5.1 Theoretical comparison

Let us first look at the SNN based on SPOCU. We selected \(c=1\) and \(\gamma =1\), and we computed numerically \(\alpha =2.1959, \beta =0.6641\) from Eq. (10); the Jacobian matrix is

$$\begin{aligned} \mathbf {J}= \left[ \begin{array}{cc} 0.8603&{}-0.0098 \\ - 0.0269 &{} 0.1001 \end{array} \right] \end{aligned}$$

with \(||\mathbf {J}||<1.\) See Fig. 8 for the sigmoidal function s with this choice of parameters. Notice that for these parameters the conditions of Theorem 4.2 are not satisfied, which confirms that the theorem does not give a necessary condition. Nevertheless, we have been able to find a triple of parameters which yields an SNN derived from SPOCU.
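The fixed-point equations can be checked numerically for these parameters (our sketch, using a fine midpoint rule for the Gaussian integrals): \(\mathbb {E}_{f(z;0,1)}[s(z)]\approx 0\) and \(\mathbb {E}_{f(z;0,1)}[s^2(z)]\approx 1\).

```python
import numpy as np

a, b, g = 2.1959, 0.6641, 1.0   # SPOCU parameters from the text (c = 1)

def spocu(x):
    r = lambda t: t ** 3 * (t ** 5 - 2.0 * t ** 4 + 2.0)
    t = x / g + b
    # h vanishes for t < 0 and is constant r(1) = 1 for t >= c = 1
    h = np.where(t < 0.0, 0.0, r(np.clip(t, 0.0, 1.0)))
    return a * h - a * r(b)

# E[s(z)] and E[s(z)^2] under the standard normal via a midpoint rule
step = 1e-4
z = np.arange(-10.0, 10.0, step) + step / 2
phi = np.exp(-z ** 2 / 2) / np.sqrt(2.0 * np.pi)
m1 = np.sum(spocu(z) * phi) * step
m2 = np.sum(spocu(z) ** 2 * phi) * step
print(round(m1, 3), round(m2, 3))   # ≈ 0.0 and ≈ 1.0
```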

Notice that SNNs cannot be derived with, e.g., ReLU, sigmoid units, tanh units and leaky ReLU. This gives an advantage of SPOCU over ReLU, but not necessarily over SELU. The same is true for another desired property: if the activation function is nonlinear, then a two-layer neural network can be proved to be a universal function approximator; see [4]. Gradient-based training methods tend to be more stable if the activation function has finite range, which among the three holds only for SPOCU. Further properties that give SPOCU an advantage over ReLU and SELU are continuous differentiability, which enables gradient-based optimization methods, and the fact that it approximates the identity near the origin, i.e., \(s'(0)=\frac{\alpha }{\gamma }\,r'(\beta )=1\). The neural network then learns efficiently when its weights are initialized with small random values; otherwise, special care must be taken when initializing the weights; see [23]. In the SPOCU case, this is possible thanks to the additional free parameter. Monotonicity is the only property shared by all three activation functions; thus, the error surface associated with a single-layer model is guaranteed to be convex; see [26].

Table 3 summarizes the comparison. As we can see, SPOCU has six of the seven desirable properties, whereas ReLU and SELU possess only two and three of them, respectively.

Table 3 Comparison of the properties for three activation functions
Fig. 8

SPOCU activation functions

5.2 Experimental comparison

Here, we show that SPOCU significantly outperforms the other two activation functions on a simple two-layer DNN model. We used the source code from iris_dnn.R, [28]. For illustration, we use a small dataset, Edgar Anderson's Iris Data (iris), a well-known built-in dataset in stock R for machine learning. We built a two-layer DNN model and subsequently tested it. First, we transformed the data into the interval \([-\beta \,\gamma ,(1-\beta )\,\gamma ]\) by the mapping \(\gamma \,\frac{\mathbf{x}-\min \mathbf{x}}{\max \mathbf{x}-\min \mathbf{x}}-\beta \,\gamma\) with the parameters given above, in order to capture the polynomial (sigmoidal) influence. The dataset was then split into two parts for training and testing, and the training set was used to train the model. See Fig. 9 for the results on the data loss in the training set and the accuracy on the test set compared to SELU and ReLU. For both criteria, SPOCU outperformed both SELU and ReLU.
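The transformation step above is a plain min-max rescaling into SPOCU's range interval; a minimal sketch (the feature values below are illustrative, not from the iris experiment itself):

```python
def to_spocu_range(xs, beta=0.6641, gamma=1.0):
    """Min-max transform of a feature vector into [-beta*gamma, (1-beta)*gamma],
    i.e. gamma*(x - min)/(max - min) - beta*gamma, as used for the iris features."""
    lo, hi = min(xs), max(xs)
    return [gamma * (x - lo) / (hi - lo) - beta * gamma for x in xs]

sepal_length = [5.1, 4.9, 6.3, 7.0, 5.8]  # a few illustrative values
z = to_spocu_range(sepal_length)          # min maps to -beta*gamma, max to (1-beta)*gamma
```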

Fig. 9

Loss and accuracy for transformed dataset iris and activation function s

5.3 SPOCU with \(c=\infty\)

Here, we illustrate SPOCU with \(c=\infty\), i.e., we consider activation function (11) for \(c=\infty\). For such SPOCU, the range is infinite; thus, the training is generally more efficient because pattern presentations significantly affect most of the weights. The only property we lose here is monotonicity. We computed numerically \(\alpha = 3.0937,\ \beta = 0.6653,\ \gamma = 4.437\) from equations (10); moreover, the Jacobian matrix is

$$\begin{aligned} \begin{bmatrix} 0.8331 &{} -0.1169 \\ 0.0874 &{} 0.5334 \end{bmatrix}. \end{aligned}$$

See Fig. 8 for the graph of the function S with these parameters, and Fig. 10 for the results: the data loss in the training set and the accuracy on the test set compared to SELU and ReLU. Here, the loss for SPOCU is uniformly lower (and falls much faster) than the losses of both SELU and ReLU. SPOCU's accuracy is also uniformly better up to 950 steps, after which it may be slightly worse.
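The loss of monotonicity for \(c=\infty\) can be checked directly on the polynomial r: its derivative \(r'(x)=2x^2(4x^5-7x^4+3)\) vanishes at \(x=1\) and turns negative just beyond, so without the clipping at c the activation decreases on part of its domain. A quick numerical check:

```python
def r(x):
    """r(x) = x^3 (x^5 - 2x^4 + 2); drives SPOCU when c = infinity (no clipping)."""
    return x**3 * (x**5 - 2 * x**4 + 2)

# r'(1) = 0 and r' < 0 slightly past x = 1, so r (and hence SPOCU with
# c = infinity) is not monotone:
print(r(1.0), r(1.2))  # r(1.2) < r(1.0)
```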

Fig. 10

Loss and accuracy for dataset iris and activation function S

We also validated SPOCU on the MNIST database (Modified National Institute of Standards and Technology database, [12]), a large database of handwritten digits that is commonly used for training various image processing systems and is widely used for training and testing in the field of machine learning. Each image in the dataset has dimensions of 28×28 pixels and contains a centered, grayscale digit. The model takes the image as input and outputs one of the ten possible digits (0 through 9). There are 70000 images in the data: 60000 training images and 10000 testing images. We normalized the inputs in order to facilitate training of the network. We worked with the keras and tensorflow libraries (free and open-source software libraries for dataflow and differentiable programming). A sequential model is used, where each layer has exactly one input tensor and one output tensor. A 2D convolution layer (i.e., spatial convolution over images) was used; this layer creates a convolution kernel that is convolved with the layer input to produce a tensor of outputs. A pooling layer MaxPooling2D followed by a regularization layer Dropout was also used. Between the dropout and the dense layers, there is the Flatten layer, which converts the 2D matrix data to a vector. For results, see Figs. 11 and 12. Clearly, SPOCU outperforms SELU in loss and ReLU in both criteria; moreover, SPOCU and SELU reached comparable accuracy.
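The described architecture can be sketched in keras roughly as follows. The filter counts, kernel size, dropout rate and dense-layer width are illustrative assumptions (the text does not state them), and the tensor form of SPOCU is inferred from its closed form, not taken from the paper's code:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def r(x):
    return x**3 * (x**5 - 2 * x**4 + 2)

def spocu(x, alpha=3.0937, beta=0.6653, gamma=4.437):
    """SPOCU with c = infinity as a tensor op (sketch; form inferred from the text)."""
    u = tf.maximum(x / gamma + beta, 0.0)
    return alpha * r(u) - alpha * r(beta)

# Sequential CNN as described: Conv2D -> MaxPooling2D -> Dropout -> Flatten -> Dense.
model = keras.Sequential([
    layers.Conv2D(32, (3, 3), activation=spocu, input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(128, activation=spocu),
    layers.Dense(10, activation="softmax"),  # one of the ten digits 0-9
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```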

Fig. 11


Fig. 12


6 Application: cancer tissue discrimination

Developing algorithmic methods that can assist in cancer risk assessment is an important topic nowadays. Discrimination between mammary cancer and mastopathy tissues plays a crucial role in clinical practice; see [9]. Noninvasive techniques generally lead to inverse problems, e.g., estimating the Hausdorff fractal dimension from the boundary of the examined tissue; see [10]. The main problem here can be formulated as follows: “How can cancer tissue be discriminated from healthy tissue?”

Here, we study benign prostatic hyperplasia (BPH) and normal prostate vs. prostate cancer (PC). We consider the standard coloring of images, by hematoxylin and eosin, and two magnifications, namely \(100\times\) and \(200\times\); see Fig. 13. Moreover, carcinoma of the breast and mastopathy are given in Fig. 14.

Table 4 Estimation of p and \(\dim _H\)

Here, we have used simple estimators, i.e., reduced estimators \(\mathcal {N}(k)=(k^3p)^n\), expressing the average number of retained elements, of \(k^{3n}\tilde{p}_n\), with \(k=2\) for the standard SC and \(k=3\) for the modified version (9 elements instead of 8) from Sect. 2.2. This implies \(\hat{p}=N^\frac{1}{n}/3^k\), where N is the number of 1's in the measured data matrix. Since we set the resolution of the figures to \(729\times 729\) and \(729=3^6\), we have \(n=6\). In our case, the modified SC is suitable, since the standard one yields an estimate of p greater than 1. We used binarization, i.e., each dark pixel was converted to 1 and each light pixel to 0; if the picture was not black and white, it was binarized according to the formula \((R+G+B)/3>0.5.\) The results are shown in Table 4. One can see a statistically significant difference between cancer and noncancer images, whether we take the probability parameter p or the fractal dimension \(\dim _H\). Moreover, the package fractaldim cannot detect these differences, which makes this result particularly valuable.
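The binarization and estimation steps can be sketched as follows. Note the hedge: we read the estimator as \(\hat{p}=N^{1/n}/M\), where M is the number of candidate subcells per subdivision step (9 for the modified 3×3 carpet), which is our interpretation of the text's \(N^{1/n}/3^k\) formula; the helper names are ours:

```python
def binarize(img):
    """Convert an RGB image (nested lists of (R, G, B) values in [0, 1]) to 0/1:
    dark pixels -> 1, light pixels -> 0, thresholding (R+G+B)/3 at 0.5."""
    return [[0 if (r + g + b) / 3 > 0.5 else 1 for (r, g, b) in row] for row in img]

def estimate_p(binary, n=6, subcells=9):
    """Reduced-estimator sketch: p_hat = N^(1/n) / subcells, where N is the number
    of 1's. subcells=9 assumes the modified SC (all 9 of the 3x3 cells)."""
    N = sum(sum(row) for row in binary)
    return N ** (1.0 / n) / subcells
```

As a sanity check, a fully dark \(729\times 729\) image gives \(N=3^{12}\), hence \(N^{1/6}=9\) and \(\hat{p}=1\), the maximal retention probability.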

Fig. 13

PC versus BPH

Fig. 14

CA versus MA

Notice that we cannot directly use more parameters without obtaining more information than the resulting matrix provides. This is quite interesting, since it seems that one parameter is not enough, i.e., the dimension of the problem is greater than one. We also estimated values for the \([p\cdot q]\) model based on the theory of [9] (notice that 0 and 1 have the opposite meaning in their work). The obtained results confirmed the same conclusion. We illustrate this for both CA and MA: we obtained the estimates \(\hat{p}=0.24447\) and \(\hat{q}=0.11515\) for CA, and \(\hat{p}=0.1934\) and \(\hat{q}=0.09675\) for MA. An alternative, from the perspective of inter-patient variability, is multifractality (see [15]). Developing such techniques for the analysis of several slices from a 3D tissue body will be of interest for complicated cases in National Institutes of Health (NIH) databases, such as [18]; this is a valuable direction for future research.

6.1 The diagnosis of breast tissues (M = malignant, B = benign).

Here, we compare modified SPOCU, ReLU and SELU on the Wisconsin Diagnostic Breast Cancer (WDBC) dataset [25]. We measured their classification test accuracy and loss. The data, obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg, contain measurements on cells in suspicious lumps in a woman's breast. In total, 357 observations (62.7%) indicate the absence of cancer cells, and 212 (37.3%) show the presence of cancerous cells. This proportion of positives is unusually large and does not represent a typical medical distribution; typically, there is a considerably larger number of negative cases versus a small number of positive cases (malignant tumors). The data include ten real-valued features computed for each cell nucleus. We deliberately focus only on the variables related to fractal properties (fdm and fdw: the mean and the “worst”, i.e., the mean of the three largest values, of the fractal dimension computed for each image), since cancer tissue discrimination is our ultimate goal. In Fig. 15, one can see the relation between fdm and fdw; it confirms that the clustering is by no means unambiguous. We built DNNs with keras with 3 hidden layers. The number of instances is 569, and we use 80% of the samples for training. The results are shown in Fig. 16 and Table 5. SPOCU achieved the best results, almost 80% absolute accuracy, which corresponds to \(96{-}99\%\) performance with respect to the benchmark developed in [9].
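The 80/20 split of the 569 WDBC instances can be sketched as follows; the helper and the fixed seed are illustrative assumptions, not taken from the paper's code:

```python
import random

def split_indices(n, train_frac=0.8, seed=1):
    """Shuffle the indices 0..n-1 and split them into train/test index lists
    (80/20, as used for the WDBC experiment)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(n * train_frac)
    return idx[:cut], idx[cut:]

train_idx, test_idx = split_indices(569)  # WDBC has 569 instances
```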

Table 5 Comparison of the three activation functions at the final epoch
Fig. 15

fdm versus fdw for

Fig. 16

Loss and accuracy

7 Conclusion

We introduced the novel percolation-based activation function SPOCU, which is flexible and, in several important setups, outperformed the classical SELU and ReLU approaches. We successfully validated SPOCU on both small and large datasets, including the Wisconsin Diagnostic Breast Cancer (WDBC) dataset and the large MNIST dataset. We also provided careful theoretical comparisons of SPOCU with its SELU and ReLU competitors.