1 Introduction

Convolutional Neural Networks (CNNs) are among the most widely used tools in artificial intelligence. In recent years, they have become a key tool in numerous fields such as image classification, face recognition and machine translation (see [13] and references therein). The correct choice of the activation function can significantly affect the performance of a CNN. Sometimes, a chosen activation function does not possess all the properties and characteristics required by a specific CNN. Usually, the selection of activation functions is manual and relies essentially on the architecture of the Neural Network (NN), which leads to exhaustive trial-and-error methodologies, where the NN is retrained for each activation function until an optimal configuration is found. Independently of the considered approach, it is clear that the task of introducing new activation functions is not easy and its benefits are sometimes limited. This limitation comes from the fact that many of the proposals presented in the literature have proved to be inconsistent and task-dependent, and therefore the classical activation functions (for example, the ReLU) maintain their predominance in practical applications.

The algebra of quaternions is the best-known extension of the field of complex numbers to the four-dimensional setting. On one hand, one of the main advantages of this extension is that the quaternions form a division algebra in which all the customary arithmetic operations are available. On the other hand, quaternion multiplication is not commutative, which creates several unwanted problems when extending the theory of holomorphic functions of one complex variable. A possible way to overcome the issue of non-commutativity is to consider the bicomplex numbers, a four-dimensional commutative algebra with the classical operations that contains \(\mathbb {C}\) as a subalgebra. Just as a complex number can be treated as two independent real numbers, a bicomplex number can be treated as two complex numbers or four real numbers.

The bicomplex algebra can be applied in several fields. For example, in [18], the authors point out that the so-called IHS-colour space representation (i.e., Intensity–Hue–Saturation), which has broad applications, particularly in human vision, can be mathematically represented by bicomplex-valued quantities. In [8] the authors take advantage of the idempotent representation of the bicomplex algebra to prove that a bicomplex sparse signal can be reconstructed, with high probability, from a reduced number of bicomplex random samples. Moreover, the availability of four separate components allows the consideration of three- and four-dimensional input feature vectors, which arise in image processing and robot kinematics [6, 21], and also in the field of capsule networks [20]. Thereby, bicomplex neural network-based models are able to encode latent interdependencies between groups of input features during the learning process with fewer parameters than traditional NNs, by taking advantage of the component-wise multiplication available for bicomplex numbers. The bicomplex approach gives a better way of combining two complex-valued networks than the usual two-valued approach in the literature [3], as we can treat the theory as a single-variable one due to its algebra structure. In our approach, the structure of the bicomplex algebra is essential in providing a convergent algorithm. For more details about the theory of bicomplex numbers and its applications we refer the interested reader to [4, 9, 10, 12].

The objective of this paper is to introduce a BCCNN where a generic activation function of Bessel type is considered. Our purpose is to generalize the results presented in the literature for the quaternionic case and to present an application of the novel activation functions of hypergeometric type introduced in [25]. The structure of the paper reads as follows: Sect. 2 is dedicated to recalling some basic definitions concerning bicomplex numbers and special functions, which are necessary for the understanding of this work. In Sect. 3, based on the ideas presented in [25], we introduce a general family of hypergeometric activation functions with several trainable parameters. In the following section, we describe our BCCNN. In the last section, we perform some numerical experiments using the Colored MNIST dataset to validate our approach.

2 Preliminaries

2.1 Bicomplex Algebra

The set \(\mathbb{B}\mathbb{C}\) of bicomplex numbers is defined by

$$\begin{aligned} \mathbb{B}\mathbb{C}:=\left\{ z_1+{\textbf{j}}z_2: \,z_1, z_2 \in \mathbb {C}\right\} , \end{aligned}$$

where \(\mathbb {C}\) is the set of complex numbers with the imaginary unit \({\textbf{i}}\), and where \({\textbf{i}}\) and \({\textbf{j}}\) are commuting imaginary units, i.e., \({\textbf{i}}{\textbf{j}}={\textbf{k}}={\textbf{j}}{\textbf{i}}\), \({\textbf{i}}^2 ={\textbf{j}}^2 =-1\), and \({\textbf{k}}^2 =+1\). Bicomplex numbers can be added and multiplied: if \(Z=z_1 +{\textbf{j}}z_2\) and \(W =w_1 +{\textbf{j}}w_2\) are two bicomplex numbers, we have that

$$\begin{aligned}&Z +W := \left( z_1 +w_1 \right) +{\textbf{j}}\left( z_2 +w_2 \right) , \\&Z \cdot W :=\left( z_1w_1 -z_2w_2 \right) +{\textbf{j}}\left( z_1w_2 +z_2w_1 \right) , \end{aligned}$$

where the product is not component-wise. From the previous multiplication rules involving \({\textbf{i}}\), \({\textbf{j}}\), and \({\textbf{k}}\), we decompose \(\mathbb{B}\mathbb{C}\) in two ways. Indeed, \(\mathbb{B}\mathbb{C}=\mathbb {C}\left( {\textbf{i}} \right) +{\textbf{j}}\mathbb {C}\left( {\textbf{i}} \right) \), where

$$\begin{aligned} \mathbb {C}\left( {\textbf{i}} \right) =\left\{ Z=z_1+{\textbf{j}}z_2: \,z_2=0\right\} . \end{aligned}$$

Likewise, we have \(\mathbb{B}\mathbb{C}=\mathbb {C}\left( {\textbf{j}} \right) +{\textbf{i}}\mathbb {C}\left( {\textbf{j}} \right) \). Moreover, if we consider \(z_1=x_1+{\textbf{i}}y_1\) and \(z_2 =x_2 +{\textbf{i}}y_2\), with \(x_1, y_1, x_2, y_2 \in \mathbb {R}\), we have the following alternative form of presenting a bicomplex number

$$\begin{aligned} Z \,=z_1 +{\textbf{j}}z_2 \,=x_1 +{\textbf{i}}y_1 +{\textbf{j}}x_2 +{\textbf{k}}y_2. \end{aligned}$$

The structure of \(\mathbb{B}\mathbb{C}\) suggests three possible conjugations on \(\mathbb{B}\mathbb{C}\):

  • the bar-conjugation: \({\overline{Z}} ={\overline{z}}_1 +{\textbf{j}}{\overline{z}}_2\);

  • the \(\dagger \)-conjugation: \(Z^\dagger =z_1-{\textbf{j}}z_2\);

  • the \(*\)-conjugation: \(Z^*=\overline{Z^\dagger } =\left( {\overline{Z}} \right) ^\dagger ={\overline{z}}_1 -{\textbf{j}}{\overline{z}}_2\),

where \({\overline{z}}_1\) and \({\overline{z}}_2\) are the usual complex conjugates of \(z_1, z_2 \in \mathbb {C}\left( {\textbf{i}} \right) \). The Euclidean norm of a bicomplex number Z on \(\mathbb{B}\mathbb{C}\), when it is seen as

$$\begin{aligned} \mathbb {C}^2\left( {\textbf{i}} \right) \,:=\mathbb {C}\left( {\textbf{i}} \right) \times \mathbb {C}\left( {\textbf{i}} \right) \,:=\left\{ \left( z_1,z_2 \right) : \,z_1 +{\textbf{j}}z_2 \in \mathbb{B}\mathbb{C}\right\} \end{aligned}$$

or as

$$\begin{aligned} \mathbb {R}^4 =\left\{ \left( x_1, y_1, x_2, y_2 \right) : \, \left( x_1 +{\textbf{i}}y_1 \right) +{\textbf{j}}\left( x_2+{\textbf{i}}y_2 \right) \in \mathbb{B}\mathbb{C}\right\} \end{aligned}$$

is given by

$$\begin{aligned} \left\| Z\right\| =\sqrt{\left| z_1\right| ^2 +\left| z_2\right| ^2} = \sqrt{x_1^2 +y_1^2 +x_2^2 +y_2^2}. \end{aligned}$$
(2.1)

The bicomplex space, \(\mathbb{B}\mathbb{C}\), is not a division algebra, and it has two distinguished zero divisors, namely, \({\textbf{e}}_1\) and \({\textbf{e}}_2\), which are idempotent, linearly independent over the reals, and mutually annihilating with respect to the bicomplex multiplication (see [12, Prop. 1.6.1]):

$$\begin{aligned}&{\textbf{e}}_1:=\frac{1+{\textbf{k}}}{2}, \qquad {\textbf{e}}_2:=\frac{1-{\textbf{k}}}{2}, \\&{\textbf{e}}_1\cdot {\textbf{e}}_2=0, \qquad {\textbf{e}}_1^2={\textbf{e}}_1, \qquad {\textbf{e}}_2^2={\textbf{e}}_2, \qquad {\textbf{e}}_1+{\textbf{e}}_2=1, \qquad {\textbf{e}}_1-{\textbf{e}}_2={\textbf{k}}. \end{aligned}$$

We have that \(\left\{ {\textbf{e}}_1, {\textbf{e}}_2\right\} \) forms an idempotent basis of the bicomplex algebra \(\mathbb{B}\mathbb{C}\). Considering the complex numbers \(\beta _1 := z_1-{\textbf{i}}z_2\) and \(\beta _2 :=z_1 +{\textbf{i}}z_2\) in \(\mathbb {C}\left( {\textbf{i}} \right) \), we have that the idempotent representation of \(Z=z_1+{\textbf{j}}z_2 \in \mathbb{B}\mathbb{C}\) is given by \(Z=\beta _1{\textbf{e}}_1+\beta _2{\textbf{e}}_2\). This idempotent representation is the only representation for which multiplication is component-wise, as indicated in the next proposition.

Proposition 2.1

(cf. [12, Prop. 1.6.3]) The addition and multiplication of bicomplex numbers can be realized component-wise in the idempotent representation presented previously. Specifically, if \(Z=a_1{\textbf{e}}_1+a_2{\textbf{e}}_2\) and \(W=b_1{\textbf{e}}_1+b_2{\textbf{e}}_2\) are two bicomplex numbers, where \(a_1, a_2, b_1, b_2 \in \mathbb {C}\left( {\textbf{i}} \right) \), then

$$\begin{aligned}&Z+W=\left( a_1+b_1 \right) {\textbf{e}}_1+\left( a_2+b_2 \right) {\textbf{e}}_2, \\&Z \cdot W=\left( a_1b_1 \right) {\textbf{e}}_1+\left( a_2b_2 \right) {\textbf{e}}_2, \\&Z^n =a_1^n{\textbf{e}}_1+a_2^n{\textbf{e}}_2. \end{aligned}$$

The multiplicative inverse of a bicomplex number \(Z=a_1{\textbf{e}}_1+a_2{\textbf{e}}_2\), with \(a_1 \cdot a_2 \ne 0\), is given by \(Z^{-1} =a_1^{-1}{\textbf{e}}_1+a_2^{-1}{\textbf{e}}_2\), where \(a_1^{-1}\) and \(a_2^{-1}\) are the complex multiplicative inverses of \(a_1\) and \(a_2\), respectively (see [12, Thm. 1.6.5]). A bicomplex number may also be identified with a real \(4 \times 4\) matrix (which turns out to be more suitable for computations):

$$\begin{aligned} \varphi _{\mathbb {R}}: \,Z =x_1 +{\textbf{i}}y_1 +{\textbf{j}}x_2 +{\textbf{k}}y_2 \in \mathbb{B}\mathbb{C}\mapsto \begin{pmatrix} x_1 &{} -y_1 &{} -x_2 &{} y_2 \\ y_1 &{} x_1 &{}-y_2 &{} -x_2 \\ x_2 &{} -y_2 &{} x_1 &{} -y_1 \\ y_2 &{} x_2 &{} y_1 &{} x_1 \end{pmatrix}. \end{aligned}$$
(2.2)
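The following minimal numerical sketch (not part of the original paper) illustrates the arithmetic of this subsection: a bicomplex number is stored as a pair \((z_1, z_2)\) of complex numbers, and the product is computed directly, via the idempotent representation, and via the real matrix representation (2.2); all three agree.

```python
# Bicomplex arithmetic sketch: direct product, idempotent product, matrix representation.
import numpy as np

def bc_mul(Z, W):
    """(z1 + j z2)(w1 + j w2) = (z1 w1 - z2 w2) + j (z1 w2 + z2 w1)."""
    (z1, z2), (w1, w2) = Z, W
    return (z1 * w1 - z2 * w2, z1 * w2 + z2 * w1)

def to_idempotent(Z):
    """beta1 = z1 - i z2, beta2 = z1 + i z2, so that Z = beta1 e1 + beta2 e2."""
    z1, z2 = Z
    return (z1 - 1j * z2, z1 + 1j * z2)

def from_idempotent(B):
    """Invert the idempotent representation: z1 = (b1 + b2)/2, z2 = (b2 - b1)/(2i)."""
    b1, b2 = B
    return ((b1 + b2) / 2, (b2 - b1) / 2j)

def real_matrix(Z):
    """Real 4x4 matrix of Z = x1 + i y1 + j x2 + k y2, as in (2.2)."""
    z1, z2 = Z
    x1, y1, x2, y2 = z1.real, z1.imag, z2.real, z2.imag
    return np.array([[x1, -y1, -x2,  y2],
                     [y1,  x1, -y2, -x2],
                     [x2, -y2,  x1, -y1],
                     [y2,  x2,  y1,  x1]])

Z, W = (1 + 2j, 3 - 1j), (0.5 - 1j, 2 + 0.5j)
P = bc_mul(Z, W)
# Component-wise product in the idempotent basis (Proposition 2.1).
P_idem = from_idempotent([bz * bw for bz, bw in zip(to_idempotent(Z), to_idempotent(W))])
# Matrix representation acting on the real components of W.
vecW = np.array([W[0].real, W[0].imag, W[1].real, W[1].imag])
vecP = np.array([P[0].real, P[0].imag, P[1].real, P[1].imag])
print(np.allclose(P, P_idem), np.allclose(real_matrix(Z) @ vecW, vecP))  # True True
```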

Every \(4 \times 4\) matrix determines a linear (more exactly, a real-linear) transformation of \(\mathbb {R}^4\); however, not all of them remain \(\mathbb{B}\mathbb{C}\)-linear when \(\mathbb {R}^4\) is identified with \(\mathbb{B}\mathbb{C}\). In the context of bicomplex convolutional neural networks, some activation functions involving bicomplex numbers have been proposed in the literature. For example, in [4] the authors considered the activation function \(P\left( z \right) =\epsilon ^l\) in \(T \subset \mathbb {C}^n\), where \(\epsilon =\exp \left( \frac{2\pi {\textbf{i}}}{k} \right) \) is the k-th root of unity, and l is such that \(\frac{2\pi l}{k} \le \arg \left( z \right) <\frac{2\pi \left( l+1 \right) }{k}\).

2.2 Special Functions

In this work, we make use of the generalized hypergeometric function \({}_pF_q\), which is defined by (see [19])

$$\begin{aligned} ^{}_{p}F_{q}\left( {a_1, \,\ldots , \,a_p; \,b_1, \ldots , \,b_q; \,z}\right)= & {} ^{}_{p}F_{q}\left( {\left( a_j \right) _{1:p}; \,\left( b_i \right) _{1:q}; \,z}\right) \nonumber \\= & {} \sum _{l=0}^{+\infty } \frac{\prod _{j=1}^{p}\left( {a_j}\right) _{l}}{\prod _{i=1}^{q}\left( {b_i}\right) _{l}} \,\frac{z^l}{l!}, \end{aligned}$$
(2.3)

where the convergence is guaranteed if one of the following conditions is satisfied:

$$\begin{aligned}{} & {} p \le q; \nonumber \\{} & {} q=p-1 \quad \wedge \quad |z|<1; \nonumber \\{} & {} q=p-1 \quad \wedge \quad \textrm{Re}\left( \sum _{i=1}^{p-1}b_i -\sum _{j=1}^{p}a_j \right) >0 \quad \wedge \quad |z|=1. \end{aligned}$$
(2.4)

Moreover, (2.3) is an analytic function of \(a_1, \ldots , a_p\), \(b_1, \ldots , b_q\) and z, defined on \(\mathbb {C}^{p+q+1}\). In the case \(p \le q\), for fixed \(a_1, \ldots , a_p\), \(b_1, \ldots , b_q\), it is an entire function of z. If one of the parameters \(a_j\) is a negative integer, the function (2.3) degenerates into a polynomial in z.

Another special function that will play an important role in this work is the Bessel function of the first kind \(J_\nu \), which is defined by the following series (see [1])

$$\begin{aligned} J_{\nu }\left( z \right) =\sum _{k=0}^{+\infty } \frac{\left( -1 \right) ^k}{\Gamma \left( k +\nu +1 \right) \,k!} \,\left( \frac{z}{2} \right) ^{2k+\nu }. \end{aligned}$$
(2.5)

The Bessel function is related to the hypergeometric function \({}_0F_1\) by the following relation (see [1])

$$\begin{aligned} J_{\nu }\left( z \right) =\frac{1}{\Gamma \left( 1+\nu \right) } \left( \frac{z}{2} \right) ^\nu \,^{}_{0}F_{1}\left( {; \,1+\nu ; \,-\frac{z^2}{4}}\right) , \end{aligned}$$
(2.6)

which holds whenever \(\nu \) is not a negative integer. For more details about hypergeometric functions and other special functions, we refer, for example, to [1, 2, 14].
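As a quick illustration (not part of the original paper), relation (2.6) can be checked numerically with SciPy:

```python
# Numerical check of (2.6): J_nu(z) = (z/2)^nu / Gamma(1+nu) * 0F1(; 1+nu; -z^2/4).
import numpy as np
from scipy.special import jv, hyp0f1, gamma

nu, z = 1.5, 0.8
lhs = jv(nu, z)
rhs = (z / 2) ** nu / gamma(1 + nu) * hyp0f1(1 + nu, -z**2 / 4)
print(np.isclose(lhs, rhs))  # True
```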

3 General Activation Functions

As mentioned in the introduction, the correct choice of the activation function can significantly affect the performance of a CNN. Sometimes, a chosen activation function does not possess all the properties and characteristics required by a specific CNN. Usually, the selection of activation functions is manual and relies essentially on the architecture of the NN, which leads to exhaustive trial-and-error methodologies, where the NN is retrained for each activation function until an optimal configuration is found.

In this sense, and following the ideas presented in [25], we consider in our CNN a general multi-parametric activation function in the context of automatic activation function design. Our work is inspired by the parametric activation functions presented in [7] and the adaptive activation functions introduced in [26], which we adapt to deal with hypergeometric functions. More precisely, we consider the following activation function:

$$\begin{aligned} {\mathcal {H}}\left( x \right) =c_1 +c_2 \,x^{c_3} +c_4 \,x^{c_5} \,^{}_{p}F_{q}\left( {\left( a_j \right) _{1:p}; \,\left( b_i \right) _{1:q}; \,c_6 \,x^{c_7}}\right) , \end{aligned}$$
(3.1)

where \(c_1, c_2, c_4, c_6 \in \mathbb {R}\), \(c_5 \in \mathbb {N}_0\), \(c_3,c_7 \in \mathbb {N}\), and the parameters in the hypergeometric function satisfy (2.4). Due to the large number of parameters, (3.1) can be used to approximate every continuous function on a compact set. Moreover, whenever convergence is guaranteed, it is possible to define sub-ranges of the several parameters that appear in (3.1) so that the elements of the proposed class have desirable properties for the role of an activation function.
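A minimal numerical sketch of (3.1) (not the authors' code) can be written with mpmath, whose `hyper` routine evaluates the generalized \({}_pF_q\); the parameter names follow (3.1) and the concrete values below are only an example.

```python
# Multi-parametric activation (3.1), evaluated with mpmath.hyper for the pFq term.
import math
from mpmath import hyper, mpf

def H(x, c, a, b):
    """c = (c1, ..., c7); a and b are the lists of upper and lower pFq parameters."""
    c1, c2, c3, c4, c5, c6, c7 = c
    return c1 + c2 * x**c3 + c4 * x**c5 * hyper(a, b, c6 * x**c7)

# Example: the Bessel-type choice (3.3) with nu = 1, i.e. H(x) = sqrt(pi/2) * x * J_1(x).
nu = 1
c = (0, 0, 1, math.sqrt(math.pi / 2) * 2**(-nu) / math.gamma(1 + nu), 2 * nu, -0.25, 2)
print(H(mpf("0.5"), c, [], [1 + nu]))
```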

The multi-parametric activation function (3.1) groups several of the standard activation functions proposed in the literature for deep NN. In Table 1, we indicate which cases are included in (3.1).

Table 1 General activation function \({\mathcal {H}}\left( x \right) \)

Moreover, [25] indicates in detail how the activation functions in Table 1 are derived from the general expression (3.1). For example, if we consider \(c_1=c_4=0\), \(c_3=1\), \(c_2=0\) (for \(x<0\)) or \(c_2=1\) (for \(x \ge 0\)), \(c_5, c_7 \in \mathbb {N}\), \(c_6 \in \mathbb {R}\), and \(p=q=0\), we obtain

$$\begin{aligned} {\mathcal {H}}\left( x \right) ={\left\{ \begin{array}{ll} 0 &{} \text{ for } x <0 \\ x &{} \text{ for } x\ge 0 \end{array}\right. }, \end{aligned}$$
(3.2)

which corresponds to the classical Rectified linear unit (ReLU).

Let us now focus on a particular case of (3.1) that involves the Bessel function of the first kind \(J_{\nu }\left( x \right) \), with \(\nu \) a positive half-integer. In fact, if we consider

$$\begin{aligned}{} & {} c_1=c_2=0, \quad c_4=\sqrt{\frac{\pi }{2}} \,\frac{2^{-\nu }}{\Gamma \left( 1+\nu \right) }, \quad c_5=2\nu , \nonumber \\{} & {} p=0, \quad q=1 \,\left( b_1=1+\nu \right) , \quad c_6=-\frac{1}{4}, \quad c_7=2, \end{aligned}$$
(3.3)

in (3.1), we obtain

$$\begin{aligned} {\mathcal {H}}\left( x \right) \,=\sqrt{\frac{\pi }{2}} \,\frac{2^{-\nu }}{\Gamma \left( 1+\nu \right) } \,x^{2\nu } \,^{}_{0}F_{1}\left( {-; \,1+\nu ; \,-\frac{x^2}{4}}\right) \,=\sqrt{\frac{\pi }{2}} \,x^\nu \,J_{\nu }\left( x \right) , \end{aligned}$$
(3.4)

which corresponds to a one-parameter activation function. It follows from the properties of the Bessel function of the first kind that for half-integer values of \(\nu \) the activation function (3.4) reduces to a combination of polynomials and elementary trigonometric functions such as \(\sin \) and \(\cos \). In fact, for the first four positive half-integers, we have that

$$\begin{aligned}{} & {} \nu =\frac{1}{2} \,\Rightarrow \,{\mathcal {H}}\left( x \right) \,=\sqrt{\frac{\pi }{2}} \,x^{\frac{1}{2}}J_{\frac{1}{2}}\left( x \right) \,=\sin \left( x \right) , \end{aligned}$$
(3.5)
$$\begin{aligned}{} & {} \nu =\frac{3}{2} \,\Rightarrow \,{\mathcal {H}}\left( x \right) \,=\sqrt{\frac{\pi }{2}} \,x^{\frac{3}{2}}J_{\frac{3}{2}}\left( x \right) \,=\sin \left( x \right) -x \,\cos \left( x \right) , \end{aligned}$$
(3.6)
$$\begin{aligned}{} & {} \nu =\frac{5}{2} \,\Rightarrow \,{\mathcal {H}}\left( x \right) \,=\sqrt{\frac{\pi }{2}} \,x^{\frac{5}{2}}J_{\frac{5}{2}}\left( x \right) \,=-\left( x^2-3 \right) \sin \left( x \right) -3x \,\cos \left( x \right) , \end{aligned}$$
(3.7)
$$\begin{aligned}{} & {} \nu =\frac{7}{2} \,\Rightarrow \,{\mathcal {H}}\left( x \right) \,=\sqrt{\frac{\pi }{2}} \,x^{\frac{7}{2}}J_{\frac{7}{2}}\left( x \right) \,=3\left( 5-2x^2 \right) \sin \left( x \right) +x\left( x^2-15 \right) \cos \left( x \right) . \nonumber \\ \end{aligned}$$
(3.8)
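These closed forms can be verified numerically (an illustrative check, not part of the original paper) against SciPy's Bessel routine:

```python
# Sanity check that (3.5)-(3.8) agree with sqrt(pi/2) * x^nu * J_nu(x).
import numpy as np
from scipy.special import jv

x = np.linspace(0.1, 6.0, 50)
closed_forms = {
    0.5: np.sin(x),
    1.5: np.sin(x) - x * np.cos(x),
    2.5: -(x**2 - 3) * np.sin(x) - 3 * x * np.cos(x),
    3.5: 3 * (5 - 2 * x**2) * np.sin(x) + x * (x**2 - 15) * np.cos(x),
}
for nu, expr in closed_forms.items():
    bessel_form = np.sqrt(np.pi / 2) * x**nu * jv(nu, x)
    print(nu, np.allclose(bessel_form, expr))  # all True
```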

Although (3.4) (and hence also (3.5)–(3.8)) is not monotonic on the whole positive real line, we can restrict our activation functions to intervals of the form \(I_\nu =[0,M_\nu ]\), where \(M_\nu \) is the first positive zero of the Bessel function \(J_{\nu -1}\left( x \right) \) and corresponds to the first positive maximum point of \(J_{\nu }\left( x \right) \). In order to improve our results (see [25]), we consider from now on the following linear combination of (3.5)–(3.8)

$$\begin{aligned} {\mathcal {B}}\left( x \right) \,=\sqrt{\frac{\pi }{2}} \left[ \beta _1 \,x^{\frac{1}{2}} \,J_{\frac{1}{2}}\left( x \right) +\beta _2 \,x^{\frac{3}{2}} \,J_{\frac{3}{2}}\left( x \right) +\beta _3 \,x^{\frac{5}{2}} \,J_{\frac{5}{2}}\left( x \right) +\beta _4 \,x^{\frac{7}{2}} \,J_{\frac{7}{2}}\left( x \right) \right] ,\nonumber \\ \end{aligned}$$
(3.9)

i.e., we combine the Bessel functions (3.5)–(3.8) with trainable parameters in order to dynamically adjust how much each Bessel function contributes to the final activation function.
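A possible PyTorch realization of (3.9) is sketched below (this is an illustrative implementation under our own naming conventions, not the authors' code); the half-integer closed forms (3.5)–(3.8) are used directly, and \(\beta _1,\ldots ,\beta _4\) are registered as trainable parameters.

```python
# Trainable Bessel-type activation (3.9), using the closed forms (3.5)-(3.8).
import torch
import torch.nn as nn

class BesselActivation(nn.Module):
    def __init__(self):
        super().__init__()
        # beta_i are learned jointly with the network weights; initialized to 1.
        self.beta = nn.Parameter(torch.ones(4))

    def forward(self, x):
        sin, cos = torch.sin(x), torch.cos(x)
        terms = torch.stack([
            sin,                                               # nu = 1/2, eq. (3.5)
            sin - x * cos,                                     # nu = 3/2, eq. (3.6)
            -(x**2 - 3) * sin - 3 * x * cos,                   # nu = 5/2, eq. (3.7)
            3 * (5 - 2 * x**2) * sin + x * (x**2 - 15) * cos,  # nu = 7/2, eq. (3.8)
        ])
        # Weighted sum of the four Bessel terms with the trainable coefficients.
        return torch.einsum('i,i...->...', self.beta, terms)
```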

4 Bicomplex Convolutional Neural Network

In this section, we define our BCCNN and an appropriate parameter initialization. The BCCNN can be understood as a generalization of the quaternionic convolutional neural network (QCNN) (see [5, 15,16,17]) and of the classical real-valued deep CNN (see [11]) to the case of bicomplex numbers. Taking into account [16, 17, 23] on quaternionic CNNs and the theory of bicomplex numbers [12], the bicomplex convolution operation is performed via the real matrix representation (2.2). Hence, the one-dimensional convolutional layer, with a kernel that contains the feature maps, is split into 4 parts: the first part equal to \(x_1\), the second one to \({\textbf{i}}y_1\), the third one to \({\textbf{j}}x_2\), and the last one to \({\textbf{k}}y_2\) of a bicomplex number \(Z =x_1 +{\textbf{i}}y_1 +{\textbf{j}}x_2 +{\textbf{k}}y_2\).

For the activation function, we consider a combination of the so-called split activation introduced in [24] for the quaternionic case with the real-valued activation function (3.9) defined in terms of Bessel functions, i.e.,

$$\begin{aligned} {\mathcal {F}}\left( Z \right) ={\mathcal {B}}\left( x_1 \right) +{\textbf{i}}{\mathcal {B}}\left( y_1 \right) +{\textbf{j}}{\mathcal {B}}\left( x_2 \right) +{\textbf{k}}{\mathcal {B}}\left( y_2 \right) . \end{aligned}$$
(4.1)
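A sketch of the split activation (4.1) is given below (our own illustrative code, reusing the `BesselActivation` module from the sketch after (3.9)); it assumes that the four real components \(x_1, y_1, x_2, y_2\) of a bicomplex feature map are stored as four equal blocks along the channel axis.

```python
# Split activation (4.1): apply B component-wise to the four bicomplex parts.
import torch
import torch.nn as nn

class BicomplexSplitActivation(nn.Module):
    def __init__(self):
        super().__init__()
        self.act = BesselActivation()  # trainable Bessel activation from the previous sketch

    def forward(self, z):
        # Channel axis assumed to be ordered as [x1 | y1 | x2 | y2].
        x1, y1, x2, y2 = torch.chunk(z, 4, dim=1)
        return torch.cat([self.act(c) for c in (x1, y1, x2, y2)], dim=1)
```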

Taking into account the properties of the Bessel functions and the ideas presented in [4], we can introduce the concept of threshold function associated with our activation function (4.1).

Definition 4.1

Let \(n \ge 1\) and \(T \subset \mathbb {C}^2\left( {\textbf{i}} \right) \). A complex-valued function \(f: T \rightarrow \mathbb {C}\) is called a threshold function if there exists a weighting vector \(W =\left( w_0, w_1, w_2 \right) \), with \(w_i \in \mathbb {C}\left( {\textbf{i}} \right) \), such that:

$$\begin{aligned} f\left( z_1, z_2 \right) ={\mathcal {F}}\left( w_0 +w_1z_1 +w_2z_2 \right) , \qquad z_1, z_2 \in T. \end{aligned}$$

Moreover, proceeding similarly as in the proof of Theorem 3.3 of [4], we have the following result:

Theorem 4.2

Let \(T \subset \mathbb {C}^2\left( {\textbf{i}} \right) \) be a bounded domain, \(f: T \rightarrow \mathbb {C}\) a threshold function, and \(\left( w_0, 0, 0 \right) \) a weighting vector of \(f\left( z_1,z_2 \right) \). Then, there exist \(w_0' \in \mathbb {C}\) and \(\delta >0\) such that \(\left( w_0', w_1, w_2 \right) \) is a weighting vector of f whenever \(\left| w_j\right| <\delta \), \(j=1, 2\).

A differentiable cost function guarantees that backward propagation can be performed. More precisely, the gradient of a loss function J is expressed with respect to each component of the bicomplex weights \(w^l\) that compose the matrix \(W^l\) at layer l, the output layer quantifying the error with respect to the target vector for each neuron. The convolution of a bicomplex filter matrix with a bicomplex vector is performed taking into account the previous multiplication rules. In fact, let \(W =X_1 +{\textbf{i}}Y_1 +{\textbf{j}}X_2 +{\textbf{k}}Y_2\) be a bicomplex weight filter matrix, and \(Z =x_1 +{\textbf{i}}y_1 +{\textbf{j}}x_2 +{\textbf{k}}y_2\) the bicomplex input vector. The bicomplex convolution \(W \otimes Z\) is defined as follows:

$$\begin{aligned} W \otimes Z= & {} \left( X_1x_1 -Y_1y_1 -X_2x_2 +Y_2y_2 \right) +{\textbf{i}}\left( X_1y_1 +Y_1x_1 -X_2y_2 -Y_2x_2 \right) \nonumber \\{} & {} +{\textbf{j}}\left( X_1x_2 +X_2x_1 -Y_1y_2 -Y_2y_1 \right) +{\textbf{k}}\left( X_1y_2 +Y_1x_2 +X_2y_1 +Y_2x_1 \right) \nonumber \\ \end{aligned}$$
(4.2)

and can thus be expressed in a matrix form following the matrix representation (2.2):

$$\begin{aligned} W \otimes Z =\begin{pmatrix} X_1 &{} -Y_1 &{} -X_2 &{} Y_2 \\ Y_1 &{} X_1 &{}-Y_2 &{} -X_2 \\ X_2 &{} -Y_2 &{} X_1 &{} -Y_1 \\ Y_2 &{} X_2 &{} Y_1 &{} X_1\end{pmatrix} *\begin{pmatrix} x_1 \\ y_1 \\ x_2 \\ y_2\end{pmatrix} = \begin{pmatrix} x'_1 \\ {\textbf{i}}y'_1 \\ {\textbf{j}}x'_2\\ {\textbf{k}}y'_2 \end{pmatrix} \end{aligned}$$
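In practice, the bicomplex convolution can be built from four real-valued convolutions, one per component of the weight \(W\), combined according to (4.2). The sketch below (our own illustrative PyTorch code, assuming 2-D features with the channel axis split as \(x_1\,|\,y_1\,|\,x_2\,|\,y_2\)) shows one possible way to do this.

```python
# Bicomplex convolution layer built from four real convolutions, following (4.2)/(2.2).
import torch
import torch.nn as nn

class BicomplexConv2d(nn.Module):
    def __init__(self, in_bc_channels, out_bc_channels, kernel_size, **kw):
        super().__init__()
        conv = lambda: nn.Conv2d(in_bc_channels, out_bc_channels, kernel_size, bias=False, **kw)
        # One real kernel per component of the bicomplex weight W = X1 + i Y1 + j X2 + k Y2.
        self.X1, self.Y1, self.X2, self.Y2 = conv(), conv(), conv(), conv()

    def forward(self, z):
        x1, y1, x2, y2 = torch.chunk(z, 4, dim=1)
        r = self.X1(x1) - self.Y1(y1) - self.X2(x2) + self.Y2(y2)   # real part
        i = self.X1(y1) + self.Y1(x1) - self.X2(y2) - self.Y2(x2)   # i part
        j = self.X1(x2) + self.X2(x1) - self.Y1(y2) - self.Y2(y1)   # j part
        k = self.X1(y2) + self.Y1(x2) + self.X2(y1) + self.Y2(x1)   # k part
        return torch.cat([r, i, j, k], dim=1)
```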

A suitable initialization scheme improves neural network convergence and reduces the risk of exploding and vanishing gradients. However, bicomplex numbers cannot be initialized component-wise as in the traditional real-valued case. The reason for this lies in the specific bicomplex algebra and the interaction between the components. Based on the ideas presented in [16, 17], a weight component w of the weight matrix W can be sampled as follows:

$$\begin{aligned}{} & {} w_0 =\lambda \,\cos \left( \theta \right) \nonumber \\{} & {} w_{{\textbf{i}}}=\lambda \,\widetilde{Z}_{{\textbf{i}}} \,\sin \left( \theta \right) , \qquad w_{{\textbf{j}}}=\lambda \,\widetilde{Z}_{{\textbf{j}}} \,\sin \left( \theta \right) , \qquad w_{{\textbf{k}}}=\lambda \,\widetilde{Z}_{{\textbf{k}}} \,\sin \left( \theta \right) . \end{aligned}$$
(4.3)

The angle \(\theta \) is randomly generated in the interval \(\left[ -\pi , \pi \right] \). The bicomplex number \(\widetilde{Z}\) is defined as a normalized purely imaginary bicomplex number, and is expressed as \(\widetilde{Z} =0 +{\textbf{i}}\widetilde{Z}_{{\textbf{i}}} +{\textbf{j}}\widetilde{Z}_{{\textbf{j}}} +{\textbf{k}}\widetilde{Z}_{{\textbf{k}}}\). The imaginary components \({\textbf{i}}y_1, {\textbf{j}}x_2\), and \({\textbf{k}}y_2\) are sampled from the uniform distribution on \(\left[ 0,1 \right] \) to obtain Z, which is then normalized via (2.1) to obtain \(\widetilde{Z}\). The parameter \(\lambda \) is sampled from \(\left[ -\sigma , \sigma \right] \), where (see [16, 17])

$$\begin{aligned} \sigma =\frac{1}{\sqrt{2\left( n_{in} +n_{out} \right) }}, \qquad \qquad \text { and } \qquad \qquad \sigma =\frac{1}{\sqrt{2n_{in}}}, \end{aligned}$$

with \(n_{in}\) and \(n_{out}\) the number of neurons in the input and output layers, respectively.
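The following short sketch (an illustration under our own assumptions, sampling each bicomplex weight independently) shows how the initialization (4.3) can be realized in code.

```python
# Sampling one bicomplex weight according to (4.3).
import numpy as np

def bicomplex_init(n_in, n_out, criterion="glorot", rng=np.random.default_rng()):
    # The two sigma values correspond to the two criteria quoted from [16, 17].
    sigma = 1.0 / np.sqrt(2 * (n_in + n_out)) if criterion == "glorot" else 1.0 / np.sqrt(2 * n_in)
    lam = rng.uniform(-sigma, sigma)
    theta = rng.uniform(-np.pi, np.pi)
    imag = rng.uniform(0.0, 1.0, size=3)      # i, j, k components before normalization
    z_tilde = imag / np.linalg.norm(imag)     # normalized purely imaginary part, via (2.1)
    w0 = lam * np.cos(theta)
    wi, wj, wk = lam * z_tilde * np.sin(theta)
    return w0, wi, wj, wk
```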

5 Numerical Examples

In this final section, we present a simple numerical implementation where we consider the Bessel-type activation function (4.1) and compare its behaviour with that of the classical ReLU activation function, in order to show the effectiveness of our approach.

In our numerical simulation, we consider the Colored MNIST dataset and a BCCNN as a baseline model. The MNIST dataset consists of handwritten digits and contains a training set of 60,000 examples and a test set of 10,000 examples. Each sample is a \(28\times 28\) pixel image of a digit 0–9, with pixel values ranging from 0 to 255. To obtain the Colored MNIST, we render the training images using a colour map and its reversed version. We emphasize that this colorized version of the MNIST dataset is more difficult for the network to learn (Fig. 1).

Fig. 1: First 25 images of the colorised MNIST
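One possible way to build such a colorized dataset is sketched below; the exact colour map used for Fig. 1 is not specified in the text, so "viridis" and its reversed version are assumptions of this sketch.

```python
# Building a colored MNIST by passing grayscale digits through a colour map.
import numpy as np
import matplotlib.pyplot as plt
from torchvision.datasets import MNIST

mnist = MNIST(root="data", train=True, download=True)
cmap, cmap_r = plt.get_cmap("viridis"), plt.get_cmap("viridis_r")

def colorize(img, reverse=False):
    g = np.asarray(img, dtype=np.float32) / 255.0      # 28 x 28 grayscale in [0, 1]
    rgb = (cmap_r if reverse else cmap)(g)[..., :3]    # apply colour map, drop alpha
    return rgb.astype(np.float32)                      # 28 x 28 x 3

# First 25 images, alternating between the colour map and its reversed version (cf. Fig. 1).
colored_25 = [colorize(img, reverse=(idx % 2 == 1)) for idx, (img, _) in zip(range(25), mnist)]
```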

The BCCNN model takes into account (4.3) and is built as follows: a convolutional group composed of 2 convolutional layers, where the first layer has 1 convolutional filter as input and 25 convolutional filters as output, the second layer has 25 filters as input and 50 filters as output, and each filter has a kernel size of \(3\times 3\). After the convolutional layers, we have a fully connected layer with 28,800 input units and 100 output units, followed by a final layer with 100 input units and 10 output units, which gives the final prediction for the 10 classes of the Colorised MNIST dataset. We use ReLU and (4.1) as activation functions, with the exception of the last layer, where we use a LogSoftmax activation. We employ the negative log-likelihood loss (NLLLoss) and the Adam algorithm as optimiser. For the learning rate, we opted to use a dynamic value, which is reduced when the loss metric has stopped improving (also known as ReduceLROnPlateau). As the initial learning rate value, we follow the guidelines from [22] and choose the value where the gradient towards the minimum loss value is steepest, which in our case was found to be around \(1.8\times 10^{-3}\).
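The sketch below assembles the pieces described above into a PyTorch model (reusing `BicomplexConv2d` and `BicomplexSplitActivation` from the previous sketches). It is only an approximation of the baseline: details not given in the text, such as how the coloured image is mapped to the four bicomplex components, the padding, and the exact flattened size, are assumptions of this sketch.

```python
# Illustrative assembly of the BCCNN baseline and its training objects.
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    BicomplexConv2d(1, 25, kernel_size=3),    # 1 -> 25 filters per bicomplex component
    BicomplexSplitActivation(),               # Bessel activation (4.1); swap for nn.ReLU() in the ReLU baseline
    BicomplexConv2d(25, 50, kernel_size=3),   # 25 -> 50 filters per bicomplex component
    BicomplexSplitActivation(),
    nn.Flatten(),
    # The text reports 28,800 (= 50 * 24 * 24) input units; the extra factor 4 below comes
    # from this sketch stacking the four bicomplex components as separate channels.
    nn.Linear(4 * 50 * 24 * 24, 100),
    BicomplexSplitActivation(),
    nn.Linear(100, 10),
    nn.LogSoftmax(dim=1),                     # final 10-class prediction
)

criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=1.8e-3)   # initial learning rate from the text
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min")
```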

Fig. 2: BCCNN model for the baseline models in the colored MNIST dataset

In Fig. 2, we show the performance of the baseline BCCNN model with ReLU as activation function. In Fig. 2A, the dashed red line highlights the learning rate where the gradient towards the minimum loss value is steepest, in this case \(2.031\times 10^{-4}\). In Fig. 2B, the continuous (resp. dot-dashed) line shows the loss (resp. accuracy) for the BCCNN model with the ReLU activation function. These results will serve as a benchmark against which to test the proposed new activation functions. We now consider (4.1) as activation function and observe the behaviour of the BCCNN.

Fig. 3: Performance of the baseline model for the FC models with Bessel type activation functions

In Fig. 3, the orange (resp. blue) continuous line corresponds to the training (resp. validation) phase for the activation function \({\mathcal {F}}\left( x \right) \) with \(\beta _{i} = 1\), while the dot-dashed green (resp. red) line corresponds to the results for the baseline model with the ReLU activation function in the training (resp. validation) phase. From the analysis of Fig. 3, we see that in the case where all \(\beta _i\) in (4.1) are equal to one (see Fig. 3B), the BCCNN gives poor classification accuracy and also shows an essentially constant behaviour. If we instead let the values of \(\beta _i\) be learned by the BCCNN as additional parameters during the training phase, we obtain a better result, as displayed in Fig. 3C: although the accuracy on the validation dataset stays around 90%, the maximum accuracy is already reached around epoch 20, which shows the advantage of this activation over the traditional ReLU activation.

6 Conclusions

In this paper, we considered bicomplex neural networks with an activation function of Bessel type. This new type of activation function leads to better results when compared with the corresponding ones obtained with ReLU. Our numerical experiments reveal that Bessel-type functions combine, in a single activation function, the characteristics of the ReLU and sinusoidal activation functions. In fact, as indicated in the manuscript, when \(\nu \) is a positive half-integer, the Bessel function reduces to a combination of trigonometric and polynomial functions. Compared with the ReLU activation function, Bessel-type functions reach high levels of accuracy more rapidly. Moreover, due to the influence of the sinusoidal component, Bessel-type activation functions have a lower saturation point than the ReLU activation function.

In future work, it would be interesting to consider bicomplex neural networks in more challenging classification tasks, such as the classification of clinical images. Another possible direction consists in considering this new activation function in the quaternionic case, the hyperbolic case, as well as in the case of higher-dimensional hypercomplex algebras, commutative or not. The consideration of these higher-dimensional algebras can simplify implementations and reduce errors. Hypercomplex-valued NNs allow the accumulation of several complex variables into a single-variable theory, which can reduce calculations and improve the accuracy of the algorithms.