1 Introduction

Convolutional Neural Networks (CNNs) are among the most widely used tools in artificial intelligence. In recent years, they have become a key tool in numerous fields such as image classification, face recognition and machine translation (see [13] and references therein). The correct choice of the activation function can significantly affect the performance of a CNN. Sometimes, a chosen activation function does not possess all the properties and characteristics required by a specific CNN. Usually, the selection of activation functions is manual and relies essentially on the architecture of the Neural Network (NN), which leads to exhaustive trial-and-error methodologies, where the NN is retrained for each activation function until an optimal configuration is found. Independently of the considered approach, it is clear that the task of introducing new activation functions is not easy and its benefits are sometimes limited. This limitation comes from the fact that many of the proposals presented in the literature have proved to be inconsistent and task-dependent, and therefore the classical activation functions (for example, the ReLU) maintain their predominance in practical applications.

The algebra of quaternions is the best-known extension of the field of complex numbers to the four-dimensional setting. On one hand, one of the main advantages of this extension is that the quaternions form a division algebra in which all the customary arithmetic operations are available. On the other hand, quaternion multiplication is not commutative, which creates several unwanted problems when extending the theory of holomorphic functions of one complex variable. A possible way to overcome the issue of non-commutativity is to consider the bicomplex numbers, a four-dimensional commutative algebra with the classical operations that contains \(\mathbb {C}\) as a subalgebra. Just as a complex number can be treated as two independent real numbers, a bicomplex number can be treated as two complex numbers or four real numbers.

The bicomplex algebra can be applied in several fields. For example, in [18], the authors point out that the so-called IHS-colour space representation (i.e., Intensity–Hue–Saturation), which has broad applications, particularly in human vision, can be mathematically represented by bicomplex-valued quantities. In [8] the authors take advantage of the idempotent representation of the bicomplex algebra to prove that a bicomplex sparse signal can be reconstructed, with high probability, from a reduced number of bicomplex random samples. Moreover, the availability of four separate components allows the consideration of three- and four-dimensional input feature vectors, which arise in image processing and robot kinematics [6, 21], and also in the field of capsule networks [20]. Thereby, bicomplex neural network-based models are able to encode latent interdependencies between groups of input features during the learning process with fewer parameters than traditional NNs, by taking advantage of the component-wise multiplication available for bicomplex numbers. The bicomplex approach gives a better way of combining two complex-valued networks than the usual two-valued approach in the literature [3], as we can treat the theory as a single-variable one due to its algebra structure. In our approach, the structure of the bicomplex algebra is essential in providing a convergent algorithm. For more details about the theory of bicomplex numbers and its applications we refer the interested reader to [4, 9, 10, 12].

The objective of this paper is to introduce a BCCNN where a generic activation function of Bessel type is considered. Our purpose is to generalize the results presented in the literature for the quaternionic case and to present an application of the novel activation functions of hypergeometric type introduced in [25]. The structure of the paper reads as follows: Sect. 2 is dedicated to recalling some basic definitions concerning bicomplex numbers and special functions, which are necessary for the understanding of this work. In Sect. 3, based on the ideas presented in [25], we introduce a general family of hypergeometric activation functions with several trainable parameters. In the following section, we describe our BCCNN. In the last section, we perform some numerical experiments using the Colored MNIST dataset to validate our approach.

2 Preliminaries

2.1 Bicomplex Algebra

The set \(\mathbb{B}\mathbb{C}\) of bicomplex numbers is defined by

$$\begin{aligned} \mathbb{B}\mathbb{C}:=\left\{ z_1+{\textbf{j}}z_2: \,z_1, z_2 \in \mathbb {C}\right\} , \end{aligned}$$

where \(\mathbb {C}\) is the set of complex numbers with the imaginary unit \({\textbf{i}}\), and where \({\textbf{i}}\) and \({\textbf{j}}\) are commuting imaginary units, i.e., \({\textbf{i}}{\textbf{j}}={\textbf{k}}={\textbf{j}}{\textbf{i}}\), \({\textbf{i}}^2 ={\textbf{j}}^2 =-1\), and \({\textbf{k}}^2 =+1\). Bicomplex numbers can be added and multiplied: if \(Z=z_1 +{\textbf{j}}z_2\) and \(W =w_1 +{\textbf{j}}w_2\) are two bicomplex numbers, we have that

$$\begin{aligned}&Z +W := \left( z_1 +w_1 \right) +{\textbf{j}}\left( z_2 +w_2 \right) , \\&Z \cdot W :=\left( z_1w_1 -z_2w_2 \right) +{\textbf{j}}\left( z_1w_2 +z_2w_1 \right) , \end{aligned}$$

where the product is not component-wise. From the previous multiplication rules involving \({\textbf{i}}\), \({\textbf{j}}\), and \({\textbf{k}}\), we decompose \(\mathbb{B}\mathbb{C}\) in two ways. Indeed, \(\mathbb{B}\mathbb{C}=\mathbb {C}\left( {\textbf{i}} \right) +{\textbf{j}}\mathbb {C}\left( {\textbf{i}} \right) \), where

$$\begin{aligned} \mathbb {C}\left( {\textbf{i}} \right) =\left\{ Z=z_1+{\textbf{j}}z_2: \,z_2=0\right\} . \end{aligned}$$

Likewise, we have \(\mathbb{B}\mathbb{C}=\mathbb {C}\left( {\textbf{j}} \right) +{\textbf{i}}\mathbb {C}\left( {\textbf{j}} \right) \). Moreover, if we consider \(z_1=x_1+{\textbf{i}}y_1\) and \(z_2 =x_2 +{\textbf{i}}y_2\), with \(x_1, y_1, x_2, y_2 \in \mathbb {R}\), we have the following alternative form of presenting a bicomplex number

$$\begin{aligned} Z \,=z_1 +{\textbf{j}}z_2 \,=x_1 +{\textbf{i}}y_1 +{\textbf{j}}x_2 +{\textbf{k}}y_2. \end{aligned}$$

The structure of \(\mathbb{B}\mathbb{C}\) suggests three possible conjugations on \(\mathbb{B}\mathbb{C}\):

  • the bar-conjugation: \({\overline{Z}} ={\overline{z}}_1 +{\textbf{j}}{\overline{z}}_2\);

  • the \(\dagger \)-conjugation: \(Z^\dagger =z_1-{\textbf{j}}z_2\);

  • the \(*\)-conjugation: \(Z^*=\overline{Z^\dagger } =\left( {\overline{Z}} \right) ^\dagger ={\overline{z}}_1 -{\textbf{j}}{\overline{z}}_2\),

where \({\overline{z}}_1\) and \({\overline{z}}_2\) are the usual complex conjugates of \(z_1, z_2 \in \mathbb {C}\left( {\textbf{i}} \right) \). The Euclidean norm of a bicomplex number Z on \(\mathbb{B}\mathbb{C}\), when it is seen as

$$\begin{aligned} \mathbb {C}^2\left( {\textbf{i}} \right) \,:=\mathbb {C}\left( {\textbf{i}} \right) \times \mathbb {C}\left( {\textbf{i}} \right) \,:=\left\{ \left( z_1,z_2 \right) : \,z_1 +{\textbf{j}}z_2 \in \mathbb{B}\mathbb{C}\right\} \end{aligned}$$

or as

$$\begin{aligned} \mathbb {R}^4 =\left\{ \left( x_1, y_1, x_2, y_2 \right) : \, \left( x_1 +{\textbf{i}}y_1 \right) +{\textbf{j}}\left( x_2+{\textbf{i}}y_2 \right) \in \mathbb{B}\mathbb{C}\right\} \end{aligned}$$

is given by

$$\begin{aligned} \left\| Z\right\| =\sqrt{\left| z_1\right| ^2 +\left| z_2\right| ^2} = \sqrt{x_1^2 +y_1^2 +x_2^2 +y_2^2}. \end{aligned}$$
(2.1)

The bicomplex space, \(\mathbb{B}\mathbb{C}\), is not a division algebra, and it has two distinguished zero divisors, namely, \({\textbf{e}}_1\) and \({\textbf{e}}_2\), which are idempotent, linearly independent over the reals, and mutually annihilating with respect to the bicomplex multiplication (see [12, Prop. 1.6.1]):

$$\begin{aligned}&{\textbf{e}}_1:=\frac{1+{\textbf{k}}}{2}, \qquad {\textbf{e}}_2:=\frac{1-{\textbf{k}}}{2}, \\&{\textbf{e}}_1\cdot {\textbf{e}}_2=0, \qquad {\textbf{e}}_1^2={\textbf{e}}_1, \qquad {\textbf{e}}_2^2={\textbf{e}}_2, \qquad {\textbf{e}}_1+{\textbf{e}}_2=1, \qquad {\textbf{e}}_1-{\textbf{e}}_2={\textbf{k}}. \end{aligned}$$

We have that \(\left\{ {\textbf{e}}_1, {\textbf{e}}_2\right\} \) forms an idempotent basis of the bicomplex algebra \(\mathbb{B}\mathbb{C}\). Considering the complex numbers \(\beta _1 := z_1-{\textbf{i}}z_2\) and \(\beta _2 :=z_1 +{\textbf{i}}z_2\) in \(\mathbb {C}\left( {\textbf{i}} \right) \), we have that the idempotent representation of \(Z=z_1+{\textbf{j}}z_2 \in \mathbb{B}\mathbb{C}\) is given by \(Z=\beta _1{\textbf{e}}_1+\beta _2{\textbf{e}}_2\). This idempotent representation is the only representation for which multiplication is component-wise, as indicated in the next proposition.

Proposition 2.1

(cf. [12, Prop. 1.6.3]) The addition and multiplication of bicomplex numbers can be realized component-wise in the idempotent representation presented previously. Specifically, if \(Z=a_1{\textbf{e}}_1+a_2{\textbf{e}}_2\) and \(W=b_1{\textbf{e}}_1+b_2{\textbf{e}}_2\) are two bicomplex numbers, where \(a_1, a_2, b_1, b_2 \in \mathbb {C}\left( {\textbf{i}} \right) \), then

$$\begin{aligned}&Z+W=\left( a_1+b_1 \right) {\textbf{e}}_1+\left( a_2+b_2 \right) {\textbf{e}}_2, \\&Z \cdot W=\left( a_1b_1 \right) {\textbf{e}}_1+\left( a_2b_2 \right) {\textbf{e}}_2, \\&Z^n =a_1^n{\textbf{e}}_1+a_2^n{\textbf{e}}_2. \end{aligned}$$

The multiplicative inverse of a bicomplex number \(Z=a_1{\textbf{e}}_1+a_2{\textbf{e}}_2\), with \(a_1 \cdot a_2 \ne 0\), is given by \(Z^{-1} =a_1^{-1}{\textbf{e}}_1+a_2^{-1}{\textbf{e}}_2\), where \(a_1^{-1}\) and \(a_2^{-1}\) are the complex multiplicative inverses of \(a_1\) and \(a_2\), respectively (see [12, Thm. 1.6.5]). A bicomplex number may also be identified with a real \(4 \times 4\) matrix (which turns out to be more suitable for computations):

$$\begin{aligned} \varphi _{\mathbb {R}}: \,Z =x_1 +{\textbf{i}}y_1 +{\textbf{j}}x_2 +{\textbf{k}}y_2 \in \mathbb{B}\mathbb{C}\mapsto \begin{pmatrix} x_1 &{} -y_1 &{} -x_2 &{} y_2 \\ y_1 &{} x_1 &{}-y_2 &{} -x_2 \\ x_2 &{} -y_2 &{} x_1 &{} -y_1 \\ y_2 &{} x_2 &{} y_1 &{} x_1 \end{pmatrix}. \end{aligned}$$
(2.2)
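The following minimal numerical sketch (not part of the original paper) illustrates the arithmetic of this subsection: a bicomplex number is stored as a pair \((z_1, z_2)\) of complex numbers, and the product is computed directly, via the idempotent representation, and via the real matrix representation (2.2); all three agree.

```python
# Bicomplex arithmetic sketch: direct product, idempotent product, matrix representation.
import numpy as np

def bc_mul(Z, W):
    """(z1 + j z2)(w1 + j w2) = (z1 w1 - z2 w2) + j (z1 w2 + z2 w1)."""
    (z1, z2), (w1, w2) = Z, W
    return (z1 * w1 - z2 * w2, z1 * w2 + z2 * w1)

def to_idempotent(Z):
    """beta1 = z1 - i z2, beta2 = z1 + i z2, so that Z = beta1 e1 + beta2 e2."""
    z1, z2 = Z
    return (z1 - 1j * z2, z1 + 1j * z2)

def from_idempotent(B):
    """Invert the idempotent representation: z1 = (b1 + b2)/2, z2 = (b2 - b1)/(2i)."""
    b1, b2 = B
    return ((b1 + b2) / 2, (b2 - b1) / 2j)

def real_matrix(Z):
    """Real 4x4 matrix of Z = x1 + i y1 + j x2 + k y2, as in (2.2)."""
    z1, z2 = Z
    x1, y1, x2, y2 = z1.real, z1.imag, z2.real, z2.imag
    return np.array([[x1, -y1, -x2,  y2],
                     [y1,  x1, -y2, -x2],
                     [x2, -y2,  x1, -y1],
                     [y2,  x2,  y1,  x1]])

Z, W = (1 + 2j, 3 - 1j), (0.5 - 1j, 2 + 0.5j)
P = bc_mul(Z, W)
# Component-wise product in the idempotent basis (Proposition 2.1).
P_idem = from_idempotent([bz * bw for bz, bw in zip(to_idempotent(Z), to_idempotent(W))])
# Matrix representation acting on the real components of W.
vecW = np.array([W[0].real, W[0].imag, W[1].real, W[1].imag])
vecP = np.array([P[0].real, P[0].imag, P[1].real, P[1].imag])
print(np.allclose(P, P_idem), np.allclose(real_matrix(Z) @ vecW, vecP))  # True True
```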

Every \(4 \times 4\) matrix determines a linear (more exactly, a real-linear) transformation of \(\mathbb {R}^4\); however, not all of them remain \(\mathbb{B}\mathbb{C}\)-linear when \(\mathbb {R}^4\) is identified with \(\mathbb{B}\mathbb{C}\). In the context of bicomplex convolutional neural networks, some activation functions involving bicomplex numbers have been proposed in the literature. For example, in [4] the authors considered the activation function \(P\left( z \right) =\epsilon ^l\) in \(T \subset \mathbb {C}^n\), where \(\epsilon =\exp \left( \frac{2\pi {\textbf{i}}}{k} \right) \) is the k-th root of unity, and l is such that \(\frac{2\pi l}{k} \le \arg \left( z \right) <\frac{2\pi \left( l+1 \right) }{k}\).

2.2 Special Functions

In this work, we make use of the generalized hypergeometric function \({}_pF_q\), which is defined by (see [19])

$$\begin{aligned} ^{}_{p}F_{q}\left( {a_1, \,\ldots , \,a_p; \,b_1, \ldots , \,b_q; \,z}\right)= & {} ^{}_{p}F_{q}\left( {\left( a_j \right) _{1:p}; \,\left( b_i \right) _{1:q}; \,z}\right) \nonumber \\= & {} \sum _{l=0}^{+\infty } \frac{\prod _{j=1}^{p}\left( {a_j}\right) _{l}}{\prod _{i=1}^{q}\left( {b_i}\right) _{l}} \,\frac{z^l}{l!}, \end{aligned}$$
(2.3)

where the convergence is guaranteed if one of the following conditions is satisfied:

$$\begin{aligned}{} & {} p \le q; \nonumber \\{} & {} q=p-1 \quad \wedge \quad |z|<1; \nonumber \\{} & {} q=p-1 \quad \wedge \quad \textrm{Re}\left( \sum _{i=1}^{p-1}b_i -\sum _{j=1}^{p}a_j \right) >0 \quad \wedge \quad |z|=1. \end{aligned}$$
(2.4)

Moreover, (2.3) is an analytic function of \(a_1, \ldots , a_p\), \(b_1, \ldots , b_q\) and z, defined on \(\mathbb {C}^{p+q+1}\). In the case \(p \le q\), for fixed \(a_1, \ldots , a_p\), \(b_1, \ldots , b_q\), it is an entire function of z. If one of the parameters \(a_j\) is a negative integer, the function (2.3) degenerates into a polynomial in z.

Another special function that will play an important role in this work is the Bessel function of the first kind \(J_\nu \), which is defined by the following series (see [1])

$$\begin{aligned} J_{\nu }\left( z \right) =\sum _{k=0}^{+\infty } \frac{\left( -1 \right) ^k}{\Gamma \left( k +\nu +1 \right) \,k!} \,\left( \frac{z}{2} \right) ^{2k+\nu }. \end{aligned}$$
(2.5)

The Bessel function is related to the hypergeometric function \({}_0F_1\) by the following relation (see [1])

$$\begin{aligned} J_{\nu }\left( z \right) =\frac{1}{\Gamma \left( 1+\nu \right) } \left( \frac{z}{2} \right) ^\nu \,^{}_{0}F_{1}\left( {; \,1+\nu ; \,-\frac{z^2}{4}}\right) , \end{aligned}$$
(2.6)

which holds whenever \(\nu \) is not a negative integer. For more details about hypergeometric functions and other special functions, we refer, for example, to [1, 2, 14].
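As a quick illustration (not part of the original paper), relation (2.6) can be checked numerically with SciPy:

```python
# Numerical check of (2.6): J_nu(z) = (z/2)^nu / Gamma(1+nu) * 0F1(; 1+nu; -z^2/4).
import numpy as np
from scipy.special import jv, hyp0f1, gamma

nu, z = 1.5, 0.8
lhs = jv(nu, z)
rhs = (z / 2) ** nu / gamma(1 + nu) * hyp0f1(1 + nu, -z**2 / 4)
print(np.isclose(lhs, rhs))  # True
```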

3 General Activation Functions

As mentioned in the introduction, the correct choice of the activation function can significantly affect the performance of a CNN. Sometimes, a chosen activation function does not possess all the properties and characteristics required by a specific CNN. Usually, the selection of activation functions is manual and relies essentially on the architecture of the NN, which leads to exhaustive trial-and-error methodologies, where the NN is retrained for each activation function until an optimal configuration is found.

In this sense, and following the ideas presented in [25], we consider in our CNN a general multi-parametric activation function in the context of automatic activation function design. Our work is inspired by the parametric activation functions presented in [7] and the adaptive activation functions introduced in [26], which we adapt to deal with hypergeometric functions. More precisely, we consider the following activation function:

$$\begin{aligned} {\mathcal {H}}\left( x \right) =c_1 +c_2 \,x^{c_3} +c_4 \,x^{c_5} \,^{}_{p}F_{q}\left( {\left( a_j \right) _{1:p}; \,\left( b_i \right) _{1:q}; \,c_6 \,x^{c_7}}\right) , \end{aligned}$$
(3.1)

where \(c_1, c_2, c_4, c_6 \in \mathbb {R}\), \(c_5 \in \mathbb {N}_0\), \(c_3,c_7 \in \mathbb {N}\), and the parameters in the hypergeometric function satisfy (2.4). Due to the large number of parameters, (3.1) can be used to approximate every continuous function on a compact set. Moreover, whenever convergence is guaranteed, it is possible to define sub-ranges of the several parameters that appear in (3.1) so that the elements of the proposed class have desirable properties for the role of an activation function.
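A minimal numerical sketch of (3.1) (not the authors' code) can be written with mpmath, whose `hyper` routine evaluates the generalized \({}_pF_q\); the parameter names follow (3.1) and the concrete values below are only an example.

```python
# Multi-parametric activation (3.1), evaluated with mpmath.hyper for the pFq term.
import math
from mpmath import hyper, mpf

def H(x, c, a, b):
    """c = (c1, ..., c7); a and b are the lists of upper and lower pFq parameters."""
    c1, c2, c3, c4, c5, c6, c7 = c
    return c1 + c2 * x**c3 + c4 * x**c5 * hyper(a, b, c6 * x**c7)

# Example: the Bessel-type choice (3.3) with nu = 1, i.e. H(x) = sqrt(pi/2) * x * J_1(x).
nu = 1
c = (0, 0, 1, math.sqrt(math.pi / 2) * 2**(-nu) / math.gamma(1 + nu), 2 * nu, -0.25, 2)
print(H(mpf("0.5"), c, [], [1 + nu]))
```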

The multi-parametric activation function (3.1) groups several of the standard activation functions proposed in the literature for deep NN. In Table 1, we indicate which cases are included in (3.1).

Table 1 General activation function \({\mathcal {H}}\left( x \right) \)

Moreover, [25] indicates in detail how the activation functions in Table 1 are derived from the general expression (3.1). For example, if we consider \(c_1=c_4=0\), \(c_3=1\), \(c_2=0\) (for \(x<0\)) or \(c_2=1\) (for \(x \ge 0\)), \(c_5, c_7 \in \mathbb {N}\), \(c_6 \in \mathbb {R}\), and \(p=q=0\), we obtain

$$\begin{aligned} {\mathcal {H}}\left( x \right) ={\left\{ \begin{array}{ll} 0 &{} \text{ for } x <0 \\ x &{} \text{ for } x\ge 0 \end{array}\right. }, \end{aligned}$$
(3.2)

which corresponds to the classical Rectified linear unit (ReLU).

Let us now focus on a particular case of (3.1) that involves the Bessel function of the first kind \(J_{\nu }\left( x \right) \), with \(\nu \) a positive half-integer. In fact, if we consider

$$\begin{aligned}{} & {} c_1=c_2=0, \quad c_4=\sqrt{\frac{\pi }{2}} \,\frac{2^{-\nu }}{\Gamma \left( 1+\nu \right) }, \quad c_5=2\nu , \nonumber \\{} & {} p=0, \quad q=1 \,\left( b_1=1+\nu \right) , \quad c_6=-\frac{1}{4}, \quad c_7=2, \end{aligned}$$
(3.3)

in (3.1), we obtain

$$\begin{aligned} {\mathcal {H}}\left( x \right) \,=\sqrt{\frac{\pi }{2}} \,\frac{2^{-\nu }}{\Gamma \left( 1+\nu \right) } \,x^{2\nu } \,^{}_{0}F_{1}\left( {-; \,1+\nu ; \,-\frac{x^2}{4}}\right) \,=\sqrt{\frac{\pi }{2}} \,x^\nu \,J_{\nu }\left( x \right) , \end{aligned}$$
(3.4)

which corresponds to a one-parameter activation function. It follows from the properties of the Bessel function of the first kind that for half-integer values of \(\nu \) the activation function (3.4) reduces to a combination of polynomials and elementary trigonometric functions such as \(\sin \) and \(\cos \). In fact, for the first four positive half-integers, we have that

$$\begin{aligned}{} & {} \nu =\frac{1}{2} \,\Rightarrow \,{\mathcal {H}}\left( x \right) \,=\sqrt{\frac{\pi }{2}} \,x^{\frac{1}{2}}J_{\frac{1}{2}}\left( x \right) \,=\sin \left( x \right) , \end{aligned}$$
(3.5)
$$\begin{aligned}{} & {} \nu =\frac{3}{2} \,\Rightarrow \,{\mathcal {H}}\left( x \right) \,=\sqrt{\frac{\pi }{2}} \,x^{\frac{3}{2}}J_{\frac{3}{2}}\left( x \right) \,=\sin \left( x \right) -x \,\cos \left( x \right) , \end{aligned}$$
(3.6)
$$\begin{aligned}{} & {} \nu =\frac{5}{2} \,\Rightarrow \,{\mathcal {H}}\left( x \right) \,=\sqrt{\frac{\pi }{2}} \,x^{\frac{5}{2}}J_{\frac{5}{2}}\left( x \right) \,=-\left( x^2-3 \right) \sin \left( x \right) -3x \,\cos \left( x \right) , \end{aligned}$$
(3.7)
$$\begin{aligned}{} & {} \nu =\frac{7}{2} \,\Rightarrow \,{\mathcal {H}}\left( x \right) \,=\sqrt{\frac{\pi }{2}} \,x^{\frac{7}{2}}J_{\frac{7}{2}}\left( x \right) \,=3\left( 5-2x^2 \right) \sin \left( x \right) +x\left( x^2-15 \right) \cos \left( x \right) . \nonumber \\ \end{aligned}$$
(3.8)
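These closed forms can be verified numerically (an illustrative check, not part of the original paper) against SciPy's Bessel routine:

```python
# Sanity check that (3.5)-(3.8) agree with sqrt(pi/2) * x^nu * J_nu(x).
import numpy as np
from scipy.special import jv

x = np.linspace(0.1, 6.0, 50)
closed_forms = {
    0.5: np.sin(x),
    1.5: np.sin(x) - x * np.cos(x),
    2.5: -(x**2 - 3) * np.sin(x) - 3 * x * np.cos(x),
    3.5: 3 * (5 - 2 * x**2) * np.sin(x) + x * (x**2 - 15) * np.cos(x),
}
for nu, expr in closed_forms.items():
    bessel_form = np.sqrt(np.pi / 2) * x**nu * jv(nu, x)
    print(nu, np.allclose(bessel_form, expr))  # all True
```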

Although (3.4) (and hence also (3.5)–(3.8)) is not monotonic on the whole positive real line, we can restrict our activation functions to intervals of the form \(I_\nu =[0,M_\nu ]\), where \(M_\nu \) is the first positive zero of the Bessel function \(J_{\nu -1}\left( x \right) \) and corresponds to the first positive maximum point of \(J_{\nu }\left( x \right) \). In order to improve our results (see [25]), we consider from now on the following linear combination of (3.5)–(3.8)

$$\begin{aligned} {\mathcal {B}}\left( x \right) \,=\sqrt{\frac{\pi }{2}} \left[ \beta _1 \,x^{\frac{1}{2}} \,J_{\frac{1}{2}}\left( x \right) +\beta _2 \,x^{\frac{3}{2}} \,J_{\frac{3}{2}}\left( x \right) +\beta _3 \,x^{\frac{5}{2}} \,J_{\frac{5}{2}}\left( x \right) +\beta _4 \,x^{\frac{7}{2}} \,J_{\frac{7}{2}}\left( x \right) \right] ,\nonumber \\ \end{aligned}$$
(3.9)

i.e., we combine the Bessel functions (3.5)–(3.8) with trainable parameters in order to dynamically adjust how much each Bessel function contributes to the final activation function.
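A possible PyTorch realization of (3.9) is sketched below (this is an illustrative implementation under our own naming conventions, not the authors' code); the half-integer closed forms (3.5)–(3.8) are used directly, and \(\beta _1,\ldots ,\beta _4\) are registered as trainable parameters.

```python
# Trainable Bessel-type activation (3.9), using the closed forms (3.5)-(3.8).
import torch
import torch.nn as nn

class BesselActivation(nn.Module):
    def __init__(self):
        super().__init__()
        # beta_i are learned jointly with the network weights; initialized to 1.
        self.beta = nn.Parameter(torch.ones(4))

    def forward(self, x):
        sin, cos = torch.sin(x), torch.cos(x)
        terms = torch.stack([
            sin,                                               # nu = 1/2, eq. (3.5)
            sin - x * cos,                                     # nu = 3/2, eq. (3.6)
            -(x**2 - 3) * sin - 3 * x * cos,                   # nu = 5/2, eq. (3.7)
            3 * (5 - 2 * x**2) * sin + x * (x**2 - 15) * cos,  # nu = 7/2, eq. (3.8)
        ])
        # Weighted sum of the four Bessel terms with the trainable coefficients.
        return torch.einsum('i,i...->...', self.beta, terms)
```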

4 Bicomplex Convolutional Neural Network

In this section, we define our BCCNN and an appropriate parameter initialization. The BCCNN can be understood as a generalization of the quaternionic convolutional neural network (QCNN) (see [5, 15,16,17]) and of the classical real-valued deep CNN (see [11]) to the case of bicomplex numbers. Taking into account [16, 17, 23] on quaternionic CNNs and the theory of bicomplex numbers [12], the bicomplex convolution operation is performed via the real matrix representation (2.2). Hence, the one-dimensional convolutional layer, with a kernel that contains the feature maps, is split into 4 parts: the first part equal to \(x_1\), the second one to \({\textbf{i}}y_1\), the third one to \({\textbf{j}}x_2\), and the last one to \({\textbf{k}}y_2\) of a bicomplex number \(Z =x_1 +{\textbf{i}}y_1 +{\textbf{j}}x_2 +{\textbf{k}}y_2\).

For the activation function, we consider a combination of the so-called split activation introduced in [24] for the quaternionic case with the real-valued activation function (3.9) defined in terms of Bessel functions, i.e.,

$$\begin{aligned} {\mathcal {F}}\left( Z \right) ={\mathcal {B}}\left( x_1 \right) +{\textbf{i}}{\mathcal {B}}\left( y_1 \right) +{\textbf{j}}{\mathcal {B}}\left( x_2 \right) +{\textbf{k}}{\mathcal {B}}\left( y_2 \right) . \end{aligned}$$
(4.1)
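A sketch of the split activation (4.1) is given below (our own illustrative code, reusing the `BesselActivation` module from the sketch after (3.9)); it assumes that the four real components \(x_1, y_1, x_2, y_2\) of a bicomplex feature map are stored as four equal blocks along the channel axis.

```python
# Split activation (4.1): apply B component-wise to the four bicomplex parts.
import torch
import torch.nn as nn

class BicomplexSplitActivation(nn.Module):
    def __init__(self):
        super().__init__()
        self.act = BesselActivation()  # trainable Bessel activation from the previous sketch

    def forward(self, z):
        # Channel axis assumed to be ordered as [x1 | y1 | x2 | y2].
        x1, y1, x2, y2 = torch.chunk(z, 4, dim=1)
        return torch.cat([self.act(c) for c in (x1, y1, x2, y2)], dim=1)
```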

Taking into account the properties of the Bessel functions and the ideas presented in [4], we can introduce the concept of threshold function associated with our activation function (4.1).

Definition 4.1

Let \(n \ge 1\) and \(T \subset \mathbb {C}^2\left( {\textbf{i}} \right) \). A complex-valued function \(f: T \rightarrow \mathbb {C}\) is called a threshold function if there exists a weighting vector \(W =\left( w_0, w_1, w_2 \right) \), with \(w_i \in \mathbb {C}\left( {\textbf{i}} \right) \), such that:

$$\begin{aligned} f\left( z_1, z_2 \right) ={\mathcal {F}}\left( w_0 +w_1z_1 +w_2z_2 \right) , \qquad z_1, z_2 \in T. \end{aligned}$$

Moreover, proceeding similarly as in the proof of Theorem 3.3 of [4], we have the following result:

Theorem 4.2

Let \(T \subset \mathbb {C}^2\left( {\textbf{i}} \right) \) be a bounded domain, \(f: T \rightarrow \mathbb {C}\) a threshold function, and \(\left( w_0, 0, 0 \right) \) a weighting vector of \(f\left( z_1,z_2 \right) \). Then, there exist \(w_0' \in \mathbb {C}\) and \(\delta >0\) such that \(\left( w_0', w_1, w_2 \right) \) is a weighting vector of f whenever \(\left| w_j\right| <\delta \), \(j=1, 2\).

A differentiable cost function guarantees that backward propagation can be performed. More precisely, the gradient of a loss function J is expressed with respect to each component of the bicomplex weights \(w^l\) that compose the matrix \(W^l\) at layer l, the output layer quantifying the error with respect to the target vector for each neuron. The convolution of a bicomplex filter matrix with a bicomplex vector is performed taking into account the previous multiplication rules. In fact, let \(W =X_1 +{\textbf{i}}Y_1 +{\textbf{j}}X_2 +{\textbf{k}}Y_2\) be a bicomplex weight filter matrix, and \(Z =x_1 +{\textbf{i}}y_1 +{\textbf{j}}x_2 +{\textbf{k}}y_2\) the bicomplex input vector. The bicomplex convolution \(W \otimes Z\) is defined as follows:

$$\begin{aligned} W \otimes Z= & {} \left( X_1x_1 -Y_1y_1 -X_2x_2 +Y_2y_2 \right) +{\textbf{i}}\left( X_1y_1 +Y_1x_1 -X_2y_2 -Y_2x_2 \right) \nonumber \\{} & {} +{\textbf{j}}\left( X_1x_2 +X_2x_1 -Y_1y_2 -Y_2y_1 \right) +{\textbf{k}}\left( X_1y_2 +Y_1x_2 +X_2y_1 +Y_2x_1 \right) \nonumber \\ \end{aligned}$$
(4.2)

and can thus be expressed in a matrix form following the matrix representation (2.2):

$$\begin{aligned} W \otimes Z =\begin{pmatrix} X_1 &{} -Y_1 &{} -X_2 &{} Y_2 \\ Y_1 &{} X_1 &{}-Y_2 &{} -X_2 \\ X_2 &{} -Y_2 &{} X_1 &{} -Y_1 \\ Y_2 &{} X_2 &{} Y_1 &{} X_1\end{pmatrix} *\begin{pmatrix} x_1 \\ y_1 \\ x_2 \\ y_2\end{pmatrix} = \begin{pmatrix} x'_1 \\ {\textbf{i}}y'_1 \\ {\textbf{j}}x'_2\\ {\textbf{k}}y'_2 \end{pmatrix} \end{aligned}$$
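In practice, the bicomplex convolution can be built from four real-valued convolutions, one per component of the weight \(W\), combined according to (4.2). The sketch below (our own illustrative PyTorch code, assuming 2-D features with the channel axis split as \(x_1\,|\,y_1\,|\,x_2\,|\,y_2\)) shows one possible way to do this.

```python
# Bicomplex convolution layer built from four real convolutions, following (4.2)/(2.2).
import torch
import torch.nn as nn

class BicomplexConv2d(nn.Module):
    def __init__(self, in_bc_channels, out_bc_channels, kernel_size, **kw):
        super().__init__()
        conv = lambda: nn.Conv2d(in_bc_channels, out_bc_channels, kernel_size, bias=False, **kw)
        # One real kernel per component of the bicomplex weight W = X1 + i Y1 + j X2 + k Y2.
        self.X1, self.Y1, self.X2, self.Y2 = conv(), conv(), conv(), conv()

    def forward(self, z):
        x1, y1, x2, y2 = torch.chunk(z, 4, dim=1)
        r = self.X1(x1) - self.Y1(y1) - self.X2(x2) + self.Y2(y2)   # real part
        i = self.X1(y1) + self.Y1(x1) - self.X2(y2) - self.Y2(x2)   # i part
        j = self.X1(x2) + self.X2(x1) - self.Y1(y2) - self.Y2(y1)   # j part
        k = self.X1(y2) + self.Y1(x2) + self.X2(y1) + self.Y2(x1)   # k part
        return torch.cat([r, i, j, k], dim=1)
```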

A suitable initialization scheme improves neural network convergence and reduces the risk of exploding and vanishing gradients. However, bicomplex numbers cannot be initialized component-wise as in the traditional real-valued case. The reason for this lies in the specific bicomplex algebra and the interaction between the components. Based on the ideas presented in [16, 17], a weight component w of the weight matrix W can be sampled as follows:

$$\begin{aligned}{} & {} w_0 =\lambda \,\cos \left( \theta \right) \nonumber \\{} & {} w_{{\textbf{i}}}=\lambda \,\widetilde{Z}_{{\textbf{i}}} \,\sin \left( \theta \right) , \qquad w_{{\textbf{j}}}=\lambda \,\widetilde{Z}_{{\textbf{j}}} \,\sin \left( \theta \right) , \qquad w_{{\textbf{k}}}=\lambda \,\widetilde{Z}_{{\textbf{k}}} \,\sin \left( \theta \right) . \end{aligned}$$
(4.3)

The angle \(\theta \) is randomly generated in the interval \(\left[ -\pi , \pi \right] \). The bicomplex number \(\widetilde{Z}\) is defined as a normalized purely imaginary bicomplex number, and is expressed as \(\widetilde{Z} =0 +{\textbf{i}}\widetilde{Z}_{{\textbf{i}}} +{\textbf{j}}\widetilde{Z}_{{\textbf{j}}} +{\textbf{k}}\widetilde{Z}_{{\textbf{k}}}\). The imaginary components \({\textbf{i}}y_1, {\textbf{j}}x_2\), and \({\textbf{k}}y_2\) are sampled from the uniform distribution on \(\left[ 0,1 \right] \) to obtain Z, which is then normalized via (2.1) to obtain \(\widetilde{Z}\). The parameter \(\lambda \) is sampled from \(\left[ -\sigma , \sigma \right] \), where (see [16, 17])

$$\begin{aligned} \sigma =\frac{1}{\sqrt{2\left( n_{in} +n_{out} \right) }}, \qquad \qquad \text { and } \qquad \qquad \sigma =\frac{1}{\sqrt{2n_{in}}}, \end{aligned}$$

with \(n_{in}\) and \(n_{out}\) the number of neurons in the input and output layers, respectively.
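The following short sketch (an illustration under our own assumptions, sampling each bicomplex weight independently) shows how the initialization (4.3) can be realized in code.

```python
# Sampling one bicomplex weight according to (4.3).
import numpy as np

def bicomplex_init(n_in, n_out, criterion="glorot", rng=np.random.default_rng()):
    # The two sigma values correspond to the two criteria quoted from [16, 17].
    sigma = 1.0 / np.sqrt(2 * (n_in + n_out)) if criterion == "glorot" else 1.0 / np.sqrt(2 * n_in)
    lam = rng.uniform(-sigma, sigma)
    theta = rng.uniform(-np.pi, np.pi)
    imag = rng.uniform(0.0, 1.0, size=3)      # i, j, k components before normalization
    z_tilde = imag / np.linalg.norm(imag)     # normalized purely imaginary part, via (2.1)
    w0 = lam * np.cos(theta)
    wi, wj, wk = lam * z_tilde * np.sin(theta)
    return w0, wi, wj, wk
```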

5 Numerical Examples

In this final section, we present a simple numerical implementation where we consider the Bessel-type activation function (4.1) and compare its behaviour with that of the classical ReLU activation function, in order to show the effectiveness of our approach.

In our numerical simulation, we consider the Colored MNIST dataset and a BCCNN as a baseline model. The MNIST dataset consists of handwritten digits and contains a training set of 60,000 examples and a test set of 10,000 examples. Each sample is a \(28\times 28\) pixel image of a digit 0–9, with pixel values ranging from 0 to 255. To obtain the Colored MNIST, we render the training images using a colour map and its reversed version. We emphasize that this colorized version of the MNIST dataset is more difficult for the network to learn (Fig. 1).

Fig. 1: First 25 images of the colorised MNIST
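One possible way to build such a colorized dataset is sketched below; the exact colour map used for Fig. 1 is not specified in the text, so "viridis" and its reversed version are assumptions of this sketch.

```python
# Building a colored MNIST by passing grayscale digits through a colour map.
import numpy as np
import matplotlib.pyplot as plt
from torchvision.datasets import MNIST

mnist = MNIST(root="data", train=True, download=True)
cmap, cmap_r = plt.get_cmap("viridis"), plt.get_cmap("viridis_r")

def colorize(img, reverse=False):
    g = np.asarray(img, dtype=np.float32) / 255.0      # 28 x 28 grayscale in [0, 1]
    rgb = (cmap_r if reverse else cmap)(g)[..., :3]    # apply colour map, drop alpha
    return rgb.astype(np.float32)                      # 28 x 28 x 3

# First 25 images, alternating between the colour map and its reversed version (cf. Fig. 1).
colored_25 = [colorize(img, reverse=(idx % 2 == 1)) for idx, (img, _) in zip(range(25), mnist)]
```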

The BCCNN model takes into account (4.3) and is built as follows: a convolutional group composed of 2 convolutional layers, where the first layer has 1 convolutional filter as input and 25 convolutional filters as output, the second layer has 25 filters as input and 50 filters as output, and each filter has a kernel size of \(3\times 3\). After the convolutional layers, we have a fully connected layer with 28,800 input units and 100 output units, followed by a final layer with 100 input units and 10 output units, which gives the final prediction for the 10 classes of the Colorised MNIST dataset. We use ReLU and (4.1) as activation functions, with the exception of the last layer, where we use a LogSoftmax activation. We employ the negative log-likelihood loss (NLLLoss) and the Adam algorithm as optimiser. For the learning rate, we opted to use a dynamic value, which is reduced when the loss metric has stopped improving (also known as ReduceLROnPlateau). As the initial learning rate value, we follow the guidelines from [22] and choose the value where the gradient towards the minimum loss value is steepest, which in our case was found to be around \(1.8\times 10^{-3}\).
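The sketch below assembles the pieces described above into a PyTorch model (reusing `BicomplexConv2d` and `BicomplexSplitActivation` from the previous sketches). It is only an approximation of the baseline: details not given in the text, such as how the coloured image is mapped to the four bicomplex components, the padding, and the exact flattened size, are assumptions of this sketch.

```python
# Illustrative assembly of the BCCNN baseline and its training objects.
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    BicomplexConv2d(1, 25, kernel_size=3),    # 1 -> 25 filters per bicomplex component
    BicomplexSplitActivation(),               # Bessel activation (4.1); swap for nn.ReLU() in the ReLU baseline
    BicomplexConv2d(25, 50, kernel_size=3),   # 25 -> 50 filters per bicomplex component
    BicomplexSplitActivation(),
    nn.Flatten(),
    # The text reports 28,800 (= 50 * 24 * 24) input units; the extra factor 4 below comes
    # from this sketch stacking the four bicomplex components as separate channels.
    nn.Linear(4 * 50 * 24 * 24, 100),
    BicomplexSplitActivation(),
    nn.Linear(100, 10),
    nn.LogSoftmax(dim=1),                     # final 10-class prediction
)

criterion = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=1.8e-3)   # initial learning rate from the text
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min")
```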

Fig. 2: BCCNN model for the baseline models in the colored MNIST dataset

In Fig. 2, we show the performance of the baseline BCCNN model with ReLU as activation function. In Fig. 2A, the dashed red line highlights the learning rate where the gradient towards the minimum loss value is steepest, in this case \(2.031\times 10^{-4}\). In Fig. 2B, the continuous (resp. dot-dashed) line shows the loss (resp. accuracy) for the BCCNN model with the ReLU activation function. These results will serve as a benchmark against which to test the proposed new activation functions. We now consider (4.1) as activation function and observe the behaviour of the BCCNN.

Fig. 3: Performance of the baseline model for the FC models with Bessel type activation functions

In Fig. 3, the orange (resp. blue) continuous line corresponds to the training (resp. validation) phase for the activation function \({\mathcal {F}}\left( x \right) \) with \(\beta _{i} = 1\), while the dot-dashed green (resp. red) line corresponds to the results for the baseline model with the ReLU activation function in the training (resp. validation) phase. From the analysis of Fig. 3, we see that in the case where all \(\beta _i\) in (4.1) are equal to one (see Fig. 3B), the BCCNN gives poor classification accuracy and also shows an essentially constant behaviour. If we instead let the values of \(\beta _i\) be learned by the BCCNN as additional parameters during the training phase, we obtain a better result, as displayed in Fig. 3C: although the accuracy on the validation dataset stays around 90%, the maximum accuracy is already reached around epoch 20, which shows the advantage of this activation over the traditional ReLU activation.

6 Conclusions

In this paper, we considered bicomplex neural networks with an activation function of Bessel type. This new type of activation function leads to better results when compared with the corresponding ones obtained with ReLU. Our numerical experiments reveal that Bessel-type functions combine, in a single activation function, the characteristics of the ReLU and sinusoidal activation functions. In fact, as indicated in the manuscript, when \(\nu \) is a positive half-integer, the Bessel function reduces to a combination of trigonometric and polynomial functions. Compared with the ReLU activation function, Bessel-type functions reach high levels of accuracy more rapidly. Moreover, due to the influence of the sinusoidal component, Bessel-type activation functions have a lower saturation point than the ReLU activation function.

In future work, it would be interesting to consider bicomplex neural networks in more challenging classification tasks, such as the classification of clinical images. Another possible direction consists in considering this new activation function in the quaternionic case, the hyperbolic case, as well as in the case of higher-dimensional hypercomplex algebras, commutative or not. The consideration of these higher-dimensional algebras can simplify implementations and reduce errors. Hypercomplex-valued NNs allow the accumulation of several complex variables into a single-variable theory, which can reduce calculations and improve the accuracy of the algorithms.