Bicomplex Neural Networks with Hypergeometric Activation Functions

Bicomplex convolutional neural networks (BCCNNs) are a natural extension of quaternion convolutional neural networks to the bicomplex case. As in the quaternionic case, a BCCNN has the capability of learning and modelling external dependencies that exist between neighbouring features of an input vector and internal latent dependencies within each feature. This property arises from the fact that, under certain circumstances, bicomplex numbers can be handled in a component-wise way. In this paper, we present a BCCNN and apply it to a classification task involving a colourized version of the well-known MNIST dataset. Besides the novelty of considering bicomplex numbers, our CNN uses a Bessel-type activation function. As we show, it yields better results than the same network equipped with the classical ReLU activation function.


Introduction
Convolutional Neural Networks (CNNs) are one of the most used tools in artificial intelligence. In recent years, they have become a key tool in numerous fields like image classification, face recognition and machine translation (see [13] and references therein). The correct choice of the activation function can significantly affect the performance of a CNN. Sometimes, a chosen activation function does not possess all the properties required by a specific CNN. Usually, the selection of activation functions is manual and relies essentially on the architecture of the Neural Network (NN), which leads to exhaustive trial-and-error methodologies in which the NN is retrained for each activation function until an optimal configuration is found. Independently of the approach considered, it is clear that the task of introducing new activation functions is certainly not easy, and its benefits are sometimes limited. This limitation comes from the fact that many of the proposals presented in the literature turn out to be inconsistent and task-dependent, and therefore the classical activation functions (for example, ReLU) maintain their predominance in practical applications.
The quaternions are the best-known extension of the complex numbers to the four-dimensional setting. On one hand, one of the main advantages of this extension is that quaternions form a division algebra in which we can consider all the customary operations. On the other hand, the quaternionic product is not commutative, which brings several unwanted problems when extending the theory of holomorphic functions of one complex variable. A possible way to overcome the issue of non-commutativity is to consider the bicomplex numbers, a four-dimensional algebra with the classical operations that contains C as a subalgebra and preserves commutativity. Just as a complex number can be treated component-wise as two real numbers, a bicomplex number can be treated as two complex numbers or four real numbers.
The bicomplex algebra can be applied in several fields. For example, in [18] the authors point out that the so-called IHS colour-space representation (i.e., Intensity-Hue-Saturation), which has broad applications, particularly in human vision, can be mathematically represented by values in the bicomplex numbers. In [8] the authors take advantage of the idempotent representation of the bicomplex algebra to prove that a bicomplex sparse signal can be reconstructed, with high probability, from a reduced number of bicomplex random samples. Moreover, the availability of four separate components allows the consideration of three- and four-dimensional input feature vectors, which arise in image processing and robot kinematics [6,21], and also in the field of capsule networks [20]. Thereby, bicomplex neural-network-based models are able to encode latent interdependencies between groups of input features during the learning process with fewer parameters than traditional NNs, by taking advantage of the component-wise multiplication that can be established between bicomplex numbers. The bicomplex approach gives a better way of combining two complex-valued networks than the usual two-network approach in the literature [3], as we can treat the theory as a single-variable one due to its algebraic structure. In our approach, the structure of the bicomplex algebra is essential in providing a convergent algorithm. For more details about the theory of bicomplex numbers and its applications, we refer the interested reader to [4,9,10,12].
The objective of this paper is to introduce a BCCNN in which a generic activation function of Bessel type is considered. Our aim is to generalize the results presented in the literature for the quaternionic case and to present an application of the novel activation functions of hypergeometric type introduced in [25]. The structure of the paper is as follows: Sect. 2 recalls some basic definitions concerning bicomplex numbers and special functions, which are necessary for the understanding of this work. In Sect. 3, based on the ideas presented in [25], we introduce a general family of hypergeometric activation functions with several trainable parameters. In the following section, we describe our BCCNN. In the last section, we perform some numerical experiments using the Colored MNIST dataset to validate our approach.

Bicomplex Algebra
The set BC of bicomplex numbers is defined by

BC := { Z = z_1 + j z_2 : z_1, z_2 ∈ C(i) },

where C(i) is the set of complex numbers with the imaginary unit i, and where i and j are commuting imaginary units, i.e., ij = ji = k, i^2 = j^2 = −1, and k^2 = +1. Bicomplex numbers can be added and multiplied: if Z = z_1 + j z_2 and W = w_1 + j w_2 are two bicomplex numbers, we have

Z + W = (z_1 + w_1) + j (z_2 + w_2),    Z W = (z_1 w_1 − z_2 w_2) + j (z_1 w_2 + z_2 w_1),

where the product is not component-wise. From the previous multiplication rules involving i, j, and k, we decompose BC in two ways. Indeed, BC = C(i) + j C(i), where C(i) = { x + i y : x, y ∈ R }. Likewise, we have BC = C(j) + i C(j). Moreover, if we consider z_1 = x_1 + i y_1 and z_2 = x_2 + i y_2, with x_1, y_1, x_2, y_2 ∈ R, we have the following alternative form of presenting a bicomplex number:

Z = x_1 + i y_1 + j x_2 + k y_2.

The structure of BC suggests three possible conjugations on BC:
• the bar-conjugation: Z̄ = z̄_1 + j z̄_2;
• the †-conjugation: Z† = z_1 − j z_2;
• the ∗-conjugation: Z∗ = z̄_1 − j z̄_2 (the composition of the previous two),
where z̄_1 and z̄_2 are the usual complex conjugates of z_1, z_2 ∈ C(i). The Euclidean norm of a bicomplex number Z, when BC is seen as C^2(i) or as R^4, is given by

|Z| = sqrt(|z_1|^2 + |z_2|^2) = sqrt(x_1^2 + y_1^2 + x_2^2 + y_2^2).   (2.1)

The bicomplex space BC is not a division algebra, and it has two distinguished zero divisors, namely

e_1 := (1 + k)/2,    e_2 := (1 − k)/2,

which are idempotent, linearly independent over the reals, and mutually annihilating with respect to the bicomplex multiplication (see [12, Prop. 1.6.1]):

e_1^2 = e_1,   e_2^2 = e_2,   e_1 e_2 = e_2 e_1 = 0,   e_1 + e_2 = 1.

We have that {e_1, e_2} forms an idempotent basis of the complex algebra BC.
Considering the complex numbers β_1 := z_1 − i z_2 and β_2 := z_1 + i z_2, every bicomplex number Z = z_1 + j z_2 can be written in the idempotent representation

Z = β_1 e_1 + β_2 e_2.

This idempotent representation is the only representation for which multiplication is component-wise, as indicated in the next proposition: if Z = a_1 e_1 + a_2 e_2, then

Z^n = a_1^n e_1 + a_2^n e_2,   n ∈ N.
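The component-wise multiplication can be sketched in a few lines of Python; in the following fragment (the function names are ours, not from the paper) a bicomplex number Z = z_1 + j z_2 is stored as the pair (z_1, z_2) of Python complex numbers and multiplied in the idempotent basis:

```python
def to_idempotent(z1, z2):
    """Map Z = z1 + j*z2 to its idempotent components (beta1, beta2)."""
    return z1 - 1j * z2, z1 + 1j * z2

def from_idempotent(b1, b2):
    """Recover (z1, z2) from Z = beta1*e1 + beta2*e2."""
    return (b1 + b2) / 2, 1j * (b1 - b2) / 2

def bc_mul(Z, W):
    """Bicomplex product, computed component-wise in the idempotent basis."""
    a1, a2 = to_idempotent(*Z)
    b1, b2 = to_idempotent(*W)
    return from_idempotent(a1 * b1, a2 * b2)
```

Since e_1 e_2 = 0, the idempotent components never mix under multiplication, which is exactly the content of the proposition above.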
The multiplicative inverse of a bicomplex number Z = a_1 e_1 + a_2 e_2, with a_1 · a_2 ≠ 0, is given by

Z^{−1} = a_1^{−1} e_1 + a_2^{−1} e_2,

where a_1^{−1} and a_2^{−1} are the complex multiplicative inverses of a_1 and a_2, respectively (see [12, Thm. 1.6.5]). A bicomplex number may also be identified with a real 4 × 4 matrix (which turns out to be more suitable for computations): multiplication by Z = x_1 + i y_1 + j x_2 + k y_2 acts on the coordinates (a, b, c, d) of W = a + i b + j c + k d as

M(Z) = [ x_1  −y_1  −x_2   y_2 ]
       [ y_1   x_1  −y_2  −x_2 ]
       [ x_2  −y_2   x_1  −y_1 ]
       [ y_2   x_2   y_1   x_1 ]   (2.2)

We have that every real 4 × 4 matrix determines a linear (more exactly, a real linear) transformation of R^4; however, not all of them remain BC-linear when R^4 is seen as BC. In the context of bicomplex convolutional neural networks, some activation functions involving bicomplex numbers have been proposed in the literature. For example, in [4] the authors considered, on T ⊂ C^n, the activation function P(z) = ε^l, where ε = exp(2πi/k) is the root of unity of order k, whenever 2πl/k ≤ arg z < 2π(l + 1)/k.
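As an illustration of the matrix representation (2.2), the following sketch (function names are ours) builds the 4 × 4 matrix of multiplication by Z and applies it to the coordinate vector of another bicomplex number:

```python
def bc_matrix(x1, y1, x2, y2):
    """Real 4x4 matrix (2.2) of multiplication by Z = x1 + i*y1 + j*x2 + k*y2,
    acting on the coordinates (a, b, c, d) of W = a + i*b + j*c + k*d."""
    return [
        [x1, -y1, -x2,  y2],
        [y1,  x1, -y2, -x2],
        [x2, -y2,  x1, -y1],
        [y2,  x2,  y1,  x1],
    ]

def matvec(M, v):
    """Apply a 4x4 matrix to a 4-vector."""
    return [sum(M[r][c] * v[c] for c in range(4)) for r in range(4)]
```

For instance, the product (1 + 2i + 3j + 4k)(5 + 6i + 7j + 8k) is obtained as `matvec(bc_matrix(1, 2, 3, 4), [5, 6, 7, 8])`, in agreement with the multiplication rules for i, j, and k.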

Special Functions
In this work, we make use of the generalized hypergeometric function _pF_q, which is defined by (see [19])

_pF_q(a_1, …, a_p; b_1, …, b_q; z) = Σ_{n=0}^∞ [(a_1)_n ⋯ (a_p)_n] / [(b_1)_n ⋯ (b_q)_n] · z^n / n!,   (2.3)

where (a)_n := a(a + 1) ⋯ (a + n − 1) is the Pochhammer symbol, and where the convergence is guaranteed if one of the following conditions is satisfied:

p ≤ q, for all z ∈ C,   or   p = q + 1, for |z| < 1.   (2.4)

If the parameters a_k include negative integers, the function (2.3) degenerates to a polynomial in z. Another special function that will play an important role in this work is the Bessel function of the first kind J_ν, which is defined by the following series (see [1]):

J_ν(z) = Σ_{m=0}^∞ (−1)^m / (m! Γ(m + ν + 1)) · (z/2)^{2m+ν}.   (2.5)

The Bessel function is related to the hypergeometric function _0F_1 by the following relation (see [1]):

J_ν(z) = (z/2)^ν / Γ(ν + 1) · _0F_1(; ν + 1; −z^2/4),

whenever ν is not a negative integer. For more details about hypergeometric functions and other special functions, we refer, for example, to [1,2,14].
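The relation between J_ν and _0F_1 is easy to check numerically by truncating both defining series; the following fragment (ours, stdlib only) does exactly that:

```python
import math

def hyp0f1(b, z, terms=40):
    """Truncated series for 0F1(; b; z) = sum_n z^n / ((b)_n n!)."""
    s, term = 0.0, 1.0
    for n in range(terms):
        s += term
        term *= z / ((b + n) * (n + 1))  # ratio of consecutive terms
    return s

def bessel_j(nu, z, terms=40):
    """Truncated series (2.5) for the Bessel function J_nu(z), z > 0."""
    s = 0.0
    for m in range(terms):
        s += (-1) ** m / (math.factorial(m) * math.gamma(m + nu + 1)) \
             * (z / 2) ** (2 * m + nu)
    return s
```

With these, `bessel_j(nu, z)` agrees with `(z/2)**nu / math.gamma(nu + 1) * hyp0f1(nu + 1, -z*z/4)` to numerical precision for moderate z.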

General Activation Functions
As discussed in the Introduction, the manual, trial-and-error selection of activation functions is costly and architecture-dependent. In this sense, and following the ideas presented in [25], we consider in our CNN a general multi-parametric activation function in the context of automatic activation function design. The concepts of parametric activation functions [7] and the adaptive activation functions introduced in [26] inspired our work, and were adapted to deal with hypergeometric functions. More precisely, we consider the activation function (3.1), where c_1, c_2, c_4, c_6 ∈ R, c_5 ∈ N_0, c_3, c_7 ∈ N, and the parameters in the hypergeometric function satisfy (2.4). Due to the large number of parameters, it is possible to use (3.1) to approximate every continuous function on a compact set. Moreover, in the case where the convergence is guaranteed, it is possible to define sub-ranges of the several parameters that appear in (3.1) so that the elements of the proposed class have desirable properties for the role of activation function. The multi-parametric activation function (3.1) groups several of the standard activation functions proposed in the literature for deep NNs. In Table 1, we indicate which cases are included in (3.1).
Moreover, in [25] it is shown in detail how the activation functions indicated in Table 1 are derived from the general expression (3.1). For example, if we consider c_1 = c_4 = 0, c_2 = 0 (for x < 0) or c_2 = 1 (for x ≥ 0), c_3, c_5, c_7 ∈ N, c_6 ∈ R, and p = q = 0, we obtain the classical Rectified Linear Unit (ReLU), ReLU(x) = max{0, x}.
Let us now pay attention to a particular case of (3.1) that involves the Bessel function of the first kind J_ν(x), with ν a positive half-integer. In fact, with a suitable choice of the parameters in (3.1), we obtain J_ν as activation function, and for half-integer ν it reduces to a combination of polynomials and the elementary trigonometric functions sin and cos. For the first four positive half-integers, we have

J_{1/2}(x) = sqrt(2/(πx)) sin(x),   (3.5)
J_{3/2}(x) = sqrt(2/(πx)) (sin(x)/x − cos(x)),   (3.6)
J_{5/2}(x) = sqrt(2/(πx)) ((3/x^2 − 1) sin(x) − (3/x) cos(x)),   (3.7)
J_{7/2}(x) = sqrt(2/(πx)) ((15/x^3 − 6/x) sin(x) − (15/x^2 − 1) cos(x)).   (3.8)

In order to improve our results (see [25]), we consider from now on the following linear combination of (3.5)-(3.8):

F(x) = β_1 J_{1/2}(x) + β_2 J_{3/2}(x) + β_3 J_{5/2}(x) + β_4 J_{7/2}(x),   (3.9)

i.e., we combine the Bessel functions (3.5)-(3.8) using trainable parameters β_i that dynamically weight the contribution of each Bessel function to the final activation function.
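The closed forms for the first four positive half-integer orders are cheap to evaluate. A minimal Python sketch follows (the β_i values passed below are illustrative defaults; in the network they are trainable):

```python
import math

def j_half(x):
    """J_{1/2}(x)."""
    return math.sqrt(2 / (math.pi * x)) * math.sin(x)

def j_3half(x):
    """J_{3/2}(x)."""
    return math.sqrt(2 / (math.pi * x)) * (math.sin(x) / x - math.cos(x))

def j_5half(x):
    """J_{5/2}(x)."""
    return math.sqrt(2 / (math.pi * x)) * ((3 / x**2 - 1) * math.sin(x)
                                           - 3 * math.cos(x) / x)

def j_7half(x):
    """J_{7/2}(x)."""
    return math.sqrt(2 / (math.pi * x)) * ((15 / x**3 - 6 / x) * math.sin(x)
                                           - (15 / x**2 - 1) * math.cos(x))

def bessel_activation(x, betas=(1.0, 1.0, 1.0, 1.0)):
    """Weighted combination of the four half-integer Bessel functions."""
    b1, b2, b3, b4 = betas
    return b1 * j_half(x) + b2 * j_3half(x) + b3 * j_5half(x) + b4 * j_7half(x)
```

A quick sanity check is the recurrence J_{ν+1}(x) = (2ν/x) J_ν(x) − J_{ν−1}(x), which the four closed forms satisfy exactly.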

Bicomplex Convolutional Neural Network
In this section, we define our BCCNN and an appropriate parameter initialization. The BCCNN can be understood as a generalization of the quaternionic convolutional neural network (QCNN) (see [5,15-17]) and of the classical real-valued deep CNN (see [11]) to the case of bicomplex numbers. Taking into account [16,17,23] on CNNs via quaternions and the theory of bicomplex numbers [12], the bicomplex convolution operation is performed via the real-number matrix representation (2.2). Hence, the one-dimensional convolutional layer, with a kernel containing feature maps, is split into 4 parts: the first part corresponds to x_1, the second to i y_1, the third to j x_2, and the last to k y_2 of a bicomplex number Z = x_1 + i y_1 + j x_2 + k y_2.
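Concretely, the four-way split means that a bicomplex convolution reduces to real convolutions on the component signals, combined with the signs of the bicomplex product. A minimal 1-D sketch (helper names are ours; "valid" padding, single channel):

```python
def correlate1d(x, w):
    """Valid-mode real 1-D cross-correlation of signal x with kernel w."""
    n = len(x) - len(w) + 1
    return [sum(x[i + t] * w[t] for t in range(len(w))) for i in range(n)]

def bc_conv1d(Z, W):
    """Bicomplex 1-D convolution: real correlations on the four components,
    combined according to the bicomplex product (a sketch in our notation)."""
    x1, y1, x2, y2 = Z                # input component signals
    X1, Y1, X2, Y2 = W                # filter component kernels
    c = correlate1d
    def comb(terms):                  # signed elementwise sum of equal-length lists
        return [sum(sgn * seq[i] for sgn, seq in terms)
                for i in range(len(terms[0][1]))]
    return (comb([(+1, c(x1, X1)), (-1, c(y1, Y1)), (-1, c(x2, X2)), (+1, c(y2, Y2))]),
            comb([(+1, c(x1, Y1)), (+1, c(y1, X1)), (-1, c(x2, Y2)), (-1, c(y2, X2))]),
            comb([(+1, c(x1, X2)), (-1, c(y1, Y2)), (+1, c(x2, X1)), (-1, c(y2, Y1))]),
            comb([(+1, c(x1, Y2)), (+1, c(y1, X2)), (+1, c(x2, Y1)), (+1, c(y2, X1))]))
```

With kernels of length 1, `bc_conv1d` reduces to pointwise bicomplex multiplication, which is a convenient way to check the sign pattern against the multiplication rules of Sect. 2.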
For the activation function, we consider a combination of the so-called split activation introduced in [24] for the quaternionic case with the real-valued activation function (3.9) defined in terms of Bessel functions, i.e., F is applied independently to each of the four real components:

F(Z) = F(x_1) + i F(y_1) + j F(x_2) + k F(y_2).   (4.1)

Taking into account the properties of the Bessel functions and the ideas presented in [4], we can introduce the concept of threshold function associated with our activation function (4.1).
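In code, a split activation is simply the component-wise application of a real-valued function to the four real parts of a bicomplex number; a minimal sketch (our notation, any real-valued F):

```python
def split_activation(Z, F):
    """Apply a real-valued function F to each real component of
    Z = x1 + i*y1 + j*x2 + k*y2, represented as a 4-tuple."""
    x1, y1, x2, y2 = Z
    return (F(x1), F(y1), F(x2), F(y2))
```

Here F may be the Bessel-type combination (3.9) or any classical activation such as ReLU.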

Definition 4.1.
Let n ≥ 1 and T ⊂ C^2(i). A complex-valued function f : T → C is called a threshold function if there exists a weighting vector W = (w_0, w_1, w_2), with w_i ∈ C(i), such that f(z_1, z_2) = P(w_0 + w_1 z_1 + w_2 z_2) for all (z_1, z_2) ∈ T, where P is the activation function of [4] recalled in Sect. 2. Moreover, proceeding similarly to the proof of Theorem 3.3 of [4], we have the following result: let T ⊂ C^2(i) be a bounded domain, f : T → C a threshold function, and (w_0, 0, 0) a weighting vector of f(z_1, z_2). Then there exists δ > 0 such that (w_0, w_1, w_2) is a weighting vector of f whenever |w_j| < δ, j = 1, 2.
A differentiable cost function guarantees that backward propagation can be performed. More precisely, the gradient with respect to a loss function J is expressed for each component of the bicomplex weights w^l that compose the matrix W^l at layer l, the output layer quantifying the error with respect to the target vector for each neuron. The convolution of a bicomplex filter matrix with a bicomplex vector is performed taking into account the previous multiplication rules. In fact, let W = X_1 + i Y_1 + j X_2 + k Y_2 be a bicomplex weight filter matrix, and Z = x_1 + i y_1 + j x_2 + k y_2 the bicomplex input vector. The bicomplex convolution W ⊗ Z is defined as follows:

W ⊗ Z = (X_1 ∗ x_1 − Y_1 ∗ y_1 − X_2 ∗ x_2 + Y_2 ∗ y_2)
      + i (X_1 ∗ y_1 + Y_1 ∗ x_1 − X_2 ∗ y_2 − Y_2 ∗ x_2)
      + j (X_1 ∗ x_2 − Y_1 ∗ y_2 + X_2 ∗ x_1 − Y_2 ∗ y_1)
      + k (X_1 ∗ y_2 + Y_1 ∗ x_2 + X_2 ∗ y_1 + Y_2 ∗ x_1),   (4.2)

and can thus be expressed in matrix form following the matrix representation (2.2):

[ (W ⊗ Z)_1 ]   [ X_1  −Y_1  −X_2   Y_2 ] [ x_1 ]
[ (W ⊗ Z)_i ] = [ Y_1   X_1  −Y_2  −X_2 ] [ y_1 ]
[ (W ⊗ Z)_j ]   [ X_2  −Y_2   X_1  −Y_1 ] [ x_2 ]
[ (W ⊗ Z)_k ]   [ Y_2   X_2   Y_1   X_1 ] [ y_2 ]   (4.3)

where the products inside the matrix denote real convolutions. A suitable initialization scheme improves neural network convergence and reduces the risk of exploding and vanishing gradients. However, bicomplex weights cannot be initialized component-wise as with the traditional criteria; the reason for this lies in the specific bicomplex algebra and the interaction between the components. Based on the ideas presented in [16,17], a weight component w of the weight matrix W can be sampled as follows:

w = λ (cos θ + Z̆ sin θ).

The angle θ is randomly generated in the interval [−π, π]. The bicomplex Z̆ is defined as purely imaginary and normalized: the imaginary components i y_1, j x_2, and k y_2 are sampled from the uniform distribution on [0, 1], and the result is then normalized via (2.1) to obtain Z̆. The parameter λ is sampled from [−σ, σ], where (see [16,17])

σ = 1 / sqrt(2 (n_in + n_out)),

with n_in and n_out the number of neurons in the input and output layers.
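The sampling scheme can be sketched as follows. Since the exact formula for w is not reproduced above, the line `w = lam * (cos(theta) + Z * sin(theta))` is our reading of the quaternionic scheme of [16,17], adapted to a purely imaginary, normalized bicomplex direction:

```python
import math
import random

def bc_weight_init(n_in, n_out, rng=random):
    """Sample one bicomplex weight as (real, i, j, k) parts; our sketch of the
    initialization described in the text, following the scheme of [16,17]."""
    sigma = 1.0 / math.sqrt(2.0 * (n_in + n_out))           # Glorot-like criterion
    lam = rng.uniform(-sigma, sigma)                        # magnitude parameter
    theta = rng.uniform(-math.pi, math.pi)                  # random phase
    y1, x2, y2 = (rng.uniform(0.0, 1.0) for _ in range(3))  # imaginary components
    norm = math.sqrt(y1**2 + x2**2 + y2**2)                 # normalize via (2.1)
    y1, x2, y2 = y1 / norm, x2 / norm, y2 / norm            # unit purely imaginary Z
    # w = lam * (cos(theta) + Z * sin(theta))
    return (lam * math.cos(theta),
            lam * y1 * math.sin(theta),
            lam * x2 * math.sin(theta),
            lam * y2 * math.sin(theta))
```

Because Z̆ has unit norm, the Euclidean norm of the sampled weight equals |λ| and is therefore bounded by σ.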

Numerical Examples
In this final section, we present a simple numerical implementation in which we consider the Bessel-type activation function (4.1) and compare its behaviour with the classical ReLU activation function, in order to perform a comparative analysis of the results and show the effectiveness of our approach. In our numerical simulation, we consider the Colored MNIST dataset and a BCCNN as a baseline model. The MNIST dataset consists of handwritten digits and contains a training set of 60,000 examples and a test set of 10,000 examples. Each sample is a 28 × 28 pixel image of a digit 0-9, with pixel values ranging from 0 to 255. To obtain the Colored MNIST, we display the training images using a colour map and its reversed version. We emphasize that this colorized version of the MNIST dataset is more difficult for the network to train on (Fig. 1).
The BCCNN model takes (4.3) into account and is built with a convolutional group composed of 2 convolutional layers: the first has 1 convolutional filter as input and 25 convolutional filters as output, the second has 25 filters as input and 50 filters as output, and each filter has a kernel size of 3 × 3. After the convolutional layers, we have a fully connected layer with 28,800 input units and 100 output units, followed by the final layer with 100 input units and 10 output units, which gives the final prediction for the 10 classes of the Colored MNIST dataset. We use ReLU and (4.1) as activation functions, with the exception of the last layer, where we use a LogSoftmax activation. We employ the negative log-likelihood loss (NLLLoss) and the Adam algorithm as optimiser. For the learning rate, we opted for a dynamic value which is reduced when the loss metric has stopped improving (the so-called ReduceLROnPlateau scheduler). As the initial learning rate value, we follow the guidelines from [22] and choose the value where the gradient towards the minimum loss value is steepest, which in our case was found to be around 1.8 × 10^{-3}. In Fig. 2, we show the performance of the baseline BCCNN model with ReLU as activation function. In Fig. 2A, the dashed red line highlights the point where the gradient towards the minimum loss value is steepest, in this case 2.031 × 10^{-4}. In Fig. 2B, the continuous (resp. dot-dashed) line shows the loss (resp. accuracy) for the BCCNN model with the ReLU activation function. These results serve as a benchmark against which to test the proposed new activation functions. We now consider (4.1) as activation function and examine the behaviour of the BCCNN.
In Fig. 3, the orange (resp. blue) continuous line corresponds to the training (resp. validation) phase for the activation function F(x) with β_i = 1, while the dot-dashed green (resp. red) line corresponds to the results for the baseline model with ReLU activation function in the training (resp. validation) phase. From the analysis of Fig. 3, we see that when all β_i in (4.1) are equal to one (see Fig. 3B), the BCCNN gives poor classification accuracy and shows an essentially constant behaviour. If we instead let the values of β_i be learned by the BCCNN as new parameters during the training phase, we obtain a better result, as displayed in Fig. 3C: although the accuracy on the validation dataset stays around 90%, the maximum accuracy is reached around epoch 20, which shows the advantage of this activation over the traditional ReLU activation.

Conclusions
In this paper, we considered bicomplex neural networks with an activation function of Bessel type. This new type of activation function leads to better results when compared with the corresponding ones obtained with ReLU. Our numerical experiments reveal that Bessel-type functions combine, in the same activation function, characteristics of the ReLU and of sinusoidal activation functions. In future work, it would be interesting to consider bicomplex neural networks in more challenging classification tasks, such as the classification of clinical images. Another possible direction consists in considering this new activation function in the quaternionic case, the hyperbolic case, as well as in the case of higher-dimensional hypercomplex algebras, commutative or not. Hypercomplex-valued NNs allow the accumulation of several complex variables into a single-variable theory, which can reduce calculations and improve the accuracy of the algorithms.

Data Availibility
The author provides references to all data and materials used in this work.

Conflict of interest
The author declares that he has no conflict of interest.
Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.