Quaternionic Convolutional Neural Networks with Trainable Bessel Activation Functions

Quaternionic convolutional neural networks (QCNN) possess the ability to capture both external dependencies between neighboring features and internal latent dependencies within features of an input vector. In this study, we employ QCNN with activation functions based on Bessel-type functions with trainable parameters to perform classification tasks. Our experimental results demonstrate that this activation function outperforms the traditional ReLU activation function. Throughout our simulations, we explore various network architectures. The use of activation functions with trainable parameters offers several advantages, including enhanced flexibility, adaptability, improved learning, customized model behavior, and automatic feature extraction.


Introduction
Convolutional Neural Networks (CNNs) are widely used in the field of artificial intelligence and have become a key tool in various applications such as image classification, face recognition, and machine translation (see [11] and references therein). The choice of activation function plays a crucial role in the performance of CNNs. However, in some cases, the selected activation function may not possess all the necessary properties for a specific CNN. Currently, the process of selecting activation functions is typically done manually, relying heavily on the network architecture and often involving exhaustive trial-and-error methodologies. This means that the network needs to be retrained for each activation function until the optimal configuration is found.
Regardless of the approach taken, it is evident that introducing new activation functions is a challenging task, and their benefits are sometimes limited. This limitation arises from the fact that many proposed activation functions in the literature are inconsistent and task-dependent. As a result, classical activation functions such as the Rectified Linear Unit (ReLU) continue to dominate practical applications due to their proven effectiveness.
One of the major challenges in machine learning is to compute appropriate representations of large datasets in a robust latent space. A good model needs to efficiently encode both local and structural relations within the input features, such as the Red, Green, and Blue (R,G,B) channels of a single pixel and the composition of pixels to form edges or shapes. Traditional neural networks often struggle to capture these intricate relationships effectively, and the selection of suitable activation functions becomes crucial. However, the introduction of quaternionic neural networks has revolutionized this aspect. Quaternionic Convolutional Neural Networks (QCNN) leverage the enhanced representation power of quaternions, providing a higher-dimensional representation compared to real numbers (see [4,12,13,14]). By using quaternionic numbers, QCNN can capture more complex relationships and patterns in the data, especially when dealing with spatial or sequential data with inherent quaternionic properties.
Another advantage of QCNN is their efficient representation of 3D rotations. Quaternions are particularly well-suited for representing 3D rotations and can encode rotation information with fewer parameters compared to alternative representations like Euler angles or rotation matrices. This makes QCNN highly effective for tasks involving 3D data, such as computer vision, robotics, or augmented/virtual reality. QCNN also offer reduced parameter redundancy. By exploiting the properties of quaternions, these networks can reduce redundancy in parameter sharing across different channels. This results in a more compact representation compared to real-valued networks with the same expressive power, leading to more efficient models with improved generalization capabilities. The geometric interpretability of quaternionic numbers is another advantage. Quaternions have a geometric interpretation that facilitates reasoning about spatial relationships and transformations. This interpretability is particularly advantageous in applications involving geometric reasoning, such as computer graphics, computer vision, and robotics (see [5]). QCNN introduce quaternion convolutional operations, which generalize traditional convolutions to quaternionic data. These operations capture spatial patterns and correlations in quaternionic images or volumes, enabling more expressive and efficient convolutional layers (see [9]).
The goal of this work is to explore the potential of incorporating a generic activation function of Bessel-type with trainable parameters in a Quaternionic Convolutional Neural Network (QCNN). Building upon the novel activation functions of hypergeometric type introduced in [18], our aim is to demonstrate their applicability in the context of QCNN. By introducing trainable parameters to the activation function, we harness several advantages that enhance the performance and flexibility of the network. Firstly, activation functions with trainable parameters offer enhanced representation power by capturing more complex relationships and patterns within the data. The ability to adapt the activation function based on the specific task at hand allows the network to learn and encode intricate features more effectively. Secondly, the introduction of trainable parameters enables a more efficient representation of 3D rotations in quaternionic neural networks. Quaternions are particularly suitable for representing 3D rotations, and by leveraging the efficient quaternionic representation, the network can effectively handle tasks involving 3D data, such as computer vision, robotics, and augmented/virtual reality. Additionally, activation functions with trainable parameters reduce parameter redundancy in the network. By sharing parameters across different channels, these functions provide a more compact representation compared to traditional real-valued networks. This reduction in redundancy leads to more efficient models with improved generalization capabilities. Moreover, activation functions with trainable parameters offer geometric interpretability. The geometric interpretation of quaternionic numbers facilitates reasoning about spatial relationships and transformations. This interpretability is particularly advantageous in applications involving geometric reasoning, such as computer graphics, computer vision, and robotics.
By incorporating activation functions with trainable parameters, we combine the advantages of adaptability, efficient representation, parameter reduction, and geometric interpretability in quaternionic neural networks. This empowers the network to learn and represent complex data more effectively, leading to improved performance and broader applicability in various domains.
To provide the necessary background for understanding our work, Sect. 2 of the paper presents a brief overview of quaternionic numbers and special functions. These concepts are essential for grasping the theoretical foundations of our approach. In Sect. 3, we leverage the ideas presented in [18] to introduce a general family of hypergeometric activation functions with multiple trainable parameters. These functions offer flexibility and expressiveness, allowing for fine-tuning of the activation behavior in the QCNN architecture. Subsequently, we delve into the details of our Quaternionic Convolutional Neural Network in the following section. We outline the architecture and highlight the unique aspects and considerations specific to quaternionic computations. To validate the effectiveness of our approach, we conduct a series of numerical experiments in the last section. We utilize the Color FashionMNIST dataset as the basis for our evaluations, aiming to demonstrate the improved performance and capabilities of our proposed method.
By combining the power of hypergeometric activation functions and quaternionic computations, our study contributes to advancing the field of Quaternionic Neural Networks and presents a promising avenue for enhancing the representation and processing of complex data structures.

Basics on Quaternionic Analysis
In this section we recall some basic definitions of quaternionic analysis (for more details we refer the interested reader to [7] and the references indicated therein). Let 1, e_1, e_2, e_3, where we identify e_0 with 1, be an orthonormal basis of R^4. We introduce an associative multiplication of the basis vectors subject to the multiplication rules

e_1^2 = e_2^2 = e_3^2 = -1,   e_1 e_2 = -e_2 e_1 = e_3,   e_2 e_3 = -e_3 e_2 = e_1,   e_3 e_1 = -e_1 e_3 = e_2.

This non-commutative product generates the algebra of real quaternions, denoted by H. The real vector space R^4 is embedded in H by identifying the element x = (x_0, x_1, x_2, x_3) ∈ R^4 with the element x = x_0 + x_1 e_1 + x_2 e_2 + x_3 e_3 ∈ H. The real number Sc x := x_0 is called the scalar part of x, and Vec x := x_1 e_1 + x_2 e_2 + x_3 e_3 is the vector part of x, or pure quaternion. Analogous to the complex case, the conjugate of x is the quaternion x̄ = x_0 - x_1 e_1 - x_2 e_2 - x_3 e_3. The norm of x is given by |x| = √(x x̄) and coincides with the Euclidean norm of x as a vector of R^4. The previous definitions can be used to describe spatial rotations. For the purpose of numerical calculations, we consider the following representation of a quaternion x in terms of a matrix of real numbers:

[ x_0  -x_1  -x_2  -x_3 ]
[ x_1   x_0  -x_3   x_2 ]
[ x_2   x_3   x_0  -x_1 ]
[ x_3  -x_2   x_1   x_0 ]

The Hamiltonian product between two quaternions x and y is then given by

x ⊗ y = (x_0 y_0 - x_1 y_1 - x_2 y_2 - x_3 y_3) + (x_0 y_1 + x_1 y_0 + x_2 y_3 - x_3 y_2) e_1 + (x_0 y_2 - x_1 y_3 + x_2 y_0 + x_3 y_1) e_2 + (x_0 y_3 + x_1 y_2 - x_2 y_1 + x_3 y_0) e_3.        (2.2)
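To make the algebraic conventions above concrete, the following minimal NumPy sketch implements the Hamiltonian product (2.2) and the real-matrix representation of a quaternion; the function names are ours, introduced only for illustration.

```python
import numpy as np

def hamilton_product(x, y):
    """Hamilton product of two quaternions given as arrays [x0, x1, x2, x3]."""
    x0, x1, x2, x3 = x
    y0, y1, y2, y3 = y
    return np.array([
        x0*y0 - x1*y1 - x2*y2 - x3*y3,
        x0*y1 + x1*y0 + x2*y3 - x3*y2,
        x0*y2 - x1*y3 + x2*y0 + x3*y1,
        x0*y3 + x1*y2 - x2*y1 + x3*y0,
    ])

def as_matrix(x):
    """4x4 real-matrix representation: as_matrix(x) @ y equals hamilton_product(x, y)."""
    x0, x1, x2, x3 = x
    return np.array([
        [x0, -x1, -x2, -x3],
        [x1,  x0, -x3,  x2],
        [x2,  x3,  x0, -x1],
        [x3, -x2,  x1,  x0],
    ])

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.5, -1.0, 0.0, 2.0])
assert np.allclose(hamilton_product(x, y), as_matrix(x) @ y)
print(np.linalg.norm(x))  # |x| = sqrt(x * conj(x)) equals the Euclidean norm in R^4
```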

Special Functions
In this work, we make use of the generalized hypergeometric function pFq, which is defined by (see [15])

pFq(a_1, ..., a_p; b_1, ..., b_q; z) = Σ_{n=0}^{∞} [(a_1)_n ⋯ (a_p)_n] / [(b_1)_n ⋯ (b_q)_n] · z^n / n!,        (2.3)

where (a)_n = a(a+1)⋯(a+n-1) denotes the Pochhammer symbol, and where the convergence is guaranteed if one of the following conditions is satisfied:

(i) p ≤ q, for all z ∈ C;   (ii) p = q + 1, for |z| < 1.        (2.4)

Moreover, (2.3) is an analytic function of a_1, ..., a_p, b_1, ..., b_q and z which is defined in C^{p+q+1}. In the cases p ≤ q, for fixed a_1, ..., a_p, b_1, ..., b_q, it is an entire function of z. If the parameters a_k include negative integers, the function (2.3) degenerates to a polynomial in z.
Another special function that will play an important role in this work is the Bessel function of the first kind J_ν, which is defined by the following series (see [1])

J_ν(z) = Σ_{m=0}^{∞} [(-1)^m / (m! Γ(m + ν + 1))] (z/2)^{2m+ν}.

The Bessel function is related to the hypergeometric function 0F1 by the following relation (see [1])

J_ν(z) = (z/2)^ν / ν! · 0F1(; ν + 1; -z²/4),

when ν is a non-negative integer. For more details about hypergeometric functions and other special functions, we refer, for example, to [1].
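As a quick numerical sanity check of these formulas, the following SciPy-based sketch verifies the elementary closed forms of J_ν for half-integer orders (used later in Section 3) and the relation between J_ν and 0F1; the small hyp0f1 helper is a truncated series written only for this illustration.

```python
import numpy as np
from scipy.special import jv, gamma

x = np.linspace(0.1, 10.0, 50)

# Half-integer orders reduce to elementary functions:
j_half = np.sqrt(2.0 / (np.pi * x)) * np.sin(x)                     # J_{1/2}
j_3half = np.sqrt(2.0 / (np.pi * x)) * (np.sin(x) / x - np.cos(x))  # J_{3/2}
assert np.allclose(jv(0.5, x), j_half)
assert np.allclose(jv(1.5, x), j_3half)

# Relation to 0F1: J_nu(x) = (x/2)^nu / Gamma(nu + 1) * 0F1(; nu + 1; -x^2/4).
def hyp0f1(b, z, terms=60):
    """Truncated series of 0F1(; b; z) = sum_n z^n / ((b)_n n!)."""
    out, term = 0.0, 1.0
    for n in range(terms):
        out += term
        term *= z / ((b + n) * (n + 1))
    return out

nu = 2.0
approx = (x / 2.0) ** nu / gamma(nu + 1.0) * np.vectorize(hyp0f1)(nu + 1.0, -x**2 / 4.0)
assert np.allclose(jv(nu, x), approx, atol=1e-8)
```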

General Activation Functions with Trainable Parameters
The performance of a Convolutional Neural Network (CNN) is greatly influenced by the choice of its activation function. However, selecting the most suitable activation function for a specific CNN can be challenging. Often, the process involves manual selection based on the NN's architecture, leading to exhaustive trial-and-error approaches where the NN is retrained for each activation function until the optimal configuration is found.
To address this issue, we draw inspiration from the concepts presented in [17,18] and propose the integration of a general multi-parametric activation function in our CNN. This approach falls within the realm of automatic activation function design. We build upon the concepts of parametric activation functions introduced in [6] and adaptive activation functions discussed in [19], tailoring them to handle hypergeometric functions. These adaptations serve as a foundation for our work in achieving automatic activation function selection in CNNs. More precisely, we consider the general multi-parametric activation function (3.1) of hypergeometric type introduced in [18], where c_1, c_2, c_4, c_6 ∈ R, c_5 ∈ N_0, c_3, c_7 ∈ N, and the parameters in the hypergeometric function satisfy (2.4). The utilization of equation (3.1) allows for the approximation of any continuous function on a compact set, thanks to the numerous parameters involved. Furthermore, when convergence is assured, it becomes feasible to define specific parameter sub-ranges within equation (3.1). This allows the elements within the proposed class to possess desirable properties that are beneficial for their role as activation functions. As stated in [17,18], the multi-parametric activation function (3.1) groups several of the standard activation functions proposed in the literature for deep NNs. In Table 1 we indicate which cases are included in (3.1); moreover, in [18] it is shown in detail how the activation functions indicated in Table 1 are derived from the general expression (3.1).

In our work, we consider a particular case of (3.1) that involves the Bessel function of the first kind J_ν(x), with ν a positive half-integer. In fact, for a suitable choice of the parameters in (3.1) we obtain a one-parameter activation function (3.3) based on J_ν. It follows from the properties of the Bessel function of the first kind that, for half-integer values of ν, the activation function (3.3) reduces to a combination of polynomials and elementary trigonometric functions such as sin and cos. In fact, for the first four positive half-integers we have

J_{1/2}(x) = √(2/(πx)) sin x,        (3.4)
J_{3/2}(x) = √(2/(πx)) (sin x / x - cos x),        (3.5)
J_{5/2}(x) = √(2/(πx)) ((3/x² - 1) sin x - (3/x) cos x),        (3.6)
J_{7/2}(x) = √(2/(πx)) ((15/x³ - 6/x) sin x - (15/x² - 1) cos x).        (3.7)

Following the ideas presented in [17,18], we consider the following linear combination of (3.4)-(3.7):

F(x) = β_1 J_{1/2}(x) + β_2 J_{3/2}(x) + β_3 J_{5/2}(x) + β_4 J_{7/2}(x),        (3.8)

i.e., we combine the Bessel functions (3.4)-(3.7) using trainable parameters β_1, ..., β_4 that dynamically change how much each Bessel function contributes to the final activation function. The previous activation function can also be understood as a parametric activation function in the sense presented in [6]. Although (3.8) is not a monotonic function on the whole positive real line, we can restrict our activation function to the interval I_F := [0, M], where M is the first positive zero of the first derivative of F. In Fig. 1 we present plots of (3.8) for some particular choices of the parameters.

The general activation function introduced in (3.8) satisfies the following definition of activation function in the quaternionic setting (the bicomplex version of this definition was introduced in [3]).

Definition 3.1 Let x = x_0 + x_1 e_1 + x_2 e_2 + x_3 e_3 and y = y_0 + y_1 e_1 + y_2 e_2 + y_3 e_3 be elements of H, and let z = x ⊗ y = z_0 + z_1 e_1 + z_2 e_2 + z_3 e_3. We define the quaternionic activation function P by

P(z) = F(z_0) + F(z_1) e_1 + F(z_2) e_2 + F(z_3) e_3,        (3.9)

where F is the activation function given in (3.8), and z_i = (x ⊗ y)_i, with i = 0, 1, 2, 3, represents each component of the Hamiltonian product described in (2.2).
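A minimal PyTorch sketch of the trainable Bessel-type activation (3.8) is given below, assuming the linear-combination form with four trainable coefficients; the clamping of the argument away from zero and the initial values of the coefficients are implementation choices of ours, not prescribed by the text.

```python
import math
import torch
import torch.nn as nn

class BesselActivation(nn.Module):
    """Sketch of the activation (3.8): a trainable linear combination of the
    half-integer Bessel functions J_{1/2}, J_{3/2}, J_{5/2}, J_{7/2}."""

    def __init__(self, init=(1.0, 1.0, 1.0, 1.0), eps=0.1):
        super().__init__()
        self.coeff = nn.Parameter(torch.tensor(init, dtype=torch.float32))
        self.eps = eps

    def forward(self, x):
        # Clamp the argument away from 0: the closed forms have a 1/x singularity
        # and suffer float32 cancellation near the origin.
        x = torch.clamp(x, min=self.eps)
        pre = torch.sqrt(2.0 / (math.pi * x))
        s, c = torch.sin(x), torch.cos(x)
        j12 = pre * s
        j32 = pre * (s / x - c)
        j52 = pre * ((3.0 / x**2 - 1.0) * s - 3.0 * c / x)
        j72 = pre * ((15.0 / x**3 - 6.0 / x) * s - (15.0 / x**2 - 1.0) * c)
        basis = torch.stack([j12, j32, j52, j72], dim=-1)
        return (basis * self.coeff).sum(dim=-1)
```

In practice one can additionally restrict the pre-activations to the interval I_F = [0, M] discussed above; here the clamp only keeps the evaluation away from the singular point at the origin.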
Moreover, the previous definition allows us to introduce the notion of threshold function for the quaternionic case: given a domain Ω in H, a quaternion-valued function f : Ω → H is called an H-threshold function if there exist real-valued weights w_0, w_1, w_2, and w_3 such that

f(x) = P(w_0 x_0 + w_1 x_1 e_1 + w_2 x_2 e_2 + w_3 x_3 e_3),

where x ∈ Ω.
In QCNN, threshold functions play a crucial role because they introduce nonlinearity and decision-making capabilities. They enable QCNN to capture complex relationships and make binary decisions based on predefined thresholds. By leveraging the unique properties of quaternion algebra, threshold functions in QCNN are particularly beneficial for applications involving 3D rotations, computer vision, robotics, and augmented/virtual reality. Additionally, they help reduce parameter redundancy, leading to more efficient models with improved generalization capabilities. Taking into account Theorem 2.1 in [2] and the general activation function (3.8), we have the following theorem about the quaternionic perceptron algorithm.

Theorem 3.3 Let Ω be a bounded domain in H, let f : Ω → H be an H-threshold function, and let (w_0, 0, 0, 0) be a weighting vector of f(x_0, x_1, x_2, x_3). Then there exists δ > 0 such that (w_0, w_1, w_2, w_3) is a weighting vector of f for every w_1, w_2, w_3 satisfying |w_j| < δ, j = 1, 2, 3.

Proof The proof is a straightforward adaptation of the bicomplex case presented in [3].
Finally, we can establish the following universal approximation theorem.

Theorem 3.4 Let Ω be a bounded domain in H and let C(Ω) be the space of all continuous functions on Ω. Then, given a function f ∈ C(Ω) and ε > 0, there exist an integer m, sets of constants α_i, γ_i, and weights w_{ij}, where i = 1, ..., m and j = 0, 1, 2, 3, such that

F(x) = Σ_{i=1}^{m} α_i P(w_{i0} x_0 + w_{i1} x_1 e_1 + w_{i2} x_2 e_2 + w_{i3} x_3 e_3 + γ_i)        (3.10)

is an approximate realization of the function f, that is, |F(x) - f(x)| < ε for all x = x_0 + x_1 e_1 + x_2 e_2 + x_3 e_3 ∈ Ω.
Proof Since our general activation function F(x), given by (3.8), with x ∈ I_F, is a non-constant, bounded, and monotone-increasing function, each component of the quaternionic activation (3.9) satisfies the conditions of the Universal Approximation Theorem presented in Section 4.12 of [8]. Hence the proof of Theorem 3.4 reduces to applying the arguments presented in [8] to each component of our quaternionic activation function.
Theorem 3.4 serves as an existence theorem, providing mathematical justification for approximating any continuous function rather than achieving an exact representation. Moreover, the universal approximation theorem applies directly to multilayer perceptrons. We observe that Equation (3.10) represents the output of a multilayer perceptron described as follows:
• The perceptron comprises 4 input nodes and a single hidden layer containing m neurons. The inputs are denoted as x_0, x_1, x_2, and x_3.
• Each hidden neuron, indexed by i, possesses synaptic weights w_{i0}, w_{i1}, w_{i2}, w_{i3} and a bias term γ_i.
• The output of the network is obtained by linearly combining the outputs of the hidden neurons, with α_1, ..., α_m representing the synaptic weights of the output layer.
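The following sketch mirrors this description, assuming the componentwise quaternionic activation of Definition 3.1; the class names and the omission of the explicit bias terms γ_i are our simplifications for illustration.

```python
import torch
import torch.nn as nn

class HThresholdUnit(nn.Module):
    """One H-threshold neuron: x -> P(w0 x0 + w1 x1 e1 + w2 x2 e2 + w3 x3 e3),
    where P applies the scalar activation F to each quaternion component."""
    def __init__(self, act):
        super().__init__()
        self.w = nn.Parameter(torch.randn(4))
        self.act = act
    def forward(self, x):              # x: (batch, 4) quaternion components
        return self.act(self.w * x)    # component-wise weighting, then F per component

class HMLP(nn.Module):
    """Single-hidden-layer network in the spirit of (3.10): a real linear
    combination of m H-threshold units, producing an H-valued (4-component) output."""
    def __init__(self, m, act):
        super().__init__()
        self.units = nn.ModuleList([HThresholdUnit(act) for _ in range(m)])
        self.alpha = nn.Parameter(torch.randn(m))
    def forward(self, x):
        outs = torch.stack([u(x) for u in self.units], dim=1)   # (batch, m, 4)
        return (self.alpha[None, :, None] * outs).sum(dim=1)    # (batch, 4)

# Usage with the Bessel activation sketched earlier:
# model = HMLP(m=32, act=BesselActivation())
```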

Quaternionic Convolutional Neural Network
In our work we consider the QCNN introduced in [12-14], but with the activation function introduced in (3.8). From the properties of the quaternions indicated previously and the definition of the quaternionic activation function, a one-dimensional convolutional layer is split into four components (one for each e_k, with k = 0, 1, 2, 3). Backpropagation is ensured by a differentiable cost function, and for any activation function B the gradient decomposes along these four components. The convolution of a quaternion filter matrix with a quaternion vector is expressed in the following matrix form:

W ⊗ x = [ W_0  -W_1  -W_2  -W_3 ] [ x_0 ]
        [ W_1   W_0  -W_3   W_2 ] [ x_1 ]
        [ W_2   W_3   W_0  -W_1 ] [ x_2 ]
        [ W_3  -W_2   W_1   W_0 ] [ x_3 ],        (4.2)

where W = W_0 + W_1 e_1 + W_2 e_2 + W_3 e_3 is a quaternionic weight filter and x = x_0 + x_1 e_1 + x_2 e_2 + x_3 e_3 is a quaternionic input vector. As pointed out in [13], we need to consider a particular algorithm for the initialization of the quaternionic parameters. In this sense, a weight component w of a weight matrix W can be sampled as

w = λ (cos θ + x̂ sin θ),

where the angle θ is randomly generated in the interval [-π, π], x̂ = x/|x| is a normalized pure imaginary quaternion whose components x_1, x_2, x_3 are sampled from the uniform distribution on [0, 1], and the parameter λ is sampled from [-σ, σ], where (see [13,14])

σ = 1 / √(2 (n_in + n_out)),

with n_in and n_out the number of neurons of the input and the output layers.
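As an illustration of how the block-matrix form of the quaternionic convolution and the initialization scheme above can be realized, here is a minimal PyTorch sketch; the mapping of the receptive-field size onto n_in and n_out, and the class name, are our assumptions rather than the exact recipe of [13,14].

```python
import math
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuaternionConv2d(nn.Module):
    """Sketch of a quaternion convolution: four real weight banks W0..W3 combined
    into the block structure of (4.2), applied with a real conv2d.
    Channel counts are per quaternion component (in_q, out_q)."""

    def __init__(self, in_q, out_q, kernel_size=3):
        super().__init__()
        shape = (out_q, in_q, kernel_size, kernel_size)
        self.W = nn.ParameterList([nn.Parameter(torch.empty(shape)) for _ in range(4)])
        self.padding = kernel_size // 2
        self._quaternion_init(in_q, out_q, kernel_size)

    def _quaternion_init(self, in_q, out_q, kernel_size):
        # w = lambda * (cos(theta) + u sin(theta)), with u a normalized pure imaginary
        # quaternion, theta ~ U[-pi, pi], lambda ~ U[-sigma, sigma],
        # sigma = 1 / sqrt(2 * (n_in + n_out)); fan sizes below are our assumption.
        fan_in, fan_out = in_q * kernel_size**2, out_q * kernel_size**2
        sigma = 1.0 / math.sqrt(2.0 * (fan_in + fan_out))
        size = tuple(self.W[0].shape)
        theta = np.random.uniform(-np.pi, np.pi, size)
        lam = np.random.uniform(-sigma, sigma, size)
        u = np.random.uniform(0.0, 1.0, (3,) + size)
        u = u / np.linalg.norm(u, axis=0, keepdims=True)
        comps = [lam * np.cos(theta)] + [lam * u[k] * np.sin(theta) for k in range(3)]
        with torch.no_grad():
            for W, c in zip(self.W, comps):
                W.copy_(torch.from_numpy(c).float())

    def forward(self, x):
        # x: (batch, 4*in_q, H, W), channels ordered as the four quaternion components.
        W0, W1, W2, W3 = self.W
        rows = [torch.cat([W0, -W1, -W2, -W3], dim=1),
                torch.cat([W1,  W0, -W3,  W2], dim=1),
                torch.cat([W2,  W3,  W0, -W1], dim=1),
                torch.cat([W3, -W2,  W1,  W0], dim=1)]
        weight = torch.cat(rows, dim=0)          # (4*out_q, 4*in_q, k, k)
        return F.conv2d(x, weight, padding=self.padding)
```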

Numerical Examples
In this final section, we present a simple implementation of a QCNN with the Bessel-type activation function (3.8) and compare its behaviour with the classical ReLU activation function, in order to perform a comparative analysis of the results and show the effectiveness of our approach.
In our numerical simulation, we consider the Color FashionMNIST dataset and a QCNN as a baseline model. The FashionMNIST dataset comprises a training set and a test set, with a total of 70,000 images. The training set contains 60,000 images, while the test set contains 10,000 images. Each image in the dataset is a 28 × 28-pixel grayscale image, resulting in a total of 784 pixels per image. The pixel values range from 0 to 255, representing different shades of gray. Each training and test example is assigned to one of the following ten labels: T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, and Ankle boot. To obtain the Color FashionMNIST we display the training images using a color map. We emphasize that this colorized version of the dataset is more difficult for the network to train on (Fig. 2).
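One plausible way to build the Color FashionMNIST used here is to push the grayscale images through a matplotlib colormap; the choice of colormap ("viridis") and the quaternion encoding (zero real part plus the three RGB channels as imaginary parts) are our assumptions, since the text does not fix them.

```python
import torch
import matplotlib.pyplot as plt
from torchvision import datasets, transforms

class ColorizeToQuaternion:
    """Map a grayscale image to a 4-channel quaternion tensor:
    zero real part + RGB imaginary parts obtained from a colormap."""
    def __init__(self, cmap="viridis"):
        self.cmap = plt.get_cmap(cmap)
    def __call__(self, img):                                  # img: (1, 28, 28) in [0, 1]
        rgb = self.cmap(img.squeeze(0).numpy())[..., :3]      # (28, 28, 3), alpha dropped
        rgb = torch.from_numpy(rgb).permute(2, 0, 1).float()  # (3, 28, 28)
        real = torch.zeros(1, *rgb.shape[1:])
        return torch.cat([real, rgb], dim=0)                  # (4, 28, 28)

transform = transforms.Compose([transforms.ToTensor(), ColorizeToQuaternion()])
train_set = datasets.FashionMNIST("data", train=True, download=True, transform=transform)
test_set = datasets.FashionMNIST("data", train=False, download=True, transform=transform)
```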
Taking into account (4.2), we consider three different architectures for our QCNN:
• Simple Neural Network: this QCNN is built from a convolutional group composed of 2 convolutional layers; the first has 1 convolutional filter as input and 25 convolutional filters as output, the second has 25 filters as input and 50 filters as output, and each filter has a kernel size of 3 × 3. After the convolutional layers, we have a fully connected layer with 28,800 units as input and 100 units as output, followed by the final layer with 100 units as input and 10 units as output, which gives the final prediction for the 10 classes expected by the Color FashionMNIST dataset (a sketch of this architecture is given after this list).
• Deep Neural Networks: these QCNN models use the ResNet18 described in [9] and the ShuffleNet V2 presented in [10].
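A sketch of the simple architecture is given below, reusing the QuaternionConv2d and BesselActivation sketches from earlier; since the exact pooling/flattening details are not spelled out in the text, a lazy linear layer is used instead of hard-coding the 28,800 input units.

```python
import torch.nn as nn

class SimpleQCNN(nn.Module):
    """Sketch of the 'Simple Neural Network' architecture described above."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            QuaternionConv2d(1, 25, kernel_size=3), BesselActivation(),
            QuaternionConv2d(25, 50, kernel_size=3), BesselActivation(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(100), BesselActivation(),
            nn.Linear(100, n_classes), nn.LogSoftmax(dim=1),
        )
    def forward(self, x):       # x: (batch, 4, 28, 28) quaternion-encoded image
        return self.classifier(self.features(x))
```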
We consider as activation functions the ReLU and the function F given by (3.8), except in the last layer, where we use a LogSoftmax activation. We employ the negative log-likelihood loss (NLLLoss) and the Adam algorithm as optimizer. For the learning rate, we opted for a dynamic value which is reduced when the loss metric has stopped improving, also known as ReduceLROnPlateau. For the initial learning rate, we follow the guidelines from [16] and choose the value where the gradient towards the minimum loss value is steepest; the resulting values are used throughout the experiments. In Fig. 3 we present the loss vs. learning rate for our models. In Fig. 4, we present the loss and accuracy values per epoch for our models, considering as activation functions the ReLU and F given in (3.8).
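The training setup described above can be assembled as follows; the batch size, initial learning rate, scheduler patience, and number of epochs are illustrative assumptions, and the sketch reuses the train_set and SimpleQCNN objects defined earlier.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SimpleQCNN().to(device)                 # from the sketch above
criterion = nn.NLLLoss()                        # model already ends in LogSoftmax
optimizer = optim.Adam(model.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min",
                                                 factor=0.1, patience=3)

train_loader = DataLoader(train_set, batch_size=64, shuffle=True)

for epoch in range(20):
    model.train()
    running_loss = 0.0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    # Reduce the learning rate when the epoch loss plateaus.
    scheduler.step(running_loss / len(train_loader))
```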
In Fig. 4, the dot-dashed (resp. continuous) line corresponds to the evolution of the accuracy (resp. loss) per epoch. The blue (resp. orange) lines represent the training (resp. validation) results with the ReLU activation function, while the green (resp. red) lines represent the training (resp. validation) results for the parametric activation function F given in (3.8). By analyzing the plots, we observe that the Bessel activation functions outperform the ReLU in terms of accuracy and loss on the validation dataset.
Lastly, we provide confusion matrices for the various QCNN that were previously examined. These matrices demonstrate the accuracy achieved in matching the predicted labels with the true labels.
Upon examining the provided confusion matrices, it is evident that they align with the information presented in Fig. 4. Across all neural network architectures, notable accuracy levels are observed for classes 2, 5, 7, 8, and 9. Introducing ResNet18 further enhances the performance for the remaining classes, except for Class 6 (Shirt), which exhibits comparatively lower results. This outcome can be attributed to the utilization of a colorized version of the dataset and the relatively low resolution of the included images.

Conclusions
In this paper, we explore the advantages of activation functions with trainable parameters in the context of QCNN. Specifically, we focus on a new type of activation function of Bessel-type with trainable parameters. Our study demonstrates that by incorporating this activation function, we can achieve superior results compared to using the traditional ReLU activation function. The Bessel-type activation functions offer a unique combination of characteristics from both ReLU and sinusoid activation functions. Notably, when the parameter of the Bessel function is a positive half-integer, this special function reduces to a combination of trigonometric and polynomial functions. This allows the Bessel-type functions to capture complex patterns and relationships within the data.
Our numerical experiments indicate that the Bessel-type activation functions enable QCNN to achieve higher levels of accuracy more rapidly compared to the ReLU activation function. This highlights the effectiveness of trainable activation functions in improving the performance of quaternionic neural networks. In future research, it would be intriguing to extend the application of QCNN with Bessel-type activation functions to more classification tasks, such as the classification of clinical images. By harnessing the benefits of trainable activation functions in complex domains, we can further explore the potential of QCNN in various real-world applications.

Fig. 2 First 25 images of the Color FashionMNIST

Fig. 3 Loss and accuracy for different QCNN

Table 1 General activation function H(x)