Low-precision feature selection on microarray data: an information theoretic approach

The number of interconnected devices surrounding us every day, such as personal wearables, cars, and smart homes, has recently increased. Internet of Things devices monitor many processes and are able to use machine learning models for pattern recognition, and even for decision making, with the added advantage of reducing network congestion by performing computations close to the data sources. The main restriction is the low computation capacity of these devices. Thus, machine learning algorithms that maintain accuracy while exploiting resource-saving mechanisms, such as low-precision versions, are needed. In this paper, low-precision mutual information-based feature selection algorithms are applied to DNA microarray datasets, showing that 16-bit and sometimes even 8-bit representations of these algorithms can be used without significant variations in the final classification results.


Introduction
The need for efficient algorithms has always been one of the goals of Computer Science. In recent years, however, we have also witnessed a growing trend towards sensing and monitoring activities and processes, which has led, among others, to Big Data on the one hand and to the Internet of Things (IoT) on the other. These two trends have given rise to research areas such as Cloud Computing and Edge Computing. Due to the increasing communication costs of sending data to and receiving data from the cloud, there is a growing interest in performing ever more complex machine learning tasks on mobile and embedded devices, frequently in real time. Thus, the objective is to optimize the use of hardware resources and power consumption. Regarding application fields, during the last few decades the emergence of microarray datasets has stimulated a new line of research, both in bioinformatics and in machine learning. This type of dataset poses an interesting challenge for two reasons: (i) the sample size is very small (often fewer than 100 patients) in contrast to a very high dimensionality (the number of features is in the order of thousands); and (ii) it has been shown that most features are not necessary for an accurate classification [12], so it is paramount to discover the relevant features to gain an understanding of the process. Thus, feature selection (FS) has become a must-do when dealing with these datasets [6].
In a previous work, we proposed a low-precision mutual information feature selection procedure [27]. Mutual Information (MI) comes from the field of Information Theory and is widely used in both machine learning and statistics. As a matter of fact, it is part of the popular method minimum Redundancy Maximum Relevance (mRMR), which is known to work very well with microarray data [30]. To the best of our knowledge, ours is the first and only attempt to adapt feature selection to low precision, despite the expected benefits that it could bring to embedded systems for on-device analysis.
The goal of the work described herein is to apply low-precision mutual information feature selection to a challenging scenario: microarray data. Three different implementations will be tested (mutual information maximization, mRMR and joint mutual information) to check whether the use of low-precision parameters is possible in datasets with such high dimensionality as microarrays.
The rest of the paper is organized as follows: Section 2 describes the state of the art of low-precision feature selection. Section 3 presents our low-precision mutual information approach. Section 4 describes the materials and methods used in the experiments, whose results are shown and analyzed in Section 5. Finally, Section 6 contains our concluding remarks and proposals for future research.

State of the art
With the growing amount of information being generated at the edge, the demand for machine learning models that can be deployed on edge devices has also increased. Although most of the effort has been put into adapting deep learning models to work on edge devices, some works have developed techniques for distributed training or for compression and pruning of other machine learning methods. Wang et al. [36] presented a technique for training, at the edge, machine learning methods that use gradient-based approaches (e.g., SVMs, K-means, linear regression or CNNs). ProtoNN is an algorithm designed by Gupta et al. [13] based on kNN that projects data to a lower dimensional space using a sparse projection matrix in order to reduce storage requirements. ProtoNN has been shown to be only 1-2% less accurate while consuming 1-2 orders of magnitude less memory. Also based on reducing the model size is Bonsai [19], a tree-based algorithm that significantly outperforms state-of-the-art techniques in terms of model size, accuracy, speed, and energy consumption. Finally, the researchers in [22] investigated the effects of parameter quantization and of reduced working precision on the accuracy of floating-point SVM classification.
As mentioned above, much effort has been made to adapt deep learning algorithms for training or inference on the edge, as described in several review works [28,40]. One challenging option is to actually train the deep learning algorithms on the edge, for which federated learning is the most widely used approach [38]. Other works focus on deploying already trained models on the edge, so typical strategies are to reduce the number of trainable parameters and minimize the number of computations [17], or to reduce the size of the models by performing quantization or model compression [9,10].
Since edge devices have limited computing power, energy consumption is a critical factor, and recent research trends show that much effort is being put into compressing neural networks. Several papers have attempted this through quantization, which lowers the memory footprint and can potentially speed up the computations. In relation to inference accuracy, many studies have shown that it is possible to achieve the same results with reduced precision of weights and activations [14,24]. Regarding learning, Hubara et al. [18] introduced a method to train Quantized Neural Networks using extremely low precision weights and runtime activations, reaching an accuracy comparable to networks trained using 32 bits. The research of Yu et al. [39] presents a quantization method with a mixed data structure and proposes a hardware accelerator. This allows them to reduce the number of bits needed to represent neural networks from 32 to 5, also without affecting their accuracy. Banner et al. [3] introduced a 4-bit post-training quantization approach with just a few percent accuracy degradation. Finally, the work of Sun et al. [33] shows that it is possible to train deep neural networks using only 4 bits with a non-significant loss in accuracy while enabling significant hardware acceleration.
With regard to reducing energy consumption in feature selection, we can only find our own work, in which we presented a limited bit depth mutual information that is applicable to any feature selection method that internally uses the mutual information measure [25,27], and which will be detailed in the following section.

Background
Mutual Information (MI) comes from the field of Information Theory and is widely used in both machine learning and statistics. One of its main uses is in feature selection methods: with fully supervised data, the features X are ranked using this measure, and the ones finally selected are those having the highest mutual information with the class label Y. The mutual information is defined as the expected logarithm of a ratio:
\[ I(X;Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \ln \frac{p(x,y)}{p(x)\,p(y)} \qquad (1) \]
where p(x, y) = Pr{X = x, Y = y} is the probability mass function of the joint distribution when the random variable X takes on the value x from its alphabet $\mathcal{X}$ and Y takes on y ∈ $\mathcal{Y}$, while p(x) = Pr{X = x} and p(y) = Pr{Y = y} are the probability mass functions of the marginal distributions. In this work, the function is calculated using the natural logarithm, so the returned units are "nats". In practice we have to estimate this quantity from data. This can be done by using the sample (maximum likelihood) estimates of the probabilities, $\hat{p}$, and plugging them into Eq. 1. This maximum likelihood estimator for the mutual information is consistent [29], and as a result we have:
\[ \hat{I}(X;Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \hat{p}(x,y) \ln \frac{\hat{p}(x,y)}{\hat{p}(x)\,\hat{p}(y)} \qquad (2) \]
In order to calculate this we need the estimated distributions $\hat{p}(x, y)$, $\hat{p}(x)$, and $\hat{p}(y)$. The probability of any particular event p(X = x) is estimated by maximum likelihood, i.e., the frequency of occurrence of the event X = x divided by the total number of events.
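The plug-in estimate described above can be sketched in a few lines of Python. This is only an illustrative sketch for discrete (already discretized) variables, not the implementation used in the experiments; the function name `mutual_information` is hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Plug-in (maximum likelihood) estimate of I(X;Y) in nats.

    xs and ys are equal-length sequences of discrete values.
    """
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    count_x = Counter(xs)         # marginal counts of X
    count_y = Counter(ys)         # marginal counts of Y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y)
        mi += (c / n) * log(c * n / (count_x[x] * count_y[y]))
    return mi
```

For two identical binary vectors with balanced values the estimate equals ln 2 nats, and for two empirically independent vectors it is 0.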
An illustrative example. Let us consider a vector Y with 961 observations, in which the number of occurrences of an event Y = y is 4. The probability $\hat{p}(y)$ will be $\hat{p}(y)$ = 4/961 = 0.004162330905307, which is approximately zero. For real applications, it is not necessary to store all the decimal digits, which makes mutual information an interesting measure for exploring low precision. Besides, as the Internet of Things device market matures, we will likely see a movement away from double-precision floating-point (i.e., 64-bit representation) towards limited approaches using a lower number of bits.
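To make the example concrete, the following sketch rounds the value 4/961 to a fixed-point grid with a given number of fractional bits. It is only an illustration of how many digits survive at each precision; the helper `quantize` is hypothetical and is not the procedure used in this work, which operates on logarithms of counters.

```python
def quantize(value, frac_bits):
    """Round value to the nearest multiple of 2**-frac_bits."""
    q = 2.0 ** -frac_bits  # quantization interval
    return round(value / q) * q


p = 4 / 961
for bits in (4, 8, 16):
    print(bits, quantize(p, bits))
```

With 4 fractional bits the probability collapses to zero, whereas with 8 bits it is already represented as 1/256, which is within about 6% of the exact value.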

Our approach
In information theoretic feature selection, the main challenge is to estimate the mutual information, for which it is necessary to estimate the probability distributions. Internally, this estimation counts the occurrences of values within a particular group (i.e., their frequency). Based on the work of Tschiatschek et al. [34] for approximately computing probabilities, in a previous work we investigated mutual information with a limited number of bits by computing this measure with low-precision counters [27]. Instead of the 64-bit resolution typically used by standard hardware platforms, a fixed-point representation was targeted, with $b_i$ as the number of integer bits and $b_f$ as the number of fractional bits. The motivation to move to fixed-point arithmetic is twofold: (i) fixed-point compute units are typically faster and consume far less hardware resources and power than conventional floating-point units, and (ii) low-precision data representation reduces the memory footprint, enabling larger models to fit within the given memory capacity and lowering the bandwidth requirements.
Besides, since mutual information parameters are typically represented in the logarithmic domain, we compute the number of occurrences of an event and use a lookup table to determine the logarithm of the probability of a particular event. The lookup table is indexed in terms of the number of occurrences of an event (individual counters) and the total number of events (total counter), and stores values for the logarithms in the desired low-precision representation. To limit the maximum size of the lookup table and the bit-width required for the counters, we assumed some maximum integer number M. The lookup table L is pre-computed such that:
\[ L(i,j) = \left[ \frac{\ln(i) - \ln(j)}{q} \right]_R \cdot q \qquad (3) \]
where $[\cdot]_R$ denotes rounding to the closest integer, q is the quantization interval of the desired fixed-point representation ($q = 2^{-b_f}$), ln(·) denotes the natural logarithm, and the counters i and j are in the range {0, ..., M − 1}. Given certain specific data, the individual counters $c_i^j$ and the population C are computed according to Algorithm 1. Following the fixed-point representation, M is set to $M = 2^{b_f + b_i} - 1$. After calculating the cumulative count C, we ensure that it is in range. Also, we divide the individual counters $c_i$ by two when C reaches its maximum value.
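The lookup table and the bounded counters can be sketched as follows. This is a minimal sketch under stated assumptions, not a reproduction of Algorithm 1: it assumes the table stores round((ln i − ln j)/q)·q, skips zero counts because ln(0) is undefined, and drops counters that become zero after halving. The names `build_log_table` and `count_with_halving` are hypothetical.

```python
from math import log


def build_log_table(b_i, b_f):
    """Pre-compute quantized values of ln(i/j) for 1 <= i <= j <= M - 1."""
    q = 2.0 ** -b_f              # quantization interval
    M = 2 ** (b_i + b_f) - 1     # maximum counter value
    table = {}
    for i in range(1, M):
        for j in range(i, M):
            # Python's round() resolves exact ties to the even integer
            table[(i, j)] = round((log(i) - log(j)) / q) * q
    return table, M


def count_with_halving(values, M):
    """Count occurrences while keeping the total counter below M."""
    counts = {}
    total = 0
    for v in values:
        counts[v] = counts.get(v, 0) + 1
        total += 1
        if total >= M:
            # Total counter reached its maximum: halve every individual
            # counter (assumption: counters that reach zero are dropped).
            counts = {k: c // 2 for k, c in counts.items() if c // 2 > 0}
            total = sum(counts.values())
    return counts, total


table, M = build_log_table(b_i=2, b_f=2)   # 4-bit example, M = 15
counts, total = count_with_halving([0] * 10 + [1] * 10, M)
log_p = table[(counts[1], total)]          # quantized ln of p(Y = 1)
```

The table has on the order of M²/2 entries, so this direct construction is only practical for small bit-widths; for 16 bits or more a real implementation would need a more compact indexing scheme.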

DNA microarray datasets
Microarray technology is used to collect information from tissue and cell samples regarding gene expression differences that could be useful for diagnosing diseases. During the last two decades, the advent of this type of dataset has stimulated a new line of research both in bioinformatics and in machine learning. Although there are usually very few samples (often fewer than 100 patients) for training and testing, the number of features in the raw data ranges from 2000 to 25,000. A typical classification task is to separate healthy patients from cancer patients based on their gene expression profile (binary approach). There are also datasets in which the goal is to distinguish among different types of tumours (multiclass approach), making the task even more complicated. Therefore, microarray data pose a serious challenge for machine learning researchers. Having so many features relative to so few samples creates a high likelihood of finding false positives due to chance (both in finding relevant genes and in building predictive models). Thus, it becomes necessary to find robust methods to validate the models and assess their likelihood.
Besides, several studies have shown that most genes measured in a DNA microarray experiment are not relevant in the accurate classification of different classes of the problem [12]. To avoid the problem of the curse of dimensionality, feature selection plays a crucial role in DNA microarray analysis, so that the learning algorithm focuses only on those aspects of the training data useful for analysis and future prediction. Apart from the mismatch between dimensionality and sample size, microarray data have other particularities such as the imbalance of the data, their complexity, the presence of overlapping, or the so-called dataset shift [6]. Table 1 profiles the main characteristics of the 17 DNA microarray datasets used in this research in terms of the number of samples, features and classes [2,7,26,32].

MI-based feature selection methods
The definition of mutual information is useful within the context of feature selection because it gives a way to quantify the relevance of a feature, or of a feature subset, with respect to the output vector. Thus, several feature selection methods based on mutual information measures exist in the literature. Most methods define heuristic functionals to assess feature subsets, combining definitions of relevant and redundant features. Among the different information theoretic methods, we have chosen three to evaluate our low-precision mutual information approach, each of them making different assumptions. For example, Mutual Information Maximization quantifies only the relevancy; minimum Redundancy Maximum Relevance quantifies the relevancy and the redundancy; and Joint Mutual Information quantifies the relevancy, the redundancy and the complementarity [8].
- Mutual Information Maximization (MIM) [23] ranks the features by their mutual information score and selects the top k features, where k is decided by some predefined need for a certain number of features or some other stopping criterion. An important limitation is that this assumes that each feature is independent of all other features and effectively ranks the features in descending order of their mutual information content. Thus, this approach does not take into account the redundancy between the features.
- minimum Redundancy Maximum Relevance (mRMR) [30] selects features that have the highest relevance with the target class and are also minimally redundant, i.e., it selects features that are maximally dissimilar to each other. Both optimization criteria (maximum relevance and minimum redundancy) are based on mutual information.
- Joint Mutual Information (JMI) [37] is another feature selection method based on mutual information, and it adopts a new criterion to evaluate the candidate features. JMI chooses the feature that has the maximum cumulative summation of joint mutual information with the selected features in each step and adds it to the subset S until the number of selected features reaches k.
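The three criteria can be sketched with a single greedy loop. This is a hedged sketch using full-precision mutual information on discrete features, not the low-precision implementation evaluated in this paper. It assumes the commonly used difference form of mRMR (relevance minus mean redundancy with the selected set) and, for JMI, the sum of I(X_k, X_j; Y) over the selected features; the names `mi` and `select` are hypothetical.

```python
from collections import Counter
from math import log


def mi(a, b):
    """Plug-in mutual information (nats) between two discrete sequences."""
    n = len(a)
    joint, ca, cb = Counter(zip(a, b)), Counter(a), Counter(b)
    return sum(c / n * log(c * n / (ca[x] * cb[y]))
               for (x, y), c in joint.items())


def select(features, y, k, criterion="mim"):
    """Return the indices of k features chosen by MIM, mRMR or JMI.

    features is a list of discrete columns, y the class labels.
    """
    if criterion not in ("mim", "mrmr", "jmi"):
        raise ValueError("criterion must be 'mim', 'mrmr' or 'jmi'")
    idx = range(len(features))
    relevance = [mi(f, y) for f in features]
    if criterion == "mim":
        # Univariate ranking: sort by relevance only
        return sorted(idx, key=lambda i: -relevance[i])[:k]
    # Multivariate criteria start from the most relevant feature
    selected = [max(idx, key=lambda i: relevance[i])]
    while len(selected) < min(k, len(features)):
        best, best_score = None, float("-inf")
        for i in idx:
            if i in selected:
                continue
            if criterion == "mrmr":
                redundancy = sum(mi(features[i], features[j])
                                 for j in selected) / len(selected)
                score = relevance[i] - redundancy
            else:  # jmi: joint variable (X_i, X_j) against the class
                score = sum(mi(list(zip(features[i], features[j])), y)
                            for j in selected)
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected
```

On a toy problem in which the second feature duplicates the first, MIM keeps the duplicate because it ignores redundancy, while mRMR and JMI prefer a less relevant but non-redundant feature.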

Results
In this section we empirically evaluate our low-precision mutual information method described in Section 3. Among the different methods that internally use the mutual information measure, we have chosen feature selection, since this process plays a key role in identifying the specific genes that enhance classification accuracy in DNA microarray data. As said above, a large number of feature selection methods use mutual information as a metric to establish the importance of the features, so their performance depends on the accuracy obtained in the mutual information step. In this work, we have implemented our limited bit depth mutual information in the MIM, mRMR and JMI filter methods due to their popularity and good results in the machine learning area. In order to estimate the mutual information of continuous features, the DNA microarray datasets were discretized into 10 bins using an equal-width strategy. After the feature selection process, the original (undiscretized) datasets were used to classify the test data. In the following sections, we investigate the questions: "how similar are the rankings obtained by the different low-precision MI-based feature selection approaches?" and "which is the impact of these rankings on classification?". To address these questions, we use the 17 DNA microarray datasets detailed in Table 1. Experiments were executed in the Matlab 2020a and Weka [15] environments, using default values for the parameters.
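Equal-width discretization into 10 bins can be sketched as below. This is an illustrative sketch only; the exact handling of bin edges in the Matlab and Weka tools used in the experiments may differ, and the function name `equal_width_bins` is hypothetical.

```python
def equal_width_bins(values, n_bins=10):
    """Map continuous values to bin indices 0 .. n_bins - 1.

    Bins have equal width over the observed range [min, max]; the
    maximum value is assigned to the last bin.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # constant feature: a single bin
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

Each gene expression column would be discretized independently before estimating its mutual information with the class label.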

How similar are the rankings obtained by the different low-precision MI-based feature selection approaches?
In this subsection, we will evaluate the similarity between the feature rankings obtained by the 64-bit mutual information and the low-precision versions (using fixed point representations with 4, 8, 16 and 32 bits) after performing the MIM, mRMR and JMI feature selection methods. To address this study, we show the true positive rate (TPR), which measures the proportion of features that are correctly identified as such, using the full mutual information version (64 bits) as the ideal ranking. In high dimensional datasets, like DNA microarray data, it is common to focus only on the top features, so in these experiments we compared only the k top features, with k = 5, 10, 20, 30, 40 and 50. As can be seen from the experimental results illustrated in Table 3, the lowest values of the low-precision approach using 4 bits show that the correlation between its selected features and the ideal ranking is quite poor in the three information theoretic methods. However, from 8 bits on, all the approaches achieved a TPR close to 1, which means that the features selected by these low-precision approaches are very similar to those selected by the full version using 64 bits. It can also be observed that, in general, by increasing the number of selected features, the TPR is higher.
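The comparison of the top-k features can be sketched as follows, under the assumption that the TPR is the fraction of the k features selected by the 64-bit version that also appear among the top k features of the low-precision version. The function name `top_k_tpr` is hypothetical.

```python
def top_k_tpr(reference_ranking, candidate_ranking, k):
    """Fraction of the reference top-k features recovered by the candidate.

    Both rankings are lists of feature indices, best feature first.
    """
    reference = set(reference_ranking[:k])
    candidate = set(candidate_ranking[:k])
    return len(reference & candidate) / len(reference)
```

A value of 1 means that both versions select the same set of k features, regardless of their order within the top k.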
To understand the possible effect that the size of the datasets could have on our results, we analyzed the TPR in two different DNA microarrays: Colon (62 samples and 2000 features) and Ovarian (253 samples and 15,154 features). As can be seen in Figs. 1 and 2, as the number of samples and features of the dataset increases, the performance of our low-precision version using 8 bits decreases. Regarding the 4-bit low-precision version, it achieved higher values of TPR in the Ovarian dataset. This could be happening because, despite the fact that the Ovarian dataset clearly has a greater number of features, it also presents higher values of mutual information than in the case of the Colon dataset (Fig. 3). Remember that, in terms of maximum relevance, the selected features are individually required to have the largest mutual information with the class label, reflecting the largest dependency on the target class. Finally, we compared the results between the different feature selection methods. It is worth noting that the univariate filter MIM, which takes into account only the individual relevance of each feature, performs better than the multivariate filters mRMR and JMI, which take into account feature dependencies. The information loss when reducing the number of bits affects the results of these multivariate methods much more than those of the less complex univariate methods. Besides, it can be seen that JMI performs better, in some cases, than MIM and mRMR when 8 bits are used. This could be because the JMI criterion has the best trade-off in terms of stability and flexibility among the feature selection methods based on Information Theory due to its nature (it balances the relevancy and redundancy terms and includes the conditional redundancy) [8].

Which is the impact of these rankings on classification?
Once feature selection has been carried out, and in order to estimate whether the low-precision mutual information in the MIM, mRMR and JMI methods might affect classification, a study using three classifiers belonging to different families was performed. At this point, it is necessary to clarify that including classifiers in our experiments is likely to obscure the experimental observations related to feature selection performance using a limited number of bits, since classifiers have their own assumptions and particularities. It has been shown that certain classifiers can obtain outstanding accuracy levels even when the feature ranking is not optimal [5]. Therefore, in these experiments, we used a simple nearest neighbor algorithm (with number of neighbors k = 3) [1], since it makes few assumptions about the data and avoids the need for parameter tuning; a linear support vector machine (SVM) [35], due to its superior performance over other classifiers in this specific domain of microarray datasets [6,16]; and a boosting algorithm (LogitBoost) [11]. To estimate the error rate we computed a 3 × 5-fold cross-validation (i.e., 3 repetitions of a cross-validation with 5 folds), including both feature selection and classification steps in a single cross-validation loop [21]. Tables 4, 5 and 6 show the classification results for the MIM, mRMR and JMI feature selection methods, respectively (for each classifier and number of features, the highest accuracy rates are highlighted in bold). As can be seen for the three information theoretic methods, the 8-, 16- and 32-bit low-precision versions achieved results that are very competitive with, and in some cases even better than, those of the baseline 64-bit approach. Besides, we can see that the classification accuracy improves as the number of features increases.
Remember that, in the case that the top 50 features are selected, the number of features used to train the model will be not even 3% of the number of features in the original microarray dataset.
To explore the statistical significance of our classification results, and due to the drawbacks of the traditional null hypothesis significance tests pointed out by [4], we have chosen to apply the Bayesian hypothesis test [20]. In this type of analysis, a previous step is needed, which consists in the definition of the region of practical equivalence (rope). Two methods are considered practically equivalent if their mean difference for a certain metric is less than a predefined threshold. In our case, we will consider two methods as equivalent if the difference in error is less than 1%. For the whole benchmark and each pair of methods, we calculated the probability of the three possibilities: (i) the low-precision version wins over the full version (64-bit) with a difference larger than the rope, (ii) the full version wins over the low-precision version with a difference larger than the rope, and (iii) the difference between the results is within the rope area. If one of these probabilities is higher than 95%, we consider that there is a significant difference. Figures 4, 5 and 6 show the distribution of the differences between each pair of methods using simplex graphs. Since analyzing specific aspects related to classification is not the goal of this paper, we only show the results for the 3-NN classifier (because it makes fewer assumptions about the data than SVM and LogitBoost). As can be seen, regardless of the feature selection method, the low-precision versions with 8, 16 and 32 bits are practically equivalent to the 64-bit baseline version (the highest probability values are obtained by the rope).
In the case of the 4-bit version, and as we have been observing in the results obtained so far, there is a statistically significant difference with respect to the 64-bit version, since the probability that the full approach using 64 bits wins over the 4-bit version (represented in the figures as p(64-bit)) is greater than 95% in all the cases.
Finally, Table 7 shows the runtime required by the three classification algorithms. In terms of classification accuracy, the best results were obtained by the SVM classifier. However, when comparing them by their computational time, a good choice would be the 3-NN classifier. This model has a slightly lower accuracy than the other two classifiers, but requires less than half of the time to classify. In addition, it can be observed how the computation time increases in the microarray datasets with the largest number of samples and classes (i.e., 9-tumors, 11-tumors, Brain-tumor-1 and Lung-cancer).
To sum up, these experimental results show that, with a reduced number of bits (32, 16 and even 8), the rankings change, but this variation does not significantly affect the classification performance, which is the ultimate measure of the goodness of a ranking-based feature selection method. However, this method also has some drawbacks. If there is a short distance between the population values of the mutual information, our low-precision approach will not be adequate. Besides, we will require additional bits as the number of features/samples of the dataset grows. Nevertheless, it is worth noting that our low-precision technique was created to evaluate data at the user level. When dealing with large data, these will most likely be acquired from a variety of sources, and will be processed either by more powerful central processors or distributed over multiple nodes for further analysis.
Note to Table 7: runtime is calculated as the average of the 3 repetitions of a cross-validation with 5 folds.

Conclusions
Driven by the proliferation of mobile computing and the Internet of Things, in this work we have applied mutual information using low-precision parameters within a feature selection procedure. The results obtained over 17 microarray datasets demonstrated that 8-bit representations were sufficient to obtain feature rankings similar to those of double-precision floating-point parameters, thus opening the door to the use of feature selection in Internet of Things devices in a way that minimizes energy consumption and carbon emissions. Regarding the three feature selection methods used to test our low-precision mutual information, we have found that MIM was the most appropriate for this challenging scenario, taking into account not only its performance in classification but also its computational complexity.
As future research, we plan to develop other feature selection methods in low-precision, such as those based on distances (ReliefF) or on correlations. It would be also interesting to apply other strategies to represent data with a low number of bits, such as dynamic fixed point, and different techniques for rounding.
Funding Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.