A classification algorithm based on multi-dimensional fuzzy transforms

We present a new classification algorithm for machine learning on numerical data based on the direct and inverse fuzzy transforms. In our previous work, fuzzy transforms were used for numerical attribute dependency in data analysis: the multi-dimensional inverse fuzzy transform was used to approximate the regression function. The classification method presented here is also based on this operator. More precisely, we apply the K-fold cross-validation algorithm to control the presence of over-fitting and to estimate the accuracy of the classification model: for each training (resp., testing) subset, an iterative process evaluates the best fuzzy partitions of the inputs. Finally, a weighted mean of the multi-dimensional inverse fuzzy transforms calculated for each training (resp., testing) subset is used for data classification. We compare this algorithm with five other classification methods on well-known datasets.


Introduction
In this paper we propose a new classification algorithm, called multi-dimensional F-transform classification (for short, MFC), in which the direct and inverse multi-dimensional fuzzy transforms are used to classify instances.
Our goal is to build a robust classification model based on the multi-dimensional fuzzy transform, integrating a K-fold cross-validation resampling method to overcome data over-fitting, which is one of the major problems of classification models.
Classification tasks (Aggarwal 2014; Duda et al. 2001; Han et al. 2012; Hastie et al. 2013; Johnson and Wichern 1992; Mitchell 1997; Mitra et al. 2002; Nirmalraj and Nagarajan 2020) consist of assigning patterns to exactly one of a set of predefined categories; this is a general problem that encompasses many diverse applications. Many techniques are used for data classification, such as probabilistic and decision tree algorithms, rule-based and instance-based methods, support vector machines and neural network algorithms (Aggarwal 2014; Witten et al. 2016).
Classification algorithms require a training phase, in which the knowledge to be fitted in the related model is acquired, followed by a testing phase, in which the performance of the algorithm is measured.
A good machine learning classification algorithm must be robust with respect to data under-fitting and over-fitting. Under-fitting occurs when the algorithm fails to fit the data sufficiently, producing high errors on both the training and the test data. The presence of under-fitting is evaluated by measuring a performance index after the learning process. Over-fitting is the opposite problem: it occurs when a machine learning algorithm captures noise in the data, so that the training data are optimally fitted but the model generalizes poorly. Evaluating the presence of over-fitting is more complex. There are two main techniques to limit over-fitting in machine learning algorithms: measuring the accuracy on a validation dataset, or adopting a resampling technique based on random subsets of the data.
After selecting and tuning the machine learning algorithm on the training dataset, we can evaluate the learned models on the validation dataset and hence measure how the models could perform on unseen data.
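As a minimal illustration of the validation-dataset technique just described, the following Python sketch (the helper name and split fraction are our own choices, not part of the MFC method) holds out a random validation subset; a large gap between training error and validation error then signals over-fitting:

```python
import random

def train_validation_split(data, val_fraction=0.2, seed=0):
    """Hold out a random validation subset of the data.

    A model is tuned on the first returned list (training data) and its
    accuracy is then measured on the second (validation data)."""
    data = list(data)
    random.Random(seed).shuffle(data)
    cut = int(len(data) * (1 - val_fraction))
    return data[:cut], data[cut:]
```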
In Sect. 2 we give the preliminary concepts. In Sect. 3 we recall the multi-dimensional F-transform, and in Sect. 4 the MFC algorithm is presented. In Sect. 5 we present the results of our experiments: we apply the MFC algorithm to many datasets known in the literature, comparing it with other classification methods. Conclusions are given in Sect. 6.

The MFC algorithm
Now we model the input data as a collection of instances. Each instance is characterized by a pair (X, Y), where X is a set of numerical attributes (X_1,…,X_s) and Y is a special attribute, designated as the class, which has C categories. We apply the fuzzy transform (for short, F-transform) algorithm (Perfilieva 2006) for finding a relation between the attributes in the form

Y = f(x_1, …, x_s)    (1)

The F-transform algorithm is a technique used to approximate an original function f of which only a finite number of values is known a priori, the expression of the function itself being unknown. The F-transform concept has been used in image processing (Di Martino and Sessa 2007, 2019a, b, 2020; Di Martino et al. 2008), data analysis (Di Martino et al. 2010, 2011; Di Martino and Sessa 2017a, b, 2019a, b, 2020; Novak et al. 2014; Formato et al. 2014; Perfilieva et al. 2008), time series analysis (Di Martino et al. 2011; Di Martino and Sessa 2017a, b; Johnson and Wichern 1992), forecasting problems (Di Martino et al. 2011; Di Martino and Sessa 2017a, b) and fuzzy approximation methods (Tanaka et al. 1982; Khastan et al. 2015, 2017).
In Lee and Yen (2004), a classification based on the dependency between attributes in datasets is proposed. Indeed, in Di Martino and Sessa (2017a, b) an attribute dependency based on the inverse multi-dimensional F-transform is presented as a regression function, and in Di Martino et al. (2011) as a prediction function for time series. In the algorithms based on F-transforms, a set of fuzzy partitions of the input data domains is created: an assigned family of n_i fuzzy sets is considered for each input variable. We say that the training dataset is sufficiently dense w.r.t. the fuzzy partitions if, for each combination of fuzzy sets of the input variables, there exists at least one instance with positive membership degree. We illustrate an example in the case of two attributes defined in their respective universes of discourse (cf. Sect. 3), the instances of the training dataset being represented by red points in the example of Fig. 1. In this example the data are not sufficiently dense w.r.t. the chosen fuzzy partitions: indeed, in the grey zone there is no instance with components p_1i and p_2j such that x_{1,i−1} < p_1i < x_{1,i+1} and x_{2,j−1} < p_2j < x_{2,j+1}. The case of Fig. 1 is due to a too fine-grained fuzzy partition of the input variable domains. The model fails to fit the output for input data in the grey zone of Fig. 1.
The F-transform attribute dependency algorithm (for short, FAD), presented in Di Martino and Sessa (2007), controls the presence of under-fitting and verifies whether the dataset is sufficiently dense with respect to the fuzzy partitions of the input variable domains. The FAD algorithm is an iterative process in which a uniform fuzzy partition with n fuzzy sets is created for each input variable. If the data are not sufficiently dense w.r.t. the fuzzy partition, the process stops and the functional dependency cannot be found; otherwise, the direct and inverse multi-dimensional F-transforms are calculated. Afterwards, an index of determinacy called MADMEAN (Di Martino and Sessa 2017a, b) is calculated; this index, ranging in [0,1], measures how well the functional model fits the data: the closer the value of MADMEAN is to 1, the better the model fits the data. If MADMEAN is greater than a threshold value α, the process stops and the inverse multi-dimensional F-transform is used to approximate the functional dependency; otherwise, a finer fuzzy partition is set with n := n + 1. In Fig. 2 we illustrate the FAD algorithm.
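The FAD iteration just described can be sketched as follows; `fit_and_score` is a hypothetical callable standing in for the partition construction, density check and MADMEAN computation, which the cited papers define in full:

```python
def fad_loop(fit_and_score, n_start=3, n_max=20, alpha=0.9):
    """Sketch of the FAD iteration: refine the uniform fuzzy partition
    (n := n + 1) until the index of determinacy exceeds alpha, or stop
    as soon as the data are no longer sufficiently dense.

    fit_and_score(n) -> (dense, madmean) is assumed, not part of the paper."""
    n = n_start
    while n <= n_max:
        dense, madmean = fit_and_score(n)
        if not dense:
            return None        # functional dependency not found
        if madmean > alpha:
            return n           # model accepted with n fuzzy sets per variable
        n += 1
    return None
```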
However, even if the training data are sufficiently dense w.r.t. the fuzzy partition, the presence of over-fitting cannot be prevented. Hence we propose a new iterative classification algorithm based on multi-dimensional F-transforms in which we apply the K-fold cross-validation resampling algorithm (K being a positive integer) to control this presence.
The K-fold cross-validation is the most popular resampling technique: it allows the model to be trained and tested K times on different subsets of the training dataset; moreover, an estimation of the performance of a machine learning model on unseen data is obtained as well.
A performance analysis of this validation technique is given in Wong (2015, 2017). In Wen et al. (2017), a new SVM K-fold cross-validation method is proposed to increase the efficiency of this approach.
A K-fold cross-validation algorithm works through K iterations. The sample dataset is partitioned into K folds (i.e., subsets). At each iteration, one fold is used as the validation subset and the other folds are joined to constitute the training set; then the classification process is performed. The main advantage of this method w.r.t. other resampling techniques is that each fold is used exactly once as the validation set. In Fig. 3 we schematize this technique for K = 4.
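A minimal sketch of this partitioning scheme in Python (our own helper, not the authors' code): each index appears in exactly one validation fold, and the remaining K−1 folds form the matching training set.

```python
import random

def kfold_indices(m, K, seed=0):
    """Randomly partition instance indices 0..m-1 into K folds of (nearly)
    equal size. Returns K pairs (training indices, validation indices):
    at iteration k, fold k is the validation subset and the union of the
    other K-1 folds is the training subset."""
    idx = list(range(m))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::K] for k in range(K)]
    return [(sorted(i for f, fold in enumerate(folds) if f != k for i in fold),
             sorted(folds[k]))
            for k in range(K)]
```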
As in the FAD algorithm (Di Martino and Sessa 2017a, b), by using a K-fold cross-validation technique we can calculate two indices, CV_1 (resp., CV_2), for evaluating the performance of our algorithm. These indices are given by the average over the K folds of the percentages of misclassified instances in the training data CV_1^k (resp., testing data CV_2^k), k = 1,…,K:

CV_1 = (1/K) · Σ_{k=1,…,K} CV_1^k    (2)

CV_2 = (1/K) · Σ_{k=1,…,K} CV_2^k    (3)

We set a threshold α (resp., β) for CV_1 (resp., CV_2); then the dataset is partitioned into K folds and a set of fuzzy partitions is created for the input attributes, verifying that each subset is sufficiently dense w.r.t. the set of fuzzy partitions. Then the direct F-transforms are calculated in each fold, and the average of the inverse F-transforms calculated for each fold is used to classify an instance. Strictly speaking, if f_{F,n_1…n_s}^k(p_1,…,p_s) is the inverse F-transform for x_1 = p_1,…,x_s = p_s in the kth fold, n_1,…,n_s being the numbers of fuzzy sets of the respective partitions, the class label assigned to the output variable Y is given by

y = [ (1/K) · Σ_{k=1,…,K} f_{F,n_1…n_s}^k(p_1,…,p_s) ]    (4)

where [a] stands for the smallest integer greater than or equal to the positive real number a. Then we calculate the two indices CV_1 and CV_2. If CV_1 ≥ α or CV_2 ≥ β, the process is iterated by considering a finer set of fuzzy partitions; otherwise the process stops. Summarizing, the MFC algorithm consists of the following steps:

1. The number of folds K and the thresholds α and β for the two indices CV_1 and CV_2 are defined.
2. The patterns formed by the input attributes X_1,…,X_s and the output Y are extracted from the dataset. The input attributes are numerical and the output attribute contains the class label assigned to each instance.
3. The set of instances is randomly partitioned into K folds of equal size, by using the schema of Fig. 3.
4. For each input attribute X_i, i = 1,…,s, a uniform fuzzy partition is created with the basic functions A_i1, A_i2, …, A_in_i.
5. For each fold we verify that the corresponding training subset is sufficiently dense w.r.t. the set of fuzzy partitions. If for at least one fold the sufficient density is not respected, the process stops and the classification is not found.
6. For each fold the direct F-transform is calculated.
7. For each fold the index CV_1^k, k = 1,…,K, is calculated as the percentage of misclassified patterns in the training subset, where the classification is performed by using the average (4).
8. For each fold the index CV_2^k, k = 1,…,K, is calculated as the percentage of misclassified patterns in the testing subset, where the classification is performed by using the average (4).
9. The indices CV_1 and CV_2 are calculated via Eqs. (2) and (3), respectively. If CV_1 ≥ α or CV_2 ≥ β, a finer fuzzy partition set is considered and the process returns to step 3.
10. The classification is validated.
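Steps 7–10 above can be made concrete with a small Python sketch (the helper names are ours; we interpret the bracket [a] of Eq. (4) as the ceiling, per the text's description):

```python
import math

def assign_class(inverse_values):
    """Eq. (4): average the K inverse F-transform values computed for a
    pattern and take [a], the smallest integer >= the average, as the class."""
    avg = sum(inverse_values) / len(inverse_values)
    return math.ceil(avg)

def cv_indices(miscls_train, size_train, miscls_test, size_test):
    """Eqs. (2)-(3): CV_1 (resp., CV_2) is the average over the K folds of
    the percentage of misclassified training (resp., testing) instances."""
    K = len(miscls_train)
    cv1 = sum(100.0 * m / n for m, n in zip(miscls_train, size_train)) / K
    cv2 = sum(100.0 * m / n for m, n in zip(miscls_test, size_test)) / K
    return cv1, cv2

def mfc_stopping_rule(cv1, cv2, alpha, beta):
    """Step 9: True means stop (classification validated); False means a
    finer fuzzy partition set must be considered."""
    return not (cv1 >= alpha or cv2 >= beta)
```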
In Fig. 4 we schematize the MFC algorithm.

Multi-dimensional F-transforms
Following the definitions and notations of Perfilieva (2006), let n ≥ 2 and x_1, x_2,…, x_n be points of [a,b], called nodes, such that x_1 = a < x_2 < … < x_n = b. An assigned family of fuzzy sets A_1,…,A_n: [a,b] → [0,1], called basic functions, is a fuzzy partition of [a,b] if the following properties hold:

(1) A_i(x_i) = 1 for every i = 1,2,…,n;
(2) A_i(x) = 0 if x ∉ (x_{i−1}, x_{i+1}) for i = 2,…,n;
(3) A_i(x) is a continuous function on [a,b];
(4) A_i(x) strictly increases on [x_{i−1}, x_i] for i = 2,…,n and strictly decreases on [x_i, x_{i+1}] for i = 1,…,n − 1;
(5) A_1(x) + A_2(x) + … + A_n(x) = 1 for every x ∈ [a,b].

We say that the fuzzy sets {A_1(x),…,A_n(x)} form a uniform fuzzy partition if

(6) n ≥ 3 and x_i = a + h·(i − 1), where h = (b − a)/(n − 1) and i = 1,2,…,n (that is, the nodes are equidistant);
(7) A_i(x_i − x) = A_i(x_i + x) for every x ∈ [0,h] and i = 2,…,n − 1, and A_{i+1}(x) = A_i(x − h) for every x ∈ [x_i, x_{i+1}] and i = 2,…,n − 2.

Avoiding the case of a continuous function studied in Perfilieva (2006) and considering the discrete case, which is applicable in many real situations, let the function f: [a,b] → R assume determined values only in a finite set of points P = {p_1,…,p_m} ⊂ [a,b]. The direct F-transform of f w.r.t. {A_1, A_2,…, A_n} is the n-tuple [F_1, F_2,…, F_n], where each component F_i is given by

F_i = ( Σ_{j=1,…,m} f(p_j)·A_i(p_j) ) / ( Σ_{j=1,…,m} A_i(p_j) )

for i = 1,…,n. Similarly, we define the inverse F-transform of f w.r.t. {A_1, A_2,…, A_n} by setting

f_{F,n}(p_j) = Σ_{i=1,…,n} F_i·A_i(p_j)

for every j ∈ {1,…,m}. We have the following theorem (Perfilieva 2006):

Theorem 1. Let f(x) be a function assigned on a set P = {p_1,…, p_m} ⊂ [a,b] and assuming values in [0,1]. Then for every ε > 0, there exist an integer n(ε) and a related fuzzy partition {A_1, A_2, …, A_{n(ε)}} of [a,b] w.r.t. which P is sufficiently dense. Moreover, for every p_j ∈ P, j = 1,…,m, the following inequality holds:

|f(p_j) − f_{F,n(ε)}(p_j)| < ε.

The above concepts can be extended to functions of s (≥ 2) variables. Let the Cartesian product [a_1,b_1] × … × [a_s,b_s] be the universe of discourse. In the discrete case, the function f(x_1, …, x_s) assumes determined values in the m points (p_j1, p_j2,…, p_js), j = 1,…,m. We say that the set P = {(p_11, p_12,…,p_1s), (p_21, p_22,…, p_2s),…,(p_m1, p_m2,…,p_ms)} is sufficiently dense w.r.t. the fuzzy partitions {A_11,…, A_1n_1},…, {A_s1,…, A_sn_s} if, for each combination of basic functions, there exists at least one point of P with positive membership degree in all of them.

Proposed algorithm
Continuing with the notations given in Sect. 2, we consider a dataset formed by s numerical attributes X_1,…,X_s and a class attribute Y ∈ {1,…,C}. We try to find a dependency between the attributes in the form (1), where f is a discrete function known only at the M instances of the dataset. In the MFC algorithm we apply the K-fold validation algorithm by partitioning the dataset randomly into K folds having the same number of instances. Each fold is formed by a training subset with m = ((K − 1)/K)·M instances and a testing subset with M/K instances, where p_ji denotes the value of the attribute X_i for the instance O_j. Each attribute X_i can be considered as a numerical variable assuming values in the domain [a_i,b_i], where a_i = min{p_1i,…,p_mi} and b_i = max{p_1i,…,p_mi}. The parameters to be fixed are the number K of folds and the thresholds α (resp., β) for CV_1 (resp., CV_2). We first construct a uniform fuzzy partition of the domain [a_i,b_i] of each input attribute, constituted by the following n_i triangular basic functions:

A_i1(x) = 1 − (x − a_i)/h_i if x ∈ [x_i1, x_i2], and 0 otherwise;
A_ij(x) = (x − x_{i,j−1})/h_i if x ∈ [x_{i,j−1}, x_ij], (x_{i,j+1} − x)/h_i if x ∈ [x_ij, x_{i,j+1}], and 0 otherwise, for j = 2,…,n_i − 1;
A_in_i(x) = (x − x_{i,n_i−1})/h_i if x ∈ [x_{i,n_i−1}, x_in_i], and 0 otherwise,

where h_i = (b_i − a_i)/(n_i − 1), x_ij = a_i + h_i·(j − 1), i = 1,2,…,s and j = 1,2,…,n_i. Initially we consider the coarsest-grained uniform fuzzy partition by setting n_1 = n_2 = … = n_s = 3. Afterwards, in each fold we check that the data are sufficiently dense w.r.t. the fuzzy partitions, that is, we verify that for any combination of basic functions A_{1h_1}, …, A_{sh_s} there exists at least one instance O_j such that A_{1h_1}(p_j1)·…·A_{sh_s}(p_js) > 0. If this condition is not satisfied in some fold, the process stops and the classification is not found; otherwise, the direct F-transform is calculated in each fold. By setting f(p_j1,p_j2,…,p_js) = y_j for j = 1,2,…,m, the direct F-transform of f is constructed by using Eq.
(9) in the following form:

F^k_{h_1…h_s} = ( Σ_{j=1,…,m} y_j·A_{1h_1}(p_j1)·…·A_{sh_s}(p_js) ) / ( Σ_{j=1,…,m} A_{1h_1}(p_j1)·…·A_{sh_s}(p_js) ),

where k = 1,…,K ranges over the folds. Given a set of values p_1,…,p_i,…,p_s for the input variables X_1,…,X_i,…,X_s, we calculate the inverse F-transform f^k_{F,n_1 n_2…n_s} for each fold, defined as

f^k_{F,n_1 n_2…n_s}(p_1,…,p_s) = Σ_{h_1=1,…,n_1} … Σ_{h_s=1,…,n_s} F^k_{h_1…h_s}·A_{1h_1}(p_1)·…·A_{sh_s}(p_s).

The algorithm is schematized below in the form of pseudocode. The function Classify() returns the class y_f assigned to a pattern with input values p = (p_1,…,p_s) by using the inverse F-transform. The MFC algorithm compares the resulting class with the value y of the attribute Y assigned in the dataset in order to calculate the numbers of misclassified patterns CV_1^k and CV_2^k for the kth training and testing subsets, respectively, and hence the two indices CV_1 and CV_2.

Algorithm MFC

Arguments:
α, β, K
Return value: Boolean (TRUE = "classification found", FALSE = "classification not found")
Input: dataset composed of M instances with input attributes X_1, X_2,…,X_s and class attribute Y
Output: direct F-transform components

1  status := FALSE
2  n_1 := 3, …, n_s := 3
3  WHILE (status == FALSE)
4    Partition the dataset randomly into K subsets, creating the K folds
5    FOR i = 1 TO s
6      Create a uniform fuzzy partition {A_i1, A_i2, …, A_in_i} of the interval [a_i,b_i] by using the basic functions (13)
7    NEXT i
8    FOR k = 1 TO K
9      IF the kth training subset is not dense w.r.t. the fuzzy partition product THEN
10       RETURN status // "classification not found"
11     ELSE
12       Calculate the direct F-transform components by using Eq. (9)
…
…        (p_1, p_2, …, p_s) by using Eq. (15)
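The sufficient-density test at line 9 of the pseudocode can be sketched as follows (our own helper, not the authors' code; note that it enumerates every combination of basic functions, so its cost grows exponentially with the number of attributes, as the condition itself implies):

```python
import numpy as np
from itertools import product

def basic_functions_1d(x, a, b, n):
    """Degrees A_ij(x) of the n triangular basic functions of a uniform
    fuzzy partition of [a, b]; returns an (n, len(x)) array."""
    h = (b - a) / (n - 1)
    nodes = a + h * np.arange(n)
    x = np.asarray(x, dtype=float)
    return np.maximum(0.0, 1.0 - np.abs(x[None, :] - nodes[:, None]) / h)

def sufficiently_dense(P, ns):
    """Check that dataset P (m x s array) is sufficiently dense w.r.t. the
    uniform fuzzy partitions with ns[i] basic functions on each attribute
    domain [min, max]: every combination of basic functions must cover at
    least one instance with positive membership in all s dimensions."""
    m, s = P.shape
    # membership of every instance in every basic function, per attribute
    A = [basic_functions_1d(P[:, i], P[:, i].min(), P[:, i].max(), ns[i])
         for i in range(s)]
    for combo in product(*(range(n) for n in ns)):
        deg = np.ones(m)
        for i, j in enumerate(combo):
            deg *= A[i][j]
        if not np.any(deg > 0):
            return False  # an uncovered "grey zone", as in Fig. 1
    return True
```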
In each experiment we randomly select a set of instances to be stored in a testing dataset. The other instances are randomly partitioned into K folds. We set K = 10, α = 2%, β = 4%. After applying the MFC algorithm, we use the classifier on the testing dataset, obtaining the error as the percentage of misclassified instances.
For brevity, we present the detailed results obtained only for four datasets: IRIS, balance-scale, banana and wine. We first present the results for the IRIS dataset, which is formed by 150 instances and 5 attributes. The dataset consists of 50 samples of each of three species of plant: Iris setosa, Iris versicolor and Iris virginica. For each instance the length and the width (in cm) of the sepals and petals are measured. Based on the combination of these four features, the biologist R. Fisher (https://en.wikipedia.org/wiki/Iris_flower_data_set) developed a linear discriminant model to distinguish the species from each other. It is well known that only Iris setosa is linearly separable from the other two.
In Table 1 we show the attributes in the IRIS dataset.
We use a testing dataset of 50 instances downloaded from the IRIS WEB page project (http://netcologne.dl.sourceforge.net/project/irisdss/IRIS%20Test%20data.txt). Table 3 shows that the MFC algorithm misclassifies the data in only one case.
Now we present the results obtained by applying the MFC algorithm on the balance-scale dataset: the attributes are the left weight, the left distance, the right weight and the right distance (Table 4).
The dataset is composed of 625 instances. We randomly select 600 instances, storing the other 25 instances for testing the classification results; we set K = 10, so that each fold contains 540 (resp., 60) training (resp., testing) instances. In Table 5 we show the results obtained for each fold.
We apply the classifier to the testing dataset of 25 instances. In Table 6 we show the results obtained: an instance is improperly classified in only one case.
We present a third experiment in which we apply the MFC algorithm to the Keel Banana dataset, provided by the Fraunhofer Intelligent Data Analysis Group (https://www.iais.fraunhofer.de/). It contains instances formed by three attributes: the first two represent the X and Y coordinates of the points of the banana-shaped clusters, and the last attribute assumes the values −1 and 1, representing the two banana shapes (Table 7).
The dataset contains 5300 instances. We randomly select 300 instances for the final testing phase and randomly partition the other 5000 instances into ten folds. Each fold consists of 4500 (resp., 500) training (resp., testing) instances. In Table 8 we show CV_1 and CV_2 obtained for each fold and their averages. In Table 9 we show the results obtained on the testing dataset.
Finally, we show the results of the tests performed on the UCI machine learning Wine dataset. This dataset is given by 178 instances having 13 numerical features characterizing the chemical composition of Italian wines. Each wine belongs to one of three different crops; the dataset is partitioned into three classes, identified as 1, 2 and 3, to which 59, 71 and 48 instances belong, respectively. We randomly select 150 instances, storing the other 28 instances for testing the classification results, and construct K = 10 folds from the 150 training instances. In Table 10 we show the values of CV_1 and CV_2 obtained for each fold and their averages; Table 11 shows the results obtained on the testing dataset.
For completeness, we present a comparison of the MFC algorithm with classification algorithms implemented in the mining tool WEKA (Waikato Environment for Knowledge Analysis, https://www.cs.waikato.ac.nz/ml/weka/; Witten et al. 2016). For each algorithm we set K = 10. In Table 12 we show the mean running time (in ms) and the testing errors obtained by applying the MFC algorithm and the classification algorithms decision tree J48 (Bhargawa et al. 2013; Mitchell 1997), Multi-Layer Perceptron (Pal and Mitra 1992, 2004; Chaudhuri and Bhattacharya 2007), naive Bayes (Dimitoglou et al. 2012; Panda and Matra 2007) and lazy K-Nearest Neighbor IBK (Aha 1997; Bhargawa et al. 2013; Jiang et al. 2007). In Table 13 we show the mean running time, the mean and the standard deviation of the testing percentage errors obtained by applying the five classification algorithms to all the datasets used in our tests.
These results show that the performance of the MFC algorithm, obtained by setting α = 2% and β = 4%, is comparable with that of the decision tree J48 and Multilayer Perceptron algorithms. Moreover, MFC produces better performance results than the lazy IBK and naive Bayes algorithms, although it gives slightly higher running times.
In Table 14, we show the mean accuracy, precision and recall classification measures obtained by running the five algorithms on all the test datasets.These results show that MFC, in addition to producing accurate results, also has high precision and high sensitivity, comparable to those obtained using the Decision tree J48 and Multilayer Perceptron algorithms.
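The measures reported in Table 14 can be computed per class as in the following sketch (a one-vs-rest illustration of our own; it is not WEKA's exact implementation, and macro-averaging across classes is one common way to obtain a single mean figure):

```python
def class_metrics(y_true, y_pred, positive):
    """Precision, recall and overall accuracy, treating `positive` as the
    target class; true/false positives and false negatives are counted
    one-vs-rest against all other classes."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, accuracy
```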

Conclusions
In this paper we propose a new classification algorithm based on the direct and inverse F-transforms for machine learning data. We apply the K-fold cross-validation resampling algorithm to control the presence of over-fitting in the data. The MFC algorithm is an iterative algorithm in which the dataset is randomly partitioned into K subsets; the algorithm then calculates the performance indices CV_1^k (resp., CV_2^k) for each fold, corresponding to the percentage of misclassified instances in the training (resp., testing) subsets. Initially a coarse-grained fuzzy partition of the data domain is set: if the two indices averaged over the folds are below the specified thresholds, the algorithm stops; otherwise a finer fuzzy partition of the data domain is set.
We compare the MFC algorithm with other well-known classification algorithms implemented in the mining tool WEKA. The results obtained for over 100 classification datasets show that the performance of the MFC algorithm is better than that obtained by using the naive Bayes and lazy IBK algorithms, and comparable with that obtained by the decision tree J48 and Multilayer Perceptron algorithms.

Fig. 1
Fig. 1 Example of training dataset not sufficiently dense w.r.t. the fuzzy partitions

Fig. 3
Fig. 3 Schema of a fourfold cross-validation technique

Fig. 4
Fig. 4 Schema of the MFC algorithm

(continuation of the Algorithm MFC listing)
19     FOR each (p,y) in the kth training subset
20       y_f = Classify(C, p) // class assigned to the pattern p
21       IF (y_f != y) THEN
22         CV_1^k := CV_1^k + 1
23       END IF
24     CV_1 := CV_1 + CV_1^k
25     CV_2^k := 0
26     FOR each (p,y) in the kth testing subset
27       y_f = Classify(C, p) // class assigned to the pattern p
28       IF (y_f != y) THEN
29         CV_2^k …

Table 1
Attributes in the IRIS dataset

Table 3
Test results obtained for a sample of 50 instances

Table 4
Attributes in the balance scale dataset

Table 6
Test results obtained for a sample of 25 instances

Table 12
Running time, misclassified instances and testing percentage errors obtained for the datasets IRIS, balance scale, banana and wine by using different classification algorithms

Table 13
Mean running time and mean and standard deviation testing percentage error obtained by using the 5 classification algorithms