Attribute dependency data analysis for massive datasets by fuzzy transforms

We present a numerical attribute dependency method for massive datasets based on the concepts of direct and inverse fuzzy transform. In a previous work, we used these concepts for numerical attribute dependency in data analysis: therein, the multi-dimensional inverse fuzzy transform was useful for approximating a regression function. Here we extend that method to massive datasets, to which the previous method could not be applied because of its memory requirements. Our method is tested on a large dataset formed by 402,678 census sections of the Italian regions, provided by the Italian National Statistical Institute (ISTAT) in 2011. The results of comparative tests with two well-known regression methods, support vector regression and multilayer perceptron, show that the proposed algorithm has performance comparable with these two methods. Moreover, the number of parameters required by our method is smaller than that of the two cited algorithms.


Introduction
Data analysis and data mining knowledge discovery processes represent powerful functionalities that can be combined in knowledge-based expert and intelligent systems in order to extract and build knowledge starting from data. In particular, attribute dependency analysis is necessary to reduce the dimensionality of the data and to detect hidden relations between features. Nowadays, in many application fields, data sources are massive (for example, web social data, sensor data, etc.), and it is necessary to implement knowledge extraction methods that can operate on massive data. Massive (Very Large (VL) and Large (L)) datasets (Chen and Zhang 2014) are continuously produced and updated, and they cannot be managed by traditional databases. Today, access to these datasets via the Web has led to the development of technologies for managing them (cf., e.g., Dean 2014; Leskovec et al. 2014; Singh et al. 2015).
Machine learning soft computing models have been proposed in the literature to perform nonlinear regression on high-dimensional data; two well-known machine learning nonlinear regression algorithms are support vector regression (SVR) (Drucker et al. 1996) and the multilayer perceptron (MLP) (cf., e.g., Collobert and Bengio 2004; Cybenko 1989; Hastie et al. 2009; Haykin 1999, 2009; Murtagh 1991; Schmidhuber 2014). The main problems of these algorithms are the complexity of the model, due to the presence of many parameters to be set by the user, and overfitting, a phenomenon in which the regression function fits the training data optimally but fails in predictions on new data. K-fold cross-validation techniques have been proposed in the literature to avoid overfitting (Anguita et al. 2005). In Thomas and Suhner (2015), a pruning method based on variance sensitivity analysis is proposed to find the optimal structure of a multilayer perceptron in order to mitigate overfitting problems. In Han and Jian (2019), a novel sparse-coding kernel algorithm is proposed to overcome overfitting in disease diagnosis.
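As a concrete illustration of the k-fold idea mentioned above, the following minimal Python sketch (our own illustration, not taken from any of the cited works) generates the train/validation index splits:

```python
def kfold_indices(n, k):
    """Yield (train, validation) index lists for k-fold cross-validation.

    The n samples are split into k near-equal folds; each fold serves once
    as validation set while the remaining folds form the training set.
    """
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size
```

In practice the sample indices would be shuffled before splitting, so that each fold is a random subsample.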
Some authors have proposed variations of nonlinear machine learning regression models to manage massive data. In Cheng et al. (2010) and Segata and Blanzieri (2009), fast local support vector machine (SVM) methods for large datasets are presented, in which a set of multiple local SVMs for low-dimensional data is constructed. In Zheng et al. (2013), the authors proposed an incremental version of the support vector machine regression model to manage large-scale data. In Peng et al. (2013), the authors proposed a parallel architecture of a logistic regression model for massive data management. Recently, variations of the extreme learning machine (ELM) regression method for massive data, based on the MapReduce model, have been presented (Chen et al. 2017; Yao and Ge 2019).
The presence of a high number of parameters makes the SVR and MLP methods too complex to be integrated as components into an intelligent or expert system. In this research, we propose a model of attribute dependency in massive datasets based on the use of the multi-dimensional fuzzy transform. We extend the attribute dependency method presented in Martino et al. (2010a) to massive datasets, using the inverse multi-dimensional fuzzy transform as a regression function. Our goal is to guarantee high performance of the proposed method in the analysis of massive data while maintaining the usability of the previous multi-dimensional fuzzy transform attribute dependency method. As in Jun et al. (2015), we use a random sampling algorithm to subdivide the dataset into subsets of equal cardinality.
The fuzzy transform (F-transform) method (Perfilieva 2006) is a technique which approximates a given function by another function, up to an arbitrary constant. This approach is particularly flexible in applications such as image processing (cf., e.g., Martino et al. 2008, 2010b, 2011b; Sessa 2007, 2012) and data analysis (cf., e.g., Martino et al. 2010a, 2011a; Perfilieva et al. 2008). In this last line of work, an algorithm called FAD (F-transform Attribute Dependency) evaluates an attribute X_z depending on k attributes X_1, …, X_k (predictors), with z ∉ {1, 2, …, k}, i.e. X_z = H(X_1, …, X_k), where the (unknown) function H is approximated by the inverse multi-dimensional F-transform via a procedure presented in Perfilieva et al. (2008). In Martino et al. (2010a), the error of this approximation is measured by a statistical index of determinacy (Draper and Smith 1988; Johnson and Wichern 1992): if it exceeds a prefixed threshold, the functional dependency is considered found. Each attribute X_i has an interval [a_i, b_i], i = 1, …, k, as domain of knowledge, on which a uniform fuzzy partition of fuzzy sets {A_i1, A_i2, …, A_in} (whose definition is given in Sect. 2) is defined. The main problem in the use of the inverse F-transform for approximating the function H is that the data may not be sufficiently dense with respect to the fuzzy partitions. The FAD algorithm handles this problem with an iterative process, shown in Sect. 3. If the data are not sufficiently dense with respect to the fuzzy partitions, the process stops; otherwise, an index of determinacy is calculated. If this index is greater than a threshold α, the functional dependency is found and the inverse F-transform is taken as the approximation of H; otherwise, a finer fuzzy partition is set with n := n + 1. The FAD algorithm is schematized in Fig. 1.
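To make the notion of uniform fuzzy partition more concrete, here is a minimal one-dimensional Python sketch of the triangular basic functions commonly used with F-transforms (function names are ours, not the authors'):

```python
def uniform_partition_nodes(a, b, n):
    """Nodes x_1..x_n of a uniform fuzzy partition of [a, b] (n >= 2)."""
    h = (b - a) / (n - 1)  # spacing between consecutive nodes
    return [a + j * h for j in range(n)], h

def basic_function(x, j, nodes, h):
    """Triangular basic function A_j centred at nodes[j].

    On [a, b] the functions A_1..A_n sum to 1 at every point
    (the Ruspini condition required of a fuzzy partition).
    """
    return max(0.0, 1.0 - abs(x - nodes[j]) / h)
```

Refining the partition, as FAD does with n := n + 1, simply adds one node and shrinks the support of each basic function.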
In this paper, we propose an extension of the FAD algorithm, called MFAD (massive F-transform attribute dependency) for finding dependencies between numerical attributes in massive datasets. In other words, by using a uniform sampling method, we can apply the algorithm of Martino et al. (2010a) to several sample subsets of the data and hence we extend the results obtained to the overall dataset with suitable mathematical artifices.
Indeed, the dataset is partitioned randomly into s subsets of equal cardinality, to each of which we apply the F-transform method.
Let D_l = [a_1l, b_1l] × ··· × [a_kl, b_kl], l = 1, …, s, be the Cartesian product of the domains of the attributes X_1, X_2, …, X_k, where a_il and b_il are the minimum and maximum values of X_i in the l-th subset. The multi-dimensional inverse F-transform H^F_{n_1l n_2l … n_kl} is calculated to approximate the function H in the domain D_l, and an index of determinacy r²_cl is calculated to evaluate the error of this approximation in D_l. For simplicity, we put n_1l = n_2l = ··· = n_kl = n_l and thus write H^F_{n_1l n_2l … n_kl} = H^F_{n_l}. In order to obtain the final approximation of H, we introduce weights accounting for the contribution of each inverse F-transform H^F_{n_l}: we take the weighted mean of H^F_{n_1}, …, H^F_{n_s}, using the indices of determinacy r²_c1, …, r²_cs as weights. The approximated value of H in a point (x_1, …, x_k) ∈ ⋃_{l=1}^{s} D_l is given by

H^F(x_1, x_2, …, x_k) = ( Σ_{l=1}^{s} w_l(x_1, x_2, …, x_k) · H^F_{n_l}(x_1, x_2, …, x_k) ) / ( Σ_{l=1}^{s} w_l(x_1, x_2, …, x_k) )

where w_l(x_1, x_2, …, x_k) = r²_cl if (x_1, x_2, …, x_k) ∈ D_l and w_l(x_1, x_2, …, x_k) = 0 otherwise. For example, consider two attributes X_1 and X_2 as inputs and suppose, for simplicity, that the dataset is partitioned into two subsets. Figure 2 shows two rectangles D_1 (red) and D_2 (green). The zone labeled A of the input space is covered only by the domain D_2: in this zone the weight w_1 is null and only H^F_{n_2} contributes to the evaluation of H. In the zone labeled B, the inverse F-transforms calculated for both subsets contribute to the final evaluation of H, each with a weight corresponding to its index of determinacy. Figure 3 shows the schema of MFAD. We apply our method to an L dataset loadable in memory, so that the method of Martino et al. (2010a) can also be applied and the results of the two methods can be compared. As test dataset, we consider the last Italian census data acquired during 2011 by ISTAT (Italian National Statistical Institute).
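The weighted-mean combination just described can be sketched in Python as follows (a hypothetical fragment of ours: the per-subset inverse F-transforms are passed in as callables, and the domains D_l as pairs of bound tuples):

```python
def combine_estimates(x, domains, inverse_fts, det_indices):
    """Weighted mean H^F(x) of per-subset inverse F-transforms.

    domains     : list of (lower, upper) bound tuples describing each D_l
    inverse_fts : list of callables approximating H on the corresponding D_l
    det_indices : indices of determinacy r^2_cl, used as the weights w_l
    Returns None when x lies outside every D_l (H^F is undefined there).
    """
    num = den = 0.0
    for (lo, hi), H_l, r2 in zip(domains, inverse_fts, det_indices):
        if all(l <= xi <= u for xi, l, u in zip(x, lo, hi)):  # w_l = r2 inside D_l
            num += r2 * H_l(x)
            den += r2
    return num / den if den > 0.0 else None
```

In a point covered by a single domain (zone A of Fig. 2), the result reduces to that subset's inverse F-transform; where domains overlap (zone B), estimates are blended in proportion to their indices of determinacy.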
Section 2 recalls the F-transform in one and more variables. In Sect. 3, the F-transform attribute dependency method is presented; Sect. 4 contains the results of our tests. Conclusions are drawn in Sect. 5.

F-transforms in one and k variables
Following the definitions of Perfilieva (2006), we recall the main notations to make this paper self-contained.

FAD algorithm
We schematize a dataset in tabular form, where X_1, …, X_i, …, X_r are the involved attributes, O_1, …, O_j, …, O_m (m > r) are the instances, and p_ji is the value of the attribute X_i for the instance O_j. Each attribute X_i can be considered as a numerical variable assuming values in the domain [a_i, b_i], where a_i = min{p_1i, …, p_mi} and b_i = max{p_1i, …, p_mi}. We analyze functional dependencies between attributes of the form X_z = H(X_1, X_2, …, X_k), where z ∈ {1, …, r}, k ≤ r < m, and H: [a_1, b_1] × ··· × [a_k, b_k] → [a_z, b_z]. On each [a_i, b_i], i = 1, 2, …, k, a uniform fuzzy partition {A_i1, …, A_ij, …, A_in} is defined. By setting H(p_j1, p_j2, …, p_jk) = p_jz for j = 1, 2, …, m, the components of the direct F-transform of H are computed as in Sect. 2. The error of the approximation is evaluated in the points (p_j1, p_j2, …, p_jk) by using the following statistical index of determinacy (Draper and Smith 1988; Johnson and Wichern 1992):

r²_c = ( Σ_{j=1}^{m} (H^F_{n_1 n_2 … n_k}(p_j1, p_j2, …, p_jk) − p̄_z)² ) / ( Σ_{j=1}^{m} (p_jz − p̄_z)² )    (11)

where p̄_z is the mean of the values of the attribute X_z. If r²_c = 0 (resp., r²_c = 1), then (11) does not fit (resp., fits perfectly) the data. However, we use a variation r′²_c of (11), given in (12), which takes into account both the number of independent variables and the scale of the sample used (Martino et al. 2010a). The pseudocode of the FAD algorithm is schematized below.
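For concreteness, here is a minimal one-dimensional Python sketch of the discrete direct and inverse F-transforms and of the index of determinacy (11); names and structure are ours, not the authors' pseudocode:

```python
def direct_ft(xs, ys, a, b, n):
    """Discrete direct F-transform components F_1..F_n of data (xs, ys) on [a, b].

    Returns (None, nodes, h) when the data are not sufficiently dense,
    i.e. some basic function covers no data point; this is the situation
    that makes FAD stop its iteration.
    """
    h = (b - a) / (n - 1)
    nodes = [a + j * h for j in range(n)]
    F = []
    for c in nodes:
        num = den = 0.0
        for x, y in zip(xs, ys):
            w = max(0.0, 1.0 - abs(x - c) / h)  # triangular basic function at c
            num += w * y
            den += w
        if den == 0.0:
            return None, nodes, h
        F.append(num / den)
    return F, nodes, h

def inverse_ft(x, F, nodes, h):
    """Inverse F-transform: basic-function-weighted sum of the components F_j."""
    return sum(Fj * max(0.0, 1.0 - abs(x - c) / h) for Fj, c in zip(F, nodes))

def index_of_determinacy(y_true, y_model):
    """r^2_c as in (11): explained-variance ratio of model values vs. data."""
    m = sum(y_true) / len(y_true)
    return sum((y - m) ** 2 for y in y_model) / sum((y - m) ** 2 for y in y_true)
```

On dense data and a smooth underlying function, the inverse F-transform reproduces the function closely; on sparse data, `direct_ft` reports the insufficient-density condition instead of producing unreliable components.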
The function DirectFuzzyTransform() is used to calculate each direct F-transform component. The function BasicFunction() calculates the value A_ih_i(x) of the h_i-th basic function of the i-th fuzzy partition for an assigned x. The function IndexOfDeterminacy() calculates the index of determinacy.

MFAD algorithm
We consider a massive dataset DT composed of r attributes X_1, …, X_i, …, X_r and m instances O_1, …, O_j, …, O_m (m > r). We partition DT into s subsets DT_1, …, DT_s with the same cardinality by uniform random sampling, in such a way that each subset is loadable in memory. We apply the FAD algorithm to each subset, calculating the direct F-transform components, the inverse F-transforms H^F_{n_1}, …, H^F_{n_s}, the indices of determinacy r′²_c1, …, r′²_cs and the domains D_1, …, D_s, where D_l = [a_1l, b_1l] × ··· × [a_kl, b_kl], l = 1, …, s. All these quantities are saved in memory. If a dependency is not found for the l-th subset, the corresponding value of r′²_cl is set to 0. The pseudocode of MFAD is given below. Now we consider a point (x_1, x_2, …, x_k) ∈ ⋃_{l=1}^{s} D_l. In order to approximate the function H(x_1, x_2, …, x_k), we calculate the weights w_l as above and the weighted mean H^F(x_1, x_2, …, x_k), which is also the estimated value of X_z. To analyze the performance of the MFAD algorithm, we execute a set of experiments on a large dataset formed by 402,678 census tracts of the Italian regions provided by the Italian National Statistical Institute (ISTAT) in 2011. Therein, 140 numerical attributes belong to categories including:

• Inhabitants
• Foreigner and stateless inhabitants

The FAD method is applied on the overall dataset, the MFAD method is applied by partitioning the dataset into s subsets, and we perform the tests varying the value of the parameter s and setting the threshold α = 0.7.
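The uniform random sampling step of MFAD can be sketched as follows (a minimal Python illustration of ours; the seed is an arbitrary choice that only makes the sampling reproducible):

```python
import random

def random_partition(num_rows, s, seed=42):
    """Split row indices 0..num_rows-1 into s random subsets of near-equal size.

    Each subset is intended to be small enough to load in memory,
    as required by the per-subset FAD runs in MFAD.
    """
    rng = random.Random(seed)
    idx = list(range(num_rows))
    rng.shuffle(idx)                      # uniform random ordering of the rows
    return [idx[l::s] for l in range(s)]  # deal the shuffled rows round-robin
```

The round-robin split after shuffling guarantees that subset cardinalities differ by at most one.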
In addition, we compare the MFAD algorithm with the support vector regression (SVR) and multilayer perceptron (MLP) algorithms. Table 1 shows the 402,678 census tracts of Italy divided by region. Table 2 shows the approximate number of census tracts in each subset for each partition of the dataset into s subsets.

Experiments
In each experiment, we apply the MFAD algorithm to analyze the dependency of an output attribute X_z on a set of input attributes X_1, X_2, …, X_k. In all the experiments, we set α = 0.7 and partition the dataset randomly into s subsets. We now show the results obtained in three experiments.

Experiment A
In this experiment, we explore the relation between the density of resident population holding a university (laurea) degree and the density of employed resident population. Generally speaking, a higher density of population with a degree should correspond to a greater density of employed population. The attribute dependency explored is X_z = H(X_1), where

• Input attribute: X_1 = Resident population with laurea degree
• Output attribute: X_z = Resident population over 15 employed

We apply the FAD algorithm on different random subsets of the dataset, and then we calculate the index of determinacy (12). In Table 3, we show the value of the index of determinacy r′²_cl obtained for different values of s. For s = 1, we have the overall dataset.
The results in Table 3 show that the dependency has been found. We obtain r′²_cl = 0.760 by using the FAD algorithm on the entire dataset, while the best value of r′²_cl reached by using MFAD is 0.758 for s = 16. Hence the smallest difference between the two algorithms is 0.002. Figure 4 shows in abscissas the input X_1 and in ordinates the output H^F(x_1) for s = 1, 10, 16, 40.

Experiment B
In this experiment, we explore the relation between the density of residents with job or capital income and the density of families living in owned residences. We expect that the greater the density of residents with job or capital income, the greater the density of resident families in owned homes. The attribute dependency explored is X_z = H(X_1), where:

• Input attribute: X_1 = Resident population with job or capital income
• Output attribute: X_z = Families in owned residences

After some tests, we put α = 0.8. Table 4 shows r′²_cl obtained for different values of s: r′²_cl = 0.881 with the FAD algorithm on the entire dataset, and r′²_cl = 0.878 with MFAD for s = 13, 16. The smallest difference between the two indices is 0.003. Figure 5 shows in abscissas the input X_1 and in ordinates the output H^F(x_1) for s = 1, 10, 16, 40.

Experiment C
In this experiment, the attribute dependency explored is X_z = H(X_1, X_2), where

Input attributes:
• X_1 = Density of residential buildings built with reinforced concrete
• X_2 = Density of residential buildings built after 2005

Output attribute:
• X_z = Density of residential buildings in a good state of conservation

After some tests, we set α = 0.75 in this experiment. In Table 5, we show r′²_cl obtained for different values of s: r′²_cl = 0.785 with the FAD algorithm on the entire dataset, and r′²_cl = 0.781 with the MFAD algorithm for s = 13, 16. The smallest difference between the two indices is 0.004. Now we present the results obtained by considering all the experiments performed on the entire dataset in which the dependency was found (r′²_cl > 0.7). We consider the index of determinacy in the FAD algorithm (s = 1) and the minimum and maximum values of the index of determinacy obtained by using the MFAD algorithm for s = 9, 10, 11, 13, 16, 20, 26, 40. A functional dependency was found in 43 experiments. Figure 6 (resp., Fig. 7) shows the trend of the difference between the maximum (resp., minimum) value of r′²_cl calculated in MFAD and the value calculated in FAD for the same experiment. In abscissae, we have r′²_cl in the FAD method; in ordinates, the difference between the two indices. For all the experiments this difference is always below 0.005 (resp., 0.0015).
These results show that the MFAD algorithm is comparable with the FAD algorithm, independently of the choice of the number of subsets partitioning the entire dataset (Fig. 7). Figure 8 shows the mean CPU time gain obtained by the MFAD algorithm with different partitions with respect to the CPU time obtained by using the FAD algorithm (s = 1). The CPU time gain is given by the difference between the CPU time measured for s = 1 and the CPU time measured for a partition into s subsets, divided by the CPU time measured for s = 1. The CPU time gain is always positive, and the greatest value is obtained for s = 16. These considerations allow us to apply the MFAD algorithm to a VL dataset not loadable entirely in memory, to which the FAD algorithm is not applicable. Now we compare the results obtained by using the MFAD method with those obtained by applying the SVR and MLP algorithms. For the comparison tests we used the machine learning tool Weka 3.8.
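The CPU time gain just defined is, in code form (a trivial sketch of the formula, with names of our own choosing):

```python
def cpu_time_gain(t_full, t_partitioned):
    """Relative CPU-time gain of an s-subset MFAD run versus the full FAD run (s = 1).

    t_full        : CPU time measured for s = 1
    t_partitioned : CPU time measured for a partition into s subsets
    """
    return (t_full - t_partitioned) / t_full
```

A gain of 0.6, for instance, means the partitioned run took 40% of the full run's CPU time.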
In order to perform the tests with the SVR algorithm, we repeat each experiment using different kernel functions: linear, polynomial, Pearson VII universal kernel, and radial basis function kernel, varying the complexity parameter C in a range between 0 and 10. To compare the performances of the SVR and MFAD algorithms, we measure and store the index of determinacy in every experiment.
In Fig. 9 we show the trend of the difference between the maximum values of r′²_cl in SVR and MFAD. Figure 9 shows that the difference between the optimal values of r′²_cl in SVR and MFAD is always under 0.02. In the comparison tests performed with the MLP algorithm, we vary the learning rate and the momentum parameter in [0.1, 1]. We use a single hidden layer, varying the number of nodes between 2 and 8. Furthermore, we set the number of epochs to 500 and the percentage size of the validation set to 0. In Fig. 10 we show the trend of the difference between the maximum value of r′²_cl in MLP and MFAD. Figure 10 shows that the difference between the maximum values of the index of determinacy in MLP and MFAD is under 0.016.
These results show that the MFAD algorithm for attribute dependency in massive datasets has performance comparable with the SVR and MLP nonlinear regression algorithms. Moreover, it has the advantage of a smaller number of parameters compared to the other two algorithms; therefore, it has greater usability and can be easily integrated into expert and intelligent systems for the analysis of dependencies between attributes in massive datasets. Indeed, the only two parameters needed for the execution of the MFAD algorithm are the number of subsets and the threshold value of the index of determinacy. These results allow us to conclude that MFAD provides acceptable performance in the detection of attribute dependencies in the presence of massive datasets. Therefore, unlike FAD, MFAD can be applied to massive data and can represent a trade-off between usability and high performance in detecting attribute dependencies.
The critical point of the algorithm is the choice of the number of subsets and of the threshold value of the index of determinacy. Further studies on massive datasets are necessary to analyze whether the optimal values of these two parameters depend on the type of dataset analyzed. Furthermore, we intend to test the MFAD algorithm in robust frameworks such as expert systems and decision support systems.
Author contributions All authors contributed to the study conception and design. All authors contributed to material preparation, data collection and analysis. All authors wrote the first draft of the manuscript and commented on previous versions of the manuscript. All authors read and approved the final manuscript. Funding Open access funding provided by Università degli Studi di Napoli Federico II within the CRUI-CARE Agreement. This research received no external funding.

Declarations
Conflict of interest The authors declare no conflict of interest.
Ethical approval This research does not contain any studies involving human participants performed by any of the authors.
Informed consent Informed consent was obtained from all individual participants included in the study.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.