1 Introduction

Data analysis and data mining knowledge discovery processes are powerful functionalities that can be combined in knowledge-based expert and intelligent systems to extract and build knowledge from data. In particular, attribute dependency analysis is necessary to reduce the dimensionality of the data and to detect hidden relations between features. Nowadays, in many application fields, data sources are massive (for example, web social data, sensor data, etc.), and knowledge extraction methods that can operate on massive data are needed. Massive (Very Large (VL) and Large (L)) datasets (Chen and Zhang 2014) are continuously produced and updated, and they cannot be managed by traditional databases. Today, access via the Web to these datasets has led to the development of technologies for managing them (cfr., e.g., (Dean 2014; Leskovec et al. 2014; Singh et al. 2015)).

We recall regression analysis (cfr., e.g., (Draper and Smith 1988; Han et al. 2012; Johnson and Wichern 1992; Jun et al. 2015; Piatetsky-Shapiro and Frawley 1991)) for estimating relationships among variables in datasets (cfr., e.g., (Lee and Yen 2004; Mitra et al. 2002; Tanaka 1987; Wood et al. 2015)) and fuzzy tools for attribute dependency (Vucetic et al. 2013; Yen and Lee 2011).

Machine learning soft computing models have been proposed in the literature to perform nonlinear regression on high-dimensional data; two well-known machine learning nonlinear regression algorithms are support vector regression (SVR) (Drucker et al. 1996) and the multilayer perceptron (MLP) (cfr., e.g., (Collobert and Bengio 2004; Cybenko 1989; Hastie et al. 2009; Haykin 1999, 2009; Murtagh 1991; Schmidhuber 2014)). The main problems of these algorithms are the complexity of the model, due to the many parameters that must be set by the user, and overfitting, a phenomenon in which the regression function fits the training data optimally but fails to predict new data. K-fold cross-validation techniques have been proposed in the literature to avoid overfitting (Anguita et al. 2005). In Thomas and Suhner (2015), a pruning method based on variance sensitivity analysis is proposed to find the optimal structure of a multilayer perceptron and mitigate overfitting. In Han and Jian (2019), a novel sparse-coding kernel algorithm is proposed to overcome overfitting in disease diagnosis.

Some authors have proposed variations of nonlinear machine learning regression models to manage massive data. In Cheng et al. (2010) and Segata and Blanzieri (2009), fast local support vector machine (SVM) methods for large datasets are presented, in which a set of multiple local SVMs for low-dimensional data is constructed. In Zheng et al. (2013), the authors propose an incremental version of the support vector machine regression model to manage large-scale data. In Peng et al. (2013), the authors propose a parallel architecture of a logistic regression model for massive data management. Recently, variations of the extreme learning machine (ELM) regression method for massive data based on the MapReduce model have been presented (Chen et al. 2017; Yao and Ge 2019).

The presence of a high number of parameters makes SVR and MLP methods too complex to be integrated as components into an intelligent or expert system. In this research, we propose a model of attribute dependency in massive datasets based on the multi-dimensional fuzzy transform. We extend the attribute dependency method presented in Martino et al. (2010a), in which the inverse multi-dimensional fuzzy transform is used as a regression function, to massive datasets. Our goal is to guarantee high performance of the proposed method in the analysis of massive data while maintaining the usability of the previous multi-dimensional fuzzy transform attribute dependency method. As in Jun et al. (2015), we use a random sampling algorithm to subdivide the dataset into subsets of equal cardinality.

The fuzzy transform (F-transform) method (Perfilieva 2006) approximates a given function by means of another function with arbitrary precision. This approach is particularly flexible in applications such as image processing (cfr., e.g., (Martino et al. 2008, 2010b, 2011b; Martino and Sessa 2007, 2012)) and data analysis (cfr., e.g., (Martino et al. 2010a, 2011a; Perfilieva et al. 2008)). In Martino et al. (2010a), an algorithm called FAD (F-transform Attribute Dependency) evaluates an attribute Xz depending on k attributes X1,…, Xk (predictors), with z ∉ {1,2,…, k}, i.e. Xz = H(X1,…, Xk), and the (unknown) function H is approximated with the inverse multi-dimensional F-transform via a procedure presented in Perfilieva et al. (2008). In Martino et al. (2010a), the error of this approximation is measured by a statistical index of determinacy (Draper and Smith 1988; Johnson and Wichern 1992); if it exceeds a prefixed threshold, the functional dependency is considered found. Each attribute Xi has an interval [ai,bi], i = 1,…, k, as domain of knowledge. Then a uniform fuzzy partition (whose definition is given in Sect. 2) of fuzzy sets \(\left\{ {A_{i1} ,A_{i2} ,...,A_{{{\text{in}}_{i} }} } \right\}\) defined on [ai,bi] is created, assuming ni ≥ 3.

The main problem in using the inverse F-transform to approximate the function H is that the data may not be sufficiently dense with respect to the fuzzy partitions. The FAD algorithm solves this problem with an iterative process shown in Sect. 3. If the data are not sufficiently dense with respect to the fuzzy partitions, the process stops; otherwise, an index of determinacy is calculated. If this index is greater than a threshold α, the functional dependency is found and the inverse F-transform is taken as the approximation of the function H; otherwise, a finer fuzzy partition is set with n := n + 1. The FAD algorithm is schematized in Fig. 1.

Fig. 1
figure 1

Flux diagram of the FAD algorithm

In this paper, we propose an extension of the FAD algorithm, called MFAD (Massive F-transform Attribute Dependency), for finding dependencies between numerical attributes in massive datasets. In other words, by using a uniform sampling method, we apply the algorithm of Martino et al. (2010a) to several sample subsets of the data and then extend the results obtained to the overall dataset with suitable mathematical devices.

Indeed, the dataset is partitioned randomly into s subsets of equal cardinality, to each of which we apply the F-transform method.

Let \(D_{l} = [a_{1l} ,b_{1l} ] \times \cdots \times [a_{kl} ,b_{kl} ]\), \(l = 1,...,s\), be the Cartesian product of the domains of the attributes X1, X2,…, Xk in the lth subset, where \(a_{il}\) and \(b_{il}\) are the minimum and maximum values of Xi in that subset. The multi-dimensional inverse F-transform \(H_{{n_{1l} n_{2l} ...n_{kl} }}^{F}\) is calculated to approximate the function H in the domain Dl, and an index of determinacy \(r_{cl}^{2}\) is calculated to evaluate the error of this approximation in Dl. For simplicity, we put n1l = n2l = \(\cdots\) = nkl = nl and thus \(H_{{n_{1l} n_{2l} \cdots n_{kl} }}^{F}\) = \(H_{{n_{l} }}^{F}\). In order to obtain the final approximation of H, we introduce weights measuring the contribution of each inverse F-transform \(H_{{n_{l} }}^{F}\) to the approximation of H: we calculate the weighted mean of \(H_{{n_{1} }}^{F}\),…, \(H_{{n_{s} }}^{F}\), using the indices of determinacy \(r_{c1}^{2}\),…, \(r_{cs}^{2}\) as weights.

The approximated value \(H^{F}\) of H at \((x_{1} ,...,x_{k} ) \in \bigcup\nolimits_{l = 1}^{s} {D_{l} }\) is given by

$$ H^{F} (x_{1} ,x_{2} ,...,\,x_{k} ) \, = \frac{{\sum\nolimits_{l = 1}^{s} {w_{l} (x_{1} ,x_{2} ,...,x_{k} ) \cdot H_{{n_{l} }}^{F} (x_{1} ,x_{2} ,...,x_{k} )} }}{{\sum\nolimits_{l = 1}^{s} {w_{l} (x_{1} ,x_{2} ,...,x_{k} )} }} \, $$
(1)

where

$$ w_{l} (x_{1} ,x_{2} ,...,x_{k} ) = \left\{ {\begin{array}{*{20}c} {r_{{cl}}^{2} } & {{\text{if }}(x_{1} ,x_{2} ,...,x_{k} ) \in {\text{ }}D_{l} } \\ 0 & {{\text{otherwise}}} \\ \end{array} } \right. $$
(2)

For example, consider two attributes X1 and X2 as inputs and suppose, for simplicity, that the dataset is partitioned into two subsets. Figure 2 shows the two rectangles D1 (red) and D2 (green). The zone labeled A of the input space is covered only by the domain D2: in this zone the weight w1 is null and \(H^{F} = H_{2}^{F}\). Conversely, in zone C the contribution of \(H_{2}^{F}\) is null and \(H^{F} = H_{1}^{F}\). In the zone labeled B, the inverse F-transforms calculated for both subsets contribute to the final evaluation of H, each with a weight corresponding to its index of determinacy.

Fig. 2
figure 2

Example of union of domains of the subsets in which the dataset is partitioned
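As a tiny numeric illustration of (1) and (2), with invented values, consider a point lying in zone B of Fig. 2, where both local inverse F-transforms contribute:

```python
# Hypothetical local predictions and indices of determinacy for the two subsets
H1, r1 = 10.0, 0.80      # inverse F-transform value and weight from subset 1
H2, r2 = 12.0, 0.60      # inverse F-transform value and weight from subset 2

# Zone B: both weights are non-null, so (1) is the r^2-weighted mean
H = (r1 * H1 + r2 * H2) / (r1 + r2)    # ≈ 10.86

# Zone A: only D2 covers the point, so w1 = 0 and (1) reduces to H2;
# symmetrically, in zone C it reduces to H1.
```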

Figure 3 shows the schema of MFAD. We apply our method to an L dataset loadable in memory, so that the method of Martino et al. (2010a) can also be applied and the results obtained by the two methods can be compared. As test dataset, we consider the last Italian census data, acquired during 2011 by ISTAT (the Italian National Statistical Institute). Section 2 recalls the F-transform in one and more variables (Perfilieva et al. 2008). In Sect. 3, the F-transform attribute dependency method is presented; Sect. 4 contains the results of our tests. Conclusions are described in Sect. 5.

Fig. 3
figure 3

Schema of the MFAD method

2 F-transforms in one and k variables

Following the definitions of Perfilieva (2006), we recall the main notations to make this paper self-contained. Let n ≥ 2 and let x1, x2,…, xn be points (nodes) of [a,b] with x1 = a < x2 < \(\cdots\) < xn = b. The fuzzy sets A1,…, An: [a,b] → [0,1] (basic functions) constitute a fuzzy partition of [a,b] if Ai(xi) = 1 for i = 1,2,…, n; Ai(x) = 0 if x \(\notin\)(xi-1,xi+1) for i = 2,…, n-1, A1(x) = 0 if x \(\notin\)[x1,x2], and An(x) = 0 if x \(\notin\)[xn-1,xn]; Ai(x) is continuous on [a,b]; Ai(x) strictly increases on [xi-1, xi] for i = 2,…, n and strictly decreases on [xi,xi+1] for i = 1,…, n-1; \( \, \sum\nolimits_{i = 1}^{n} {A_{i} (x) = 1}\) for every x \(\in\)[a,b]. The partition {A1(x),…, An(x)} is said to be uniform if n ≥ 3, xi = a + h∙(i-1), where h = (b-a)/(n-1) and i = 1, 2,…, n (equidistance); Ai(xi-x) = Ai(xi + x) for x \(\in\)[0,h] and i = 2,…, n-1; and Ai+1(x) = Ai(x-h) for x \(\in\)[xi, xi+1] and i = 1,2,…, n-1.

Suppose that the function f assumes given values at the points p1,…, pm of [a,b]. If the set P = {p1,…, pm} is sufficiently dense with respect to {A1, A2,…, An}, that is, for every i \(\in\){1,…, n} there exists an index j \(\in\){1,…, m} such that Ai(pj) > 0, then the n-tuple \( \, [F_{1} ,F_{2} ,...,F_{n} ]\) is the discrete direct F-transform of f with respect to {A1, A2,…, An}, where each Fi is given by

$$ F_{i} = \frac{{\sum\nolimits_{j = 1}^{m} {f(p_{j} )A_{i} (p_{j} )} }}{{\sum\nolimits_{j = 1}^{m} {A_{i} (p_{j} )} }} $$
(3)

for i = 1,…, n. Then we define the discrete inverse F-transform of f with respect to the basic functions {A1, A2,…, An} by setting

$$ f_{F,n} (p_{j} ) = \sum\limits_{i = 1}^{n} {F_{i} } A_{i} (p_{j} ) $$
(4)

for every j \(\in\){1,…, m}. Now we recall concepts from Perfilieva et al. (2008). The F-transform can be extended to k (≥ 2) variables by considering the Cartesian product of intervals [a1,b1] \(\times\) [a2,b2] \(\times\)  \(\cdots\,\times\) [ak,bk]. Let \(x_{11} ,x_{12} ,....,x_{{1n_{1} }}\)\(\in\)[a1,b1],…, \(x_{k1} ,x_{k2} ,....,x_{{kn_{k} }}\)\(\in\)[ak,bk] be n1 +  \(\cdots\) + nk assigned points (nodes) such that xi1 = ai < xi2 < \(\cdots\) < \(x_{{in_{i} }}\) = bi, and let {\(A_{i1} ,A_{i2} ,....,A_{{in_{i} }}\)} be a fuzzy partition of [ai,bi] for i = 1,…, k. Let the function \(f\)(x1,x2,…, xk) assume values at the m points pj = (pj1, pj2,…, pjk) \(\in\) [a1,b1] \(\times\) [a2,b2] \(\times\)\( \cdots \, \times\) [ak,bk], j = 1,…, m. The set P = {(p11, p12,…, p1k), (p21, p22,…, p2k),…, (pm1, pm2,…, pmk)} is said to be sufficiently dense with respect to \(\left\{ {A_{11} ,A_{12} ,...,A_{{1n_{1} }} } \right\}\),…, \(\left\{ {A_{k1} ,A_{k2} ,...,A_{{kn_{k} }} } \right\}\) if for every (h1,…, hk) \(\in\){1,…, n1} \(\times\) … \(\times\) {1,…, nk} there exists pj = (pj1,pj2,…, pjk)\(\in\) P with \(A_{{1h_{1} }} (p_{j1} ) \cdot A_{{2h_{2} }} (p_{j2} ) \cdot \ldots \cdot A_{kh_K} (p_{jk} ) > 0\), j \(\in\) {1,…, m}. Then we define the (h1,h2,…, hk)th component \(F_{{h_{1} h_{2} ...h_{K} }}\) of the discrete direct F-transform of f with respect to \(\left\{ {A_{11} ,A_{12} ,...,A_{{1n_{1} }} } \right\}\), …,\(\left\{ {A_{k1} ,A_{k2} ,...,A_{{kn_{k} }} } \right\}\) as

$$ F_{{h_{1} h_{2} ...h_{K} }} = \frac{{\sum\nolimits_{j = 1}^{m} {f(p_{{j1_{{}} }} ,p_{j2} ,...\,p_{{jk_{{}} }} ) \cdot A_{{1h_{1} }} (p_{j1} ) \cdot A_{{2h_{2} }} (p_{j2} ) \cdot ... \cdot A_{{kh_{K} }} (p_{jk} )} }}{{\sum\nolimits_{j = 1}^{m} {A_{{1h_{1} }} (p_{j1} ) \cdot A_{{2h_{2} }} (p_{j2} ) \cdot ... \cdot A_{{kh_{K} }} (p_{jk} )} }} $$
(5)

Thus we define the discrete inverse F-transform of f with respect to \(\left\{ {A_{11} ,A_{12} ,...,\,A_{{1n_{1} }} } \right\}\),…, \(\left\{ {A_{k1} ,A_{k2} ,...,A_{{kn_{K} }} } \right\}\) by setting for pj = (pj1, pj2,…, pjk)\(\in\) [a1,b1] \(\times\) … \(\times\) [ak,bk]:

$$ f_{{n_{1} n_{2} ...n_{K} }}^{F} (p_{j1} ,p_{j2} , \ldots ,p_{jk} ) = \sum\limits_{{h_{1} = 1}}^{{n_{1} }} {\sum\limits_{{h_{2} = 1}}^{{n_{2} }} {...\sum\limits_{{h_{K} = 1}}^{{n_{k} }} {F_{{h_{1} h_{2} ...h_{K} }} \cdot A_{{1h_{1} }} (p_{j1} ) \cdot \ldots \cdot A_{{kh_{K} }} (p_{jk} )} } } $$
(6)

for j = 1,…, m. The following Theorem holds (Perfilieva 2006):

Theorem 1

Let f(x1, x2,…, xk) be a function assigned on the set of points P = {(p11, p12,…, p1k), (p21, p22,…, p2k),…, (pm1, pm2,…, pmk)} \(\subseteq\) [a1,b1] \(\times\) [a2,b2] \(\times\) \( \cdots \, \times\) [ak,bk]. Then for every ε > 0, there exist k integers n1(ε),…, nk(ε) and related fuzzy partitions

$$ \left\{ {A_{11} ,A_{12} ,...,A_{{1n_{1} (\varepsilon )}} } \right\}, \ldots ,\left\{ {A_{k1} ,A_{k2} ,...,A_{{kn_{k} \left( \varepsilon \right)}} } \right\} $$
(7)

such that the set P is sufficiently dense with respect to the fuzzy partitions (7) and, for every pj = (pj1, pj2,…, pjk) \(\in\) P, j = 1,…, m, the following inequality holds:

$$ \left| {f(p_{j1} ,p_{j2} ,...,p_{jk} ) - f_{{n_{1} (\varepsilon )n_{2} (\varepsilon )...n_{k} (\varepsilon )}}^{F} (p_{j1} ,p_{j2} ,...,p_{jk} )} \right| < \varepsilon $$
(8)
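To make the constructions (3)–(6) concrete, the following minimal Python sketch computes the discrete direct and inverse F-transform of sampled data in k variables, using uniform triangular basic functions as one admissible choice of fuzzy partition (the raised-cosine shape used in Sect. 3 would work equally well); the function names and the density check are our own illustration, not the authors' code.

```python
import numpy as np

def triangular_partition(a, b, n):
    """Uniform fuzzy partition of [a, b] with n >= 3 triangular basic functions.
    Returns a function A(i, x) giving the membership of x in the i-th basic function."""
    nodes = np.linspace(a, b, n)
    h = (b - a) / (n - 1)
    def A(i, x):
        return np.maximum(0.0, 1.0 - np.abs(x - nodes[i]) / h)  # hat centred at nodes[i]
    return A

def direct_f_transform(points, values, partitions, ns):
    """Direct F-transform components (5) of the data (points[j], values[j]); points has shape (m, k)."""
    k = points.shape[1]
    F = np.zeros(ns)
    for h in np.ndindex(*ns):
        # weight of each point w.r.t. the multi-index (h1, ..., hk): product of memberships
        w = np.ones(len(values))
        for i in range(k):
            w *= partitions[i](h[i], points[:, i])
        if w.sum() == 0.0:   # some component receives no data
            raise ValueError("data not sufficiently dense w.r.t. the fuzzy partitions")
        F[h] = np.dot(w, values) / w.sum()
    return F

def inverse_f_transform(x, F, partitions, ns):
    """Inverse F-transform (6) evaluated at a single point x = (x1, ..., xk)."""
    out = 0.0
    for h in np.ndindex(*ns):
        w = 1.0
        for i in range(len(x)):
            w *= partitions[i](h[i], x[i])
        out += F[h] * w
    return out

# Toy usage: approximate f(x1, x2) = x1 + x2 from 500 random samples
rng = np.random.default_rng(0)
pts = rng.uniform(0.0, 1.0, size=(500, 2))
vals = pts[:, 0] + pts[:, 1]
parts = [triangular_partition(0.0, 1.0, 5), triangular_partition(0.0, 1.0, 5)]
F = direct_f_transform(pts, vals, parts, (5, 5))
print(inverse_f_transform((0.3, 0.7), F, parts, (5, 5)))   # ≈ 1.0
```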

3 Multi-dimensional algorithm for massive datasets

3.1 FAD algorithm

We schematize a dataset in tabular form as

figure a

Here X1,…, Xi,…, Xr are the involved attributes, O1,…, Oj,…, Om (m > r) are the instances, and pji is the value of the attribute Xi for the instance Oj. Each attribute Xi can be considered as a numerical variable assuming values in the domain [ai,bi], where ai = min{p1i,…, pmi} and bi = max{p1i,…, pmi}. We analyze functional dependencies between attributes in the form:

$$ X_{z} = \, H\left( {X_{1} , \ldots ,X_{k} } \right) $$
(9)

where z \(\in\){1,…, r}, k ≤ r < m, Xz ∉ {X1, X2,…, Xk}, and H: [a1,b1] \(\times\) [a2,b2] \(\times\) … \(\times\) [ak,bk] \(\to\) [az,bz] is continuous. In each [ai,bi], i = 1,2,…, k, a uniform fuzzy partition \(\left\{ {A_{i1} ,...,A_{ij} ,...,A_{in} } \right\}\), with j = 2,…, n-1 for the intermediate basic functions, is defined by:

$$ \begin{aligned} A_{{i1}} (x) & = \left\{ {\begin{array}{*{20}l} {0.5 \cdot (1 + \cos \frac{\pi }{{h_{i} }}(x - x_{{i1}} ))} & {{\text{if }}x{\text{ }} \in {\text{ }}[x_{{i1}} ,x_{{i2}} ]} \\ 0 & {{\text{otherwise}}} \\ \end{array} } \right. \\ A_{{ij}} (x) & = \left\{ {\begin{array}{*{20}l} {0.5 \cdot (1 + \cos \frac{\pi }{{h_{i} }}(x - x_{{ij}} ))} & {{\text{if }}x{\text{ }} \in {\text{ }}[x_{{i(j - 1)}} ,x_{{i(j + 1)}} ]} \\ 0 & {{\text{otherwise}}} \\ \end{array} } \right. \\ A_{{in}} (x) & = \left\{ {\begin{array}{*{20}l} {0.5 \cdot (1 + \cos \frac{\pi }{{h_{i} }}(x - x_{{in}} ))} & {{\text{if }}x{\text{ }} \in {\text{ }}[x_{{i(n - 1)}} ,x_{{in}} ]} \\ 0 & {{\text{otherwise}}} \\ \end{array} } \right. \\ \end{aligned} $$
(10)

where hi = (bi-ai)/(n-1), xij = ai + hi·(j-1).
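A direct Python transcription of the basic functions (10) is given below, under the assumption that the same number n of basic functions is used on every [ai,bi]; it is a sketch of what the BasicFunction() routine mentioned later could compute, not the authors' implementation.

```python
import math

def basic_function(x, j, a_i, b_i, n):
    """Value A_ij(x) of the j-th raised-cosine basic function (10), j = 1, ..., n,
    on the uniform partition of [a_i, b_i] with n nodes; x is assumed to lie in [a_i, b_i]."""
    h = (b_i - a_i) / (n - 1)
    x_j = a_i + h * (j - 1)          # j-th node x_ij
    if abs(x - x_j) >= h:            # outside [x_i(j-1), x_i(j+1)]
        return 0.0
    return 0.5 * (1.0 + math.cos(math.pi * (x - x_j) / h))
```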

By setting H(pj1, pj2,…, pjk) = pjz for j = 1,2,…, m, the components of the direct F-transform of H are given by

$$ F_{{h_{1} h_{2} \ldots h_{k} }} = \frac{{\sum\nolimits_{j = 1}^{m} {p_{jz} \cdot A_{{1h_{1} }} (p_{j1} ) \cdot \ldots \cdot A_{{kh_{K} }} (p_{jk} )} }}{{\sum\nolimits_{j = 1}^{m} {A_{{1h_{1} }} (p_{j1} ) \cdot \ldots \cdot A_{{kh_{K} }} (p_{jk} )} }} $$
(11)

Assuming, for simplicity, n1 = n2 = \(\cdots\) = nk = n, the inverse F-transform \(H_{n}^{F}\) is defined as

$$ H_{n}^{F} (p_{j1} ,p_{j2} ,...p_{jk} ) = \sum\limits_{{h_{1} = 1}}^{n} {\sum\limits_{{h_{2} = 1}}^{n} {...\sum\limits_{{h_{k} = 1}}^{n} {F_{{h_{1} h_{2} ...h_{K} }} \cdot A_{{1h_{1} }} (p_{j1} ) \cdot ... \cdot A_{{kh_{K} }} (p_{jk} )} } } $$
(12)

The error of the approximation is evaluated at the points pj = (pj1, pj2,…, pjk), j = 1,…, m, by using the following statistical index of determinacy (Draper and Smith 1988; Johnson and Wichern 1992):

$$ r_{c}^{2} = \frac{{\sum\nolimits_{j = 1}^{m} {\left( {H_{{n_{1} n_{2} ...n_{k} }}^{F} (p_{j1} ,p_{j2} ,...p_{jk} ) - \hat{p}_{z} } \right)^{2} } }}{{\sum\nolimits_{j = 1}^{m} {\left( {p_{jz} - \hat{p}_{z} } \right)^{2} } }} $$
(13)

where \(\hat{p}_{z}\) is the mean of the values of the attribute Xz. A value \(r_{c}^{2}\) = 0 (resp., \(r_{c}^{2}\) = 1) means that (12) does not fit (resp., fits perfectly) the data. However, we use a variation of (13) that takes into account both the number of independent variables and the size of the sample used (Martino et al. 2010a), given by

$$ {r^ \prime}_{c}^{2} = 1 - \left[ {\left( {1 - r_{c}^{2} } \right) \cdot \frac{m - 1}{{m - k - 1}}} \right] $$
(14)
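A compact sketch of (13) and (14) follows, assuming y holds the observed values pjz and y_hat the values predicted by the inverse F-transform at the same points (both as NumPy arrays):

```python
import numpy as np

def adjusted_index_of_determinacy(y, y_hat, k):
    """Index of determinacy (13) and its adjusted form (14) for k predictors."""
    m = len(y)
    r2 = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)   # (13)
    return 1.0 - (1.0 - r2) * (m - 1) / (m - k - 1)                      # (14)
```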

The pseudocode of the algorithm FAD is schematized below.

figure b

The function DirectFuzzyTransform() calculates each direct F-transform component. The function BasicFunction() calculates the value \(A_{ih_{i}} (x)\) of the hith basic function of the ith fuzzy partition for an assigned x. The function IndexOfDeterminacy() calculates the index of determinacy.

figure c
figure d
figure e
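Since the pseudocode is given only as figures, the following Python sketch (our own illustration, not the authors' code) outlines the FAD refinement loop of Fig. 1; implementations of the helper routines described above are assumed and passed in as callables.

```python
def fad(points, z_values, alpha, n_start=3,
        is_sufficiently_dense=None, direct_fuzzy_transform=None,
        inverse_f_transform=None, index_of_determinacy=None):
    """Iterative FAD loop of Fig. 1, with the helper routines of Sect. 3.1
    supplied as callables (their concrete implementations are assumed)."""
    n = n_start
    while is_sufficiently_dense(points, n):
        F = direct_fuzzy_transform(points, z_values, n)           # components (11)
        y_hat = [inverse_f_transform(p, F, n) for p in points]    # predictions (12)
        r2_adj = index_of_determinacy(z_values, y_hat)            # (13)-(14)
        if r2_adj > alpha:
            return F, n, r2_adj          # dependency found
        n += 1                           # refine the uniform fuzzy partitions
    return None                          # data no longer sufficiently dense: not found
```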

3.2 MFAD algorithm

We consider a massive dataset DT composed of r attributes X1,…, Xi,…, Xr and m instances O1,…, Oj,…, Om (m > r). We partition DT into s subsets DT1,…, DTs of the same cardinality by uniform random sampling, in such a way that each subset is loadable in memory. We apply the FAD algorithm to each subset, calculating the direct F-transform components, the inverse F-transforms \(H_{{n_{1} }}^{F}\),…, \(H_{{n_{s} }}^{F}\), the indices of determinacy \({r^ \prime}_{c1}^{2}\),…, \({r^ \prime}_{cs}^{2}\), and the domains D1,…, Ds, where \(D_{l} = [a_{1l} ,b_{1l} ] \times \cdots \times [a_{kl} ,b_{kl} ],\) l = 1,…, s. All these quantities are saved in memory. If a dependency is not found for the lth subset, the corresponding value of \({r^ \prime}_{cl}^{2}\) is set to 0. The pseudocode of MFAD is given below.

figure f
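The following Python sketch (again an illustration under our own naming, not the authors' code) summarizes the MFAD flow: a random partition into s memory-sized subsets, one FAD run per subset, and the weighted prediction of (15)–(16). It assumes a routine like the fad sketch of Sect. 3.1, here fad_fit, returning for each subset a callable local predictor and its adjusted index of determinacy, or None when no dependency is found.

```python
import numpy as np

def mfad_fit(X, y, s, alpha, fad_fit):
    """Split the data (X of shape (m, k), y of length m) into s random subsets of
    equal cardinality and run FAD on each (fad_fit is assumed to return a local
    predictor H_l and its adjusted index of determinacy, or None)."""
    idx = np.random.permutation(len(y))
    models = []
    for chunk in np.array_split(idx, s):
        Xl, yl = X[chunk], y[chunk]
        D_l = (Xl.min(axis=0), Xl.max(axis=0))        # domain D_l of the subset
        result = fad_fit(Xl, yl, alpha)
        if result is None:
            models.append((D_l, None, 0.0))           # dependency not found: weight 0
        else:
            H_l, r2_adj = result
            models.append((D_l, H_l, r2_adj))
    return models

def mfad_predict(x, models):
    """Weighted prediction (16): r'^2-weighted mean of the local inverse
    F-transforms over the domains D_l containing x."""
    x = np.asarray(x)
    num, den = 0.0, 0.0
    for (lo, hi), H_l, w in models:
        if H_l is not None and np.all(x >= lo) and np.all(x <= hi):   # weight (15)
            num += w * H_l(x)
            den += w
    return num / den if den > 0 else None             # None: x outside all domains
```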

Now we consider a point (x1,x2,…, xk) \(\in\) \(\bigcup\limits_{l = 1}^{s} {D_{l} }\). In order to approximate the function H(x1,x2,…, xk), we calculate the weights as:

$$ w_{l}^{\prime} (x_{1} ,x_{2} ,...,x_{k} ) = \left\{ {\begin{array}{*{20}l} {r_{{cl}}^{\prime2} } & {\text{if }}(x_{1} ,x_{2} ,\ldots,x_{k} ) \in {D_{l} } \\ 0 & {{\text{otherwise}}} \\ \end{array} \begin{array}{*{20}l} {l = 1,...,s} \\ \end{array} } \right. $$
(15)

If the functional dependency is found in none of the subsets, then \(w_{l}^{^{\prime}}\) = 0 for every l = 1,…, s. Otherwise, the approximated value of H(x1,x2,…, xk) is given by

$$ H^{F} (x_{1} ,x_{2} ,...,x_{k} ) \, = \frac{{\sum\nolimits_{l = 1}^{s} {w_{l}^{^{\prime}} (x_{1} ,x_{2} ,...,x_{k} ) \cdot H_{{n_{l} }}^{F} (x_{1} ,x_{2} ,...,x_{k} )} }}{{\sum\nolimits_{l = 1}^{s} {w_{l}^{^{\prime}} (x_{1} ,x_{2} ,...,x_{k} )} }} \, $$
(16)

which is taken as the estimated value of Xz. To analyze the performance of the MFAD algorithm, we execute a set of experiments on a large dataset consisting of 402,678 census tracts of the Italian regions, provided by the Italian National Statistical Institute (ISTAT) in 2011. Therein, each census tract is described by 140 numerical attributes belonging to the following categories:

  • Inhabitants,

  • Foreigner and stateless inhabitants,

  • Families,

  • Buildings,

  • Dwellings.

The FAD method is applied to the overall dataset, while the MFAD method is applied by partitioning the dataset into s subsets; we perform the tests varying the value of the parameter s and setting the threshold α = 0.7.

In addition, we compare the MFAD algorithm with the support vector regression (SVR) and multilayer perceptron (MLP) algorithms.

4 Experiments

Table 1 shows the 402,678 census tracts of Italy divided by region.

Table 1 Number of census tracts for each Italian region

Table 2 shows the approximate number of census tracts in each subset for each partition of the dataset into s subsets.

Table 2 Number of census tracts for each subset by varying s

In each experiment, we apply the MFAD algorithm to analyze the dependency of an output attribute Xz on a set of input attributes X1, X2,…, Xk. In all the experiments, we set α = 0.7 and partition the dataset randomly into s subsets. We now show the results obtained in three experiments.

4.1 Experiment A

In this experiment, we explore the relation between the density of the resident population with a laurea degree and the density of the employed resident population. Generally speaking, a higher density of population with a laurea degree should correspond to a greater density of employed population. The attribute dependency explored is Xz = H(X1), where

  • Input attribute: X1 = Resident population with laurea degree

  • Output attribute: Xz = Resident population over 15 employed

We apply the FAD algorithm to different random subsets of the dataset and calculate the index of determinacy (14). In Table 3, we show the value of the index of determinacy \({r^ \prime}_{cl}^{2} \, \) obtained for different values of s. For s = 1, we have the overall dataset.

Table 3 Index of determinacy for values of s in experiment A via FAD

The results in Table 3 show that the dependency has been found. We obtain \({r^ \prime}_{cl}^{2} \, \) = 0.760 by using the FAD algorithm on the entire dataset, while the best value of \({r^ \prime}_{cl}^{2} \, \) reached by using MFAD is 0.758, for s = 16. Hence the smallest difference between the two algorithms is 0.002. Figure 4 shows on the abscissas the input X1 and on the ordinates the output \(H^{F} {\text{(x}}_{{1}} {) }\) for s = 1, 10, 16, 40.

Fig. 4
figure 4

Tendency of Hz for dataset partitions in the experiment A

4.2 Experiment B

In this experiment, we explore the relation between the density of residents with job or capital income and the density of families living in owned homes. We expect that the greater the density of residents with job or capital income, the greater the density of resident families living in owned homes. The attribute dependency explored is Xz = H(X1), where:

  • Input attributes: X1 = Resident population with job or capital income

  • Output attribute Xz = Families in owned residences

After some tests, we put α = 0.8.

Table 4 shows \({r^ \prime}_{cl}^{2} \, \) obtained for different values of s: \({r^ \prime}_{cl}^{2} \, \) = 0.881 with the FAD algorithm on the entire dataset and \({r^ \prime}_{cl}^{2} \, \) = 0.878 with MFAD, obtained for s = 13 and s = 16. The smallest difference between the indices of determinacy is 0.003.

Table 4 Index of determinacy for values of s in experiment B via FAD

Figure 5 shows on the abscissas the input X1 and on the ordinates the output \(H^{F} {\text{(x}}_{{1}} {) }\) for s = 1, 10, 16, 40.

Fig. 5
figure 5

Trend of Hz for dataset partitions in the experiment B

4.3 Experiment C

In this experiment, the attribute dependency explored is Xz = H(X1,X2), where

Input attributes:

  • X1 = Density of residential buildings built with reinforced concrete

  • X2 = Density of residential buildings built after 2005

Output attribute:

  • Xz = Density of residential buildings with state of good conservation

After some tests, we set α = 0.75 in this experiment. In Table 5, we show \({r^ \prime}_{cl}^{2} \, \) obtained for different values of s: \({r^ \prime}_{cl}^{2} \, \) = 0.785 with the FAD algorithm on the entire dataset and \({r^ \prime}_{cl}^{2} \, \) = 0.781 with the MFAD algorithm, obtained for s = 13 and s = 16. The smallest difference between the indices of determinacy is 0.004.

Table 5 Index of determinacy for values of s in the experiment C via FAD

Now we present the results obtained by considering all the experiments performed on the entire dataset in which the dependency was found (\({r^ \prime}_{cl}^{2} \, \) > 0.7). We consider the index of determinacy obtained with the FAD algorithm (s = 1) and the minimum and maximum values of the index of determinacy obtained with the MFAD algorithm for s = 9, 10, 11, 13, 16, 20, 26, 40.

A functional dependency was found in 43 experiments. Figure 6 (resp., Fig. 7) shows the trend of the difference between the maximum (resp., minimum) value of \({r^ \prime}_{cl}^{2} \, \) calculated in MFAD and the value calculated in FAD for the same experiment. On the abscissas we have \({r^ \prime}_{cl}^{2} \, \) in the FAD method, on the ordinates the difference between the two indices. For all the experiments this difference is always below 0.005 (resp., 0.0015).

Fig. 6
figure 6

Trend of the difference between the max value \({r^ \prime}_{cl}^{2} \, \) in MFAD and FAD

These results show that the MFAD algorithm is comparable with the FAD algorithm, independently of the choice of the number of subsets partitioning the entire dataset (Fig. 7).

Fig. 7
figure 7

Trend of the difference between the min value of \({r^ \prime}_{cl}^{2} \, \) in MFAD and FAD

Figure 8 shows the mean CPU time gain obtained by the MFAD algorithm with different partitions, with respect to the CPU time obtained by using the FAD algorithm (s = 1). The CPU time gain is given by the difference between the CPU time measured for s = 1 and the CPU time measured for a partition into s subsets, divided by the CPU time measured for s = 1. The CPU time gain is always positive and the greatest value is obtained for s = 16. These considerations allow us to apply the MFAD algorithm to a VL dataset not entirely loadable in memory, to which the FAD algorithm is not applicable.

Fig. 8
figure 8

Trend of CPU time gain with respect to FAD method (s = 1)
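In formula form (our notation), denoting by T(1) the CPU time of FAD on the whole dataset and by T(s) the CPU time of MFAD with s subsets, the gain plotted in Fig. 8 is

$$ {\text{gain}}(s) = \frac{T(1) - T(s)}{T(1)} $$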

Now we compare the results obtained by using the MFAD method with the ones obtained by applying the SVR and MLP algorithms. For the comparison tests we have used the machine learning tool Weka 3.8.

In order to perform the tests with the SVR algorithm, we repeat each experiment using the following kernel functions: linear, polynomial, Pearson VII universal kernel, and radial basis function kernel, varying the complexity parameter C in a range between 0 and 10. To compare the performances of the SVR and MFAD algorithms, we measure and store the index of determinacy in every experiment.

In Fig. 9 we show the trend of the difference between the max values of \({r^ \prime}_{cl}^{2} \, \) in SVR and MFAD.

Fig. 9
figure 9

Trend of the difference between the max value of \({r^ \prime}_{cl}^{2} \, \) obtained in SVR and MFAD

Figure 9 shows that the difference between the optimal values of \({r^ \prime}_{cl}^{2} \, \) in SVR and MFAD is always under 0.02. In the comparison tests performed with the MLP algorithm, we vary the learning rate and the momentum parameter in [0.1,1]. We use a single hidden layer, varying the number of nodes between 2 and 8. Furthermore, we set the number of epochs to 500 and the percentage size of the validation set to 0.
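For readers who do not use Weka, the following is a rough scikit-learn analogue of the comparison grids described above (an assumption on our part, not the setup actually used; in particular, the Pearson VII universal kernel is not available in scikit-learn and is omitted here).

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor

# SVR grid: linear, polynomial and RBF kernels, a few values of C in (0, 10]
svr_grid = GridSearchCV(
    SVR(),
    {"kernel": ["linear", "poly", "rbf"], "C": [0.1, 1, 5, 10]},
    scoring="r2",
)

# MLP grid: one hidden layer with 2-8 nodes, a few learning-rate and momentum
# values in [0.1, 1), 500 epochs, no early-stopping validation split
mlp_grid = GridSearchCV(
    MLPRegressor(max_iter=500, early_stopping=False, solver="sgd"),
    {
        "hidden_layer_sizes": [(n,) for n in range(2, 9)],
        "learning_rate_init": [0.1, 0.5, 1.0],
        "momentum": [0.1, 0.5, 0.9],
    },
    scoring="r2",
)

# Usage with hypothetical arrays X, y:  svr_grid.fit(X, y); mlp_grid.fit(X, y)
```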

In Fig. 10 we show the trend of the difference between the max value of \({r^ \prime}_{cl}^{2} \, \) in MLP and MFAD.

Fig. 10
figure 10

Trend of the difference between the max value of \({r^ \prime}_{cl}^{2} \, \) in MLP and MFAD

Figure 10 shows that the difference between the max value of the index of determinacy in MLP and MFAD is under 0.016.

These results show that the MFAD algorithm for attribute dependency in massive datasets has performances comparable with the SVR and MLP nonlinear regression algorithms. Moreover, it has the advantage of a smaller number of parameters compared to the other two algorithms; therefore, it has greater usability and can be easily integrated into expert and intelligent systems for the analysis of dependencies between attributes in massive datasets. Indeed, the only two parameters required for the execution of the MFAD algorithm are the number of subsets and the threshold value of the index of determinacy.

5 Conclusions

The FAD method presented in Martino et al. (2010a) can be used as a regression model for finding attribute dependencies in datasets: the inverse multi-dimensional F-transform approximates the regression function. However, this method can be expensive for massive datasets and is not applicable to VL datasets that cannot be loaded in memory. We therefore propose a variation of the FAD method for massive datasets, called MFAD: the dataset is partitioned into s equally sized subsets, the FAD method is applied to each subset by calculating the corresponding inverse F-transform, and the regression function is approximated by a weighted mean in which the weights are given by the index of determinacy assigned to each subset. To test the performance of the MFAD method, we compare it with the FAD method on an L dataset of the ISTAT 2011 census data. The results show that the performances obtained by MFAD are well comparable with those of FAD. The comparison tests also show that the MFAD algorithm has performances comparable with the SVR and MLP algorithms; moreover, it has greater usability due to the lower number of parameters to be selected.

These results allow us to conclude that MFAD provides acceptable performance in detecting attribute dependencies in massive datasets. Therefore, unlike FAD, MFAD can be applied to massive data and represents a trade-off between usability and high performance in detecting attribute dependencies in massive datasets.

The critical point of the algorithm is the choice of the number of subsets and of the threshold value of the index of determinacy. Further studies on massive datasets are necessary to analyze whether the optimal values of these two parameters depend on the type of dataset analyzed. Furthermore, in future work we intend to experiment with the MFAD algorithm within robust frameworks such as expert systems and decision support systems.