A distributed, proactive intelligent scheme for securing quality in large scale data processing
Abstract
The involvement of numerous devices and data sources in the current form of the Web leads to the collection of vast volumes of data. The advent of the Internet of Things (IoT) enables devices to act autonomously and transforms them into information and knowledge producers. The vast infrastructure of the Web/IoT becomes the basis for producing data either in a structured or in an unstructured way. In this paper, we focus on a distributed scheme for securing the quality of data as collected and stored in multiple partitions. High quality is achieved through the adoption of a model that identifies any change in the accuracy of the collected data. The proposed scheme determines if the incoming data negatively affect the accuracy of the already present datasets and, when this is the case, excludes them from further processing. Our approach also identifies the appropriate partition where the incoming data should be allocated. We describe the proposed scheme and present simulation and comparison results that give insights into the pros and cons of our solution.
Keywords
Data quality · Data accuracy · Data partitions
Mathematics Subject Classification
68U99 · 94D05
1 Introduction
1.1 Motivation
With the advent of the new form of the Web and the Internet of Things (IoT), one can observe increased volumes of data produced by current applications in various domains. Numerous devices generate large scale data that demand intelligent mechanisms for their processing. Usually, data are not structured, making their processing more difficult. Unstructured data cannot be easily managed and their quality is limited. They are heterogeneous and variable in nature, coming in many formats, e.g., text, document, image, video and more [8]. Such data do not follow a predefined data model or schema.
According to Eurostat [10], data quality refers to the following aspects: (a) the characteristics of the statistical measurements on top of data; (b) the perception of the statistical measurements by the user; and (c) some characteristics of the statistical process. A metric/dimension, among others, that depicts data quality is accuracy [20]. As stated in [7], consistency and accuracy are the two central criteria for data quality. Accuracy refers to the closeness of estimates to the (unknown) exact or true values [22]. In other words, accuracy depicts the error between the observation/estimation and the real data. As accuracy refers to the closeness of data, it may also depict their ‘solidity’. In this paper, we consider that a ‘solid’ dataset will exhibit a high accuracy that is realized when the error/difference between the involved data is low. In a ‘solid’ dataset, the standard deviation of the data will be limited. Researchers have identified that accuracy is significant for the responses delivered to queries defined by users or applications. Efficient response plans, for each type of query, may be defined when the accuracy of the underlying data is secured.
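To make this notion of ‘solidity’ concrete, a minimal sketch (with illustrative values of our own) shows how the standard deviation acts as a solidity indicator: the lower the dispersion around the mean, the more ‘solid’ (accurate) the dataset is considered.

```python
import statistics

def solidity(values):
    """Standard deviation as a proxy for 'solidity': lower dispersion
    around the mean indicates a more 'solid' (accurate) dataset."""
    return statistics.pstdev(values)

solid = [10.1, 10.3, 9.9, 10.0, 10.2]   # values clustered around 10.1
loose = [2.0, 25.0, 7.5, 40.0, 13.0]    # widely dispersed values

assert solidity(solid) < solidity(loose)
```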
As noted, numerous devices produce huge volumes of data (e.g., terabytes, petabytes); thus, the usual approach is their separation into a number of partitions to facilitate their management. The separation of data could be imposed by the application domain (e.g., the application requires fast responses, thus, we may split the data to process them) or data could be reported by different streams at different locations. The number of partitions depends on the adopted separation technique (e.g., [13, 26, 28, 34, 35]) or the locations where the data are collected. The optimal partitioning of a dataset has already been investigated by the research community, delivering the optimal number of partitions when a dataset should be separated [15]. Partitions may be present in various locations, even around the world, and are stored in a set of servers. Data separation facilitates parallel processing; however, a mechanism for data coordination is imperative. This distributed nature of the described approach imposes new challenges in data processing. Multiple data partitions are available and knowledge should be derived on top of this ecosystem. Our work is motivated by a scenario where multiple streams report data to a set of partitions requiring real-time processing. Each data partition should be characterized by high accuracy (i.e., it should be a ‘solid’ dataset). Our mechanism tries to keep similar data in the same partitions to reduce the error/distance between them and, thus, increase the accuracy. Actually, ‘solid’ datasets are the target of the data separation algorithms adopted by the research community. Such algorithms aim to provide a number of small, non-overlapping datasets distributed over the nodes of the network [35]. The decision of allocating data to the available partitions is based on the statistical similarity of the incoming data with every partition.
Keeping similar data into the same partitions is a kind of proactive ‘clustering’ to create the basis for the application of Machine Learning (ML) algorithms and the production of knowledge. Our motivation is to, finally, have a view on the statistical dispersion of data that will facilitate the generation of efficient response plans for the incoming queries.
A set of research efforts focus on data quality management and have identified its necessity in any application domain. However, they seldom discuss how to effectively validate data to ensure data quality [12]. Poor data quality could increase costs and reduce the efficiency of decision making [29]. A major research question is how to integrate heterogeneous data that are stored dispersed and isolated from one another [27]. The integration of such data demands intelligent management that will extract and integrate data into a high level system. Social media pose additional challenges. Efficient models should be incorporated to manage the quality of social media data in each processing phase of the big data pipeline [17]. Such approaches should efficiently work in real time when data are received through a set of streams. If data quality is not managed, it could often result in poor decisions, with individuals bearing the greatest risk [43]. Moreover, such risks may have a negative impact on the products delivered by companies. The practical orientation of a data quality assurance mechanism should lie in the combination of offline and online methods to detect and replace doubtful data [2].
1.2 Contribution
We propose a preprocessing mechanism that, in real time, secures the quality of the available data. We depart from legacy solutions: instead of collecting huge volumes of data and postprocessing them to derive knowledge, we propose their real-time management and allocation, keeping similar data in the same partitions. We offer a distributed preprocessing scheme that decides where data should be allocated. Other solutions involve the centralized collection of vast volumes of data and, accordingly, the preparation of data before the postprocessing that produces knowledge. Outliers, missing values or any other ‘harmful’ data should be efficiently managed through ML or data mining techniques. However, applying ML and data mining to large scale data is challenging and may require increased computational resources and time. In addition, outliers affect the performance of ML algorithms. An experimental study performed in [1] shows that the error in, e.g., a classification process is reduced by more than 25% when outliers are removed from the dataset. Our scheme acts proactively and tries to keep similar data in the same dataset, preparing them beforehand to become the basis for knowledge production while devoting limited resources and time to the postprocessing phase.
Our data quality assurance mechanism decides if the incoming data could jeopardize the accuracy of the available datasets. We focus on the accuracy and not on the consistency as we want to identify and manage the error between the incoming data and the available partitions. Moreover, consistency relates to the usability of data, a subject that is beyond the scope of this paper. Accuracy may be jeopardized when the incoming data significantly differ from the stored datasets. Initially, we want to identify the discussed difference that will be the basis for deciding the rejection or the acceptance of data. If we identify a significant difference, we reject the data. If the incoming data are similar to the available datasets, we aim to select the partition where the data should be allocated to maintain the accuracy at high levels (securing the ‘solidity’ of each partition). This is because the incoming data may not be similar to the partition where they are initially reported. A solution that stores the data where they are initially reported may affect data dispersion with negative effects on the accuracy. In addition, our approach secures that datasets will have the minimum overlapping, which is also the target of legacy data separation techniques [35]. It should be noted that this paper does not aim to propose an integration model for the available data partitions. It aims at the correct allocation of the incoming data to one of the available partitions towards increasing the accuracy and, respectively, the data quality. The incoming data are organized in the form of a vector of values where each value corresponds to a specific dimension/variable (multivariate scenario). The main research questions are:

Q1. How can we maintain the accuracy of each data partition?

Q2. Taking into consideration the data and their statistics present in each partition, which partition is the appropriate one to store an incoming vector?
We consider that in each partition, a processor is devoted to performing simple processing tasks (i.e., the management of the present and incoming data). Processors adopt our distributed model that incorporates the strategy for the identification of any accuracy violation event. We propose a distributed Accuracy Maintenance Scheme (AMS) responsible for identifying violation events in any partition of the ecosystem. In cases of violation, the AMS can reject the incoming data, keeping the accuracy of the corresponding partition at high levels. It is important that the AMS identifies violations in the entire ecosystem. A probabilistic approach is adopted on top of statistical measurements derived for each partition. These measurements are exchanged between processors. If a violation is not present, the AMS decides the partition that most closely ‘matches’ the incoming data; thus, the data will be allocated there. The selection of the partition is performed by an uncertainty management scheme in terms of Fuzzy Logic (FL). We provide an FL controller responsible for deriving the appropriate partition for each vector. The controller manages/commands the selection mechanism based on an FL knowledge base.

The proposed scheme proactively ‘prepares’ the data before algorithms for knowledge extraction are applied. Hence, we save time as no data preparation is necessary after their collection. For instance, it is not necessary to apply an outlier detection algorithm on top of huge volumes of data; instead, we identify whether outliers are present when data are collected from the environment and exclude them immediately.

The proposed model proactively secures the quality of data as it excludes data that may lead to an increased error during knowledge production. For instance, as shown through experiments [1], outliers may lead to an increased error when participating in the desired processing.

Our scheme leads to the minimum overlapping of the available datasets, which is the target of the legacy data separation algorithms. However, this is realized in a preprocessing step by placing the incoming data at the appropriate partition.
2 Related work
A survey of data quality dimensions is presented in [37]; at the same time, the authors propose a framework that combines data mining and statistical techniques to extract the correlation of these dimensions. The aim is to measure the dependencies and identify methods for creating new knowledge. In [29], the authors propose the use of a model that consists of nine determinants of quality. Of them, four are related to information quality and five describe system quality. The determinants of quality are: (a) Information quality: accuracy, completeness, currency, format; (b) System quality: accessibility, reliability, response time, flexibility, integration. It should be noted that the assessment of dimensions could be task-independent or task-dependent [31]. Task-independent dimensions represent scenarios where data are evaluated without knowledge of the application domain. Hence, such metrics may be applied to any dataset. Task-dependent dimensions incorporate requirements of the organization and the application domain.
Recently, the advent of large-scale datasets has imposed more requirements on data quality assessment. Given the range of big data applications, potential consequences of bad data quality can be more disastrous and widespread [32]. In [25], the authors propose the ‘3As Data Quality-in-Use model’ composed of three data quality characteristics, i.e., contextual, operational and temporal adequacy. The proposed model could be incorporated in any large scale data framework as it is not dependent on any technology. A view on the data quality issues in big data is presented in [32]. A survey on data quality assessment methods is discussed in [6]. Apart from that, the authors present an analysis of the data characteristics in large scale data environments and describe the quality challenges. A framework dealing with hierarchical data quality assessment is also proposed. The provided framework consists of big data quality dimensions, quality characteristics, and quality indexes. The evolution of the data quality issues in large scale systems is the subject of [5]. The authors discuss various relations between data quality and multiple research requirements. Some examples are: the variety of data types, data sources and application domains, sensor networks and official statistics.
Other application-specific large scale data quality assessment methods are as follows. In [42], a high number of data quality issues are identified based on the ‘testbed’ of the Vrije Universiteit, Brussels. The use case aims to reveal data quality dimensions and prioritize cleaning tasks in different dimensions. Apart from that, another goal is to facilitate the use of dimensions by users that do not have domain knowledge. The result is to set up the basis for providing automated mechanisms for data quality evaluation on top of multiple correlated dimensions. In [40], the authors present results and practical insights from a large scale study conducted in the telecommunications industry. The case study shows how the requirements for data quality can be collected to define an architecture adopted for the quality assessment. The authors also propose a data quality approach that combines data quality and data architectures into a framework where a set of steps, processes and tools are adopted. In [3], the authors discuss data quality in health information systems. In the beginning, a review of the relevant literature is presented and, accordingly, data quality dimensions and assessment methodologies in the health domain are discussed. The results of the research show that, among a total of fifty dimensions, eleven are identified as the main dimensions. Widely adopted dimensions are: completeness, timeliness, and accuracy. The authors in [33] describe the outcomes of a Workshop titled ‘Towards a common protocol for measuring and monitoring data quality in European primary care research databases’. The aim is to point out the most significant issues that affect data quality in databases from the perspective of researchers, database experts and clinicians. Multiple ideas were exchanged on what data quality metrics should be available to researchers.
The above discussion reveals that, among the identified data quality dimensions, accuracy plays an important role. For defining accuracy in the data, multiple methods are adopted (e.g., clustering). In [18], the main scenario is a setting where multiple sensor nodes report their values. A distributed clustering algorithm is proposed that adopts the spatial data correlation among sensors. Data accuracy is assessed for each distributed cluster. The identified clusters do not overlap and have different sizes to collect the most accurate or precise data at each cluster head. The provided accuracy model is compared to results from an information accuracy model. Fuzzy logic is applied in [38]. The authors propose a distributed fuzzy clustering methodology for identifying data accuracy. The fuzzy clustering model is accompanied by a facilitator model to define a novel distributed fuzzy clustering method.
Research efforts that are close to our work are presented in [2, 17, 27]. These efforts discuss models for the management of the data, either offline or online, to secure their quality when large scale data are taken into consideration. Outliers and fault detection accompanied by autoregressive models, on top of streams, are adopted to evaluate the data quality [2]. At a high level, business decision making techniques undertake the responsibility of validating the data as they arrive [17]. The aforementioned efforts do not take into consideration the presence of multiple data partitions and their management. Multiple partitions are the subject of the research presented in [27]. However, the authors propose an integration scheme, in contrast to our work where we aim to provide means for assigning data to the underlying partitions.
3 Preliminaries
Our aim is to identify whether \({\mathbf {x}}\) deviates from the ecosystem of partitions and, if not, to ‘allocate’ \({\mathbf {x}}\) to the ‘appropriate’ partition. With the term ‘appropriate’, we denote the partition where the statistics of \(\mathbf {x_{i}}\) ‘match’ \({\mathbf {x}}\). Let us give a specific example. Suppose we have two (2) partitions available and our data vectors consist of two (2) variables (without loss of generality, we consider numeric values for both variables). In the first partition, the mean vector is \(\left\langle 0.2, \,1.2 \right\rangle \) while in the second partition the mean vector is \(\left\langle 1.8, 1.7 \right\rangle \). Initially, we receive the vector \(\langle \,4.3, 4.5 \rangle \). Based on the Mahalanobis distance, this vector may be an outlier [21]. The inclusion of this vector in either of the two datasets will ‘destroy’ the means and may lead to false knowledge extraction, as already explained. Hence, the vector is rejected by the proposed scheme as it does not fit the ecosystem of the available datasets. Suppose now we receive the vectors \(\langle 0.1, \,0.6 \rangle \) and \(\langle 1.7, 2.0 \rangle \). Neither vector is an outlier; thus, the first can be placed in the first partition while the second can be placed in the second partition, updating their statistics. It should be noted that the proposed mechanism assumes a number of vectors are already present in every dataset. A warm-up period may be adopted to initially store the first incoming vectors into the discussed datasets. At this point, we could adopt a specific strategy concerning the location where each vector will be stored, e.g., a centralized approach will decide the partition for each vector (such a centralized approach will not add a significant overhead as the warm-up period ‘delivers’ a limited number of vectors).
However, in the case where the first incoming vectors are outliers compared to the vectors that will arrive in the future, we may derive wrong statistics in our calculations. Hence, the proposed mechanism can be adopted as the second step after a preprocessing phase that refers to the collection of vectors in the warm-up period and the initial exclusion of the outliers present in the available datasets.
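The accept/reject and allocation logic of the example above can be sketched as follows. The `allocate` helper, the synthetic partitions and the cut-off of 3 (in Mahalanobis units) are our own illustrative choices, not the exact thresholds of the proposed scheme:

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Mahalanobis distance of vector x from a distribution
    summarized by its mean vector and covariance matrix."""
    d = x - mean
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

def allocate(x, partitions, threshold=3.0):
    """Reject x (return None) if it is far from every partition;
    otherwise return the index of the closest partition.
    The threshold is illustrative, not the paper's parameter."""
    dists = [mahalanobis(x, np.mean(P, axis=0), np.cov(P, rowvar=False))
             for P in partitions]
    best = int(np.argmin(dists))
    return None if dists[best] > threshold else best

# Two partitions drawn around the mean vectors of the example.
rng = np.random.default_rng(0)
P1 = rng.normal([0.2, 1.2], 0.3, size=(100, 2))
P2 = rng.normal([1.8, 1.7], 0.3, size=(100, 2))

assert allocate(np.array([4.3, 4.5]), [P1, P2]) is None  # outlier: rejected
assert allocate(np.array([0.1, 0.6]), [P1, P2]) == 0     # first partition
assert allocate(np.array([1.7, 2.0]), [P1, P2]) == 1     # second partition
```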
In the current research, we focus on a ‘strict’ mechanism that, when it identifies an increased error between the incoming data and the entire set of available partitions, decides that the data quality is jeopardized (i.e., an accuracy violation event). A model that takes into consideration the natural evolution of data in the error identification is left as future work. For simplicity, we consider that data stored in the envisioned partitions are generated by a Normal distribution. The proposed mechanism checks the statistical similarity of \({\mathbf {x}}\) with every partition and decides if \({\mathbf {x}}\) will be rejected or kept for storage in a partition. The rejection is made when \({\mathbf {x}}\) does not match the statistics of any partition in the ecosystem.
As noted, processors placed in front of each partition have the responsibility of managing the incoming vectors and performing simple statistical calculations (e.g., extraction of mean and variance). We consider that each partition has knowledge of the mean and variance vectors of the remaining partitions. The mean vector depicts the mean of each variable as realized in the corresponding dataset. A similar rationale holds for the variance vectors. The calculation is performed at predefined intervals and, accordingly, the mean and variance vectors are sent to the peer partitions. Hence, the entire set of partitions has a view on the statistics of the data stored in the entire ecosystem. Some problems may arise when specific parts of the network are unreachable. In such cases, the delay for transferring the vectors over the network will be huge, probably rendering obsolete the current view on the statistics of the ‘invisible’ partitions. To solve this problem, we could combine the proposed mechanism with a monitoring or an aging mechanism responsible for providing information on the unreachable parts of the network. The unreachable parts can be excluded from the envisioned processing, supporting an adaptive model that will exclude/include partitions based on the network’s performance. However, this aspect is beyond the scope of the current effort. In any case, the proposed model can be efficiently adopted in edge computing (EC) applications, considering nodes in close proximity with good connectivity so that they are able to exchange the discussed information. Through this approach, we aim to provide a ‘global’ outlier detection scheme that secures the accuracy of the entire ecosystem no matter the location where the incoming vectors arrive.
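A minimal sketch of such a per-partition processor follows; the class name, the shared `registry` dictionary that stands in for the network exchange, and the interval-free `broadcast` call are our own simplifications of the described behavior:

```python
import statistics

class PartitionProcessor:
    """Per-partition processor: keeps its own data, computes the mean
    and variance vectors, and 'broadcasts' them into a shared registry
    that stands in for the exchange between peer partitions."""
    def __init__(self, pid, registry):
        self.pid = pid
        self.registry = registry
        self.data = []

    def ingest(self, vector):
        self.data.append(vector)

    def broadcast(self):
        # One mean/variance entry per variable (column) of the dataset.
        cols = list(zip(*self.data))
        self.registry[self.pid] = {
            "mean": [statistics.fmean(c) for c in cols],
            "var":  [statistics.pvariance(c) for c in cols],
        }

registry = {}
p = PartitionProcessor(0, registry)
for v in [(1.0, 2.0), (3.0, 4.0)]:
    p.ingest(v)
p.broadcast()
assert registry[0]["mean"] == [2.0, 3.0]
```

In the actual scheme, `broadcast` would fire at the predefined intervals and send the vectors over the network rather than into a local dictionary.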
Instead of recording the entire set of data and, afterwards, removing the outliers and separating the data into a set of non-overlapping partitions, we provide a proactive mechanism that, in real time, decides whether the incoming data should be stored and where they should be placed. This way, we require the minimum time and a lower amount of resources for preprocessing compared to the case where we should postprocess vast volumes of data and split them into the discussed partitions.
Our AMS consists of two parts, i.e., the Accuracy Violation Detection Scheme (AVDS) and the Partition Identification Scheme (PIS). The AVDS aims to identify accuracy violation events on top of the collected mean and variance vectors. The AVDS adopts a probabilistic model that derives the probability that \({\mathbf {x}}\) is generated by each partition. The assignment of \({\mathbf {x}}\) to a partition is the PIS’s responsibility. The PIS adopts uncertainty-driven decision making in terms of FL. An FL controller receives the result of the AVDS (a set of probabilities) and the distance of \({\mathbf {x}}\) from each partition. The controller depicts the strategy of selecting partitions through the conversion of a linguistic control method on top of a set of fuzzy control rules. The ‘allocation’ decision is based on the result of the FL controller that calibrates the probability of having \({\mathbf {x}}\) generated by each partition against the maximum distance realized for every variable in the datasets.
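The two inputs later fed to the controller can be sketched as follows. Treating the variables as independent Gaussians and multiplying their densities is one plausible reading of the probabilistic model; all function names are our own:

```python
import math

def gaussian_prob(x, mean, var):
    """Density of value x under the Normal distribution N(mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def partition_probability(x, mean_vec, var_vec):
    """AVDS-style score: the probability that vector x was generated by
    a partition, assuming independent Gaussian variables (our reading)."""
    p = 1.0
    for xi, m, v in zip(x, mean_vec, var_vec):
        p *= gaussian_prob(xi, m, v)
    return p

def linf_distance(x, mean_vec):
    """L_inf distance: the maximum per-variable distance between the
    incoming vector and a partition's mean vector."""
    return max(abs(xi - m) for xi, m in zip(x, mean_vec))

x = [0.1, 0.6]
close = partition_probability(x, [0.2, 1.2], [0.5, 0.5])  # near partition
far   = partition_probability(x, [1.8, 1.7], [0.5, 0.5])  # distant partition
assert close > far
assert abs(linf_distance(x, [0.2, 1.2]) - 0.6) < 1e-12
```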
4 The distributed accuracy management scheme
4.1 The accuracy violation detection model
4.2 The selection process
For selecting the appropriate partition where \({\mathbf {x}}\) will be allocated, we propose the use of a Type-2 FL controller [44]. It should be noted that the adopted controller, accompanied by its knowledge base (in terms of rules), is triggered just after the decision for the acceptance of \({\mathbf {x}}\). It does not require any initialization process; we only feed it with the envisioned inputs. A Type-2 FL controller is an FL controller that adopts Type-2 fuzzy sets in the definition of membership functions for each input and output variable. Membership functions in a Type-2 FL controller are intervals defining the upper and the lower bounds for each fuzzy set [23] (explained later). The area between the two bounds is referred to as the Footprint of Uncertainty (FoU). The controller defines the knowledge base of the proposed scheme in the form of a set of Fuzzy Rules (FRs). Such FRs try to manage the uncertainty related to whether \({\mathbf {x}}\) closely matches each partition \(D_{i}\). FRs refer to a nonlinear mapping between two inputs, (1) \(P({\mathbf {x}}, D_{i})\) and (2) \(L_{\infty }({\mathbf {x}},D_{i})\), and a single output, i.e., \(\omega _{i}\). The antecedent part of an FR is a (fuzzy) conjunction of inputs and the consequent part is the \(\beta \) parameter indicating the belief that an event actually occurs, i.e., the belief that a specific partition should host \({\mathbf {x}}\). The proposed FRs have the following structure:
IF \(P({\mathbf {x}}, D_{i})\) is \(A_{1k}\) AND \(L_{\infty }({\mathbf {x}},D_{i})\) is \(A_{2k}\) THEN \(\omega _{i}\) is \(B_{k}\),

where \(A_{1k}, A_{2k}\) and \(B_{k}\) are membership functions for the kth FR mapping \(P({\mathbf {x}}, D_{i})\), \(L_{\infty }({\mathbf {x}},D_{i})\) and \(\omega _{i}\) (values into unity intervals), characterizing their values through the terms: low, medium, and high. \(A_{1k}, A_{2k}\) and \(B_{k}\) are represented by two membership functions corresponding to lower and upper bounds [23]. For instance, the term ‘high’, whose membership for \(P({\mathbf {x}}, D_{i})\) is a number \(z(P({\mathbf {x}}, D_{i}))\), is represented by two membership functions. Hence, \(P({\mathbf {x}}, D_{i})\) is assigned to an interval \([z_{L}(P({\mathbf {x}}, D_{i})), z_{U}(P({\mathbf {x}}, D_{i}))]\) corresponding to a lower and an upper membership function \(z_{L}\) and \(z_{U}\), respectively. The interval areas \([z_{L}(P({\mathbf {x}}, D_{i})_{j}), z_{U}(P({\mathbf {x}}, D_{i})_{j})]\) for each \(P({\mathbf {x}}, D_{i})_{j}\) reflect the uncertainty in defining a term, e.g., ‘high’, and are useful for determining the exact membership function for each term.
Without loss of generality, we assume that \(P({\mathbf {x}}, D_{i}), L_{\infty }({\mathbf {x}},D_{i}) \in [0,1]\). We also define \(\omega _{i} \in [0,1]\). An \(\omega _{i}\) close to unity denotes the case where the corresponding partition is similar to the incoming vector \({\mathbf {x}}\). The opposite holds when \(\omega _{i}\) tends to zero. The aim of the proposed mechanism is to secure that the distribution of the data will not be considerably changed. The envisioned PIS is responsible for delivering the most appropriate partition where a vector will be stored. The PIS’s inputs are related to the statistical similarity between the incoming vector and the available partitions. Hence, the proposed module decides with the goal of keeping the changes in the distribution of the data limited, as it allocates the incoming vectors to the most similar datasets. For the inputs and the output, we consider three linguistic terms: Low, Medium, and High. Low represents that a variable (input or output) is close to zero, while High depicts the case where a variable is close to one. Medium depicts the case where the variable is around 0.5. For each term, human experts define the upper and the lower membership functions. Here, we consider triangular membership functions as they are widely adopted in the literature.
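An interval Type-2 triangular membership evaluation can be sketched as follows; the triangle parameters for the term ‘Medium’ are illustrative, not the expert-elicited functions of the paper:

```python
def tri(x, a, b, c):
    """Ordinary triangular membership with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def interval_membership(x, lower, upper):
    """Interval Type-2 membership: the lower and upper triangles bound
    the Footprint of Uncertainty (FoU); membership is an interval."""
    return (tri(x, *lower), tri(x, *upper))

# 'Medium' term on [0, 1]: a narrower lower triangle nested inside
# a wider upper one (illustrative parameters).
lower_fn = (0.30, 0.50, 0.70)
upper_fn = (0.20, 0.50, 0.80)

lo, up = interval_membership(0.40, lower_fn, upper_fn)
assert lo <= up   # the FoU keeps the lower bound below the upper bound
```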
Table 1 The knowledge base of the proposed controller

Rule  \(P({\mathbf {x}}, D_{i})\)  \(L_{\infty }({\mathbf {x}},D_{i})\)  \(\omega _{i}\)
1  Low  Low  Low 
2  Low  Medium  Medium 
3  Low  High  Medium 
4  Medium  Low  Medium 
5  Medium  Medium  Low 
6  Medium  High  Medium 
7  High  Low  High 
8  High  Medium  Medium 
9  High  High  Medium 
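For illustration, the nine rules of Table 1 can be encoded as a crisp lookup table; this deliberately ignores the interval Type-2 inference machinery (fuzzification, rule firing strengths, type reduction) and only captures the rule base itself:

```python
# The nine fuzzy rules of Table 1 as a lookup:
# (P(x, D_i) term, L_inf(x, D_i) term) -> omega_i term.
RULES = {
    ("Low", "Low"): "Low",
    ("Low", "Medium"): "Medium",
    ("Low", "High"): "Medium",
    ("Medium", "Low"): "Medium",
    ("Medium", "Medium"): "Low",
    ("Medium", "High"): "Medium",
    ("High", "Low"): "High",
    ("High", "Medium"): "Medium",
    ("High", "High"): "Medium",
}

# A high generation probability combined with a low L_inf distance
# is the only case yielding a High omega_i (the best match).
assert RULES[("High", "Low")] == "High"
assert RULES[("Low", "Low")] == "Low"
```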
5 Experimental evaluation
5.1 Performance metrics and simulation setup
We report on the performance of the proposed mechanism through the adoption of a set of performance metrics. Our main target is to reveal the ‘solidity’ of the datasets as realized after the management of multiple vectors. A high ‘solidity’ depicts a low error in the data present in each partition. According to [41], erroneous data can heavily affect analysis, leading to biased inference. Hence, in this paper, we consider that the quality of the data is represented by the data accuracy, i.e., the lowest possible error among the data and, thus, the lowest possible error between the incoming vectors and the available partitions. In our paper, accuracy is defined by the closeness to the already present values. In our evaluation plan, for identifying the quality of the data stored in each partition, we rely on statistical metrics like the standard deviation. The standard deviation is a measure of the dispersion of a dataset around its mean (i.e., the error between the data as depicted by the mean). A low value of the standard deviation shows that data are around their mean; thus, a high ‘solidity’ is identified.
We adopt two types of datasets, i.e., a synthetic and a real one. The synthetic trace is generated through the adoption of the Gaussian distribution. We simulate the production of 10,000 vectors and randomly select in the interval [1, N] the initial partition where each vector is reported. For each variable in the incoming vector, we generate a random value in the interval [0, 100]. The real dataset comes from the National Agency for New Technologies, Energy and Sustainable Economic Development.^{2} The dataset contains 9358 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device. Data were recorded from March 2004 to February 2005 (1 year), representing the longest freely available recordings of field-deployed air quality chemical sensor responses. From this dataset, we adopt the measurements of four parameters, i.e., hourly averaged CO concentration, temperature in Celsius, relative humidity and absolute humidity. We consider that each parameter corresponds to a partition where the collected values should be finally allocated. As the initial partitions are only four, we replicate them to deliver a dataset that contains more partitions. In each round of the performed simulations, we randomly select (1) a trace from the available ones; (2) a random row in this trace, and we reason about the acceptance and the allocation of the selected row.
The initialization of the proposed scheme refers to the separation of the adopted traces into a number of datasets/partitions. As far as the FL controller is concerned, we define the membership functions and the FL rules adopting the Juzzy Fuzzy Logic toolkit [45], as depicted in Table 1. We perform an extensive set of simulations for \(N \in \left\{ 5, 10, 50, 100 \right\} \) and report on our results. We simulate the exchange of the mean and variance vectors through the adoption of the \(\phi \) parameter. We consider that \(\phi \in \left\{ 0.2, 0.8 \right\} \). Mean and variance vectors are exchanged in the network with probability \(\phi \). When \(\phi = 0.2\), we simulate a low exchange rate, i.e., the mean and variance vectors are not frequently exchanged between the partitions. It becomes obvious that a low exchange rate limits the number of messages in the network; however, a low \(\phi \) may have consequences for the statistical measurements performed by the proposed mechanism. When \(\phi = 0.8\), the exchange rate is high and it leads to an increased number of messages in the network.
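The effect of \(\phi \) on network traffic can be sketched by counting messages; the one-message-per-peer model and the round structure are our own simplifications of the simulated exchange:

```python
import random

def simulate_exchanges(rounds, n_partitions, phi, seed=42):
    """Each interval, every partition broadcasts its mean/variance
    vectors to its peers with probability phi; here we only count the
    resulting messages, since the vectors do not affect the rate."""
    rng = random.Random(seed)
    messages = 0
    for _ in range(rounds):
        for _ in range(n_partitions):
            if rng.random() < phi:
                messages += n_partitions - 1  # one message per peer
    return messages

low  = simulate_exchanges(1000, 10, 0.2)   # infrequent exchange
high = simulate_exchanges(1000, 10, 0.8)   # frequent exchange
assert low < high   # a low phi keeps the network traffic down
```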
5.2 Performance assessment
We report on the comparison between the proposed model and an outlier detection scheme. As we focus on the performance of the ‘outlier detection aspect’ of the proposed model (the first part of our scheme), we rely on a single variable for the evaluation. We compare our model with the CumSum algorithm [30], which is widely adopted for outlier detection. We focus on the synthetic trace and perform experiments for \(N \in \left\{ 10, 50, 100 \right\} \). Recall that the synthetic trace involves 10,000 vectors with random values for each variable. Our scheme identifies 9405, 1526 and 430 outliers for \(N \in \left\{ 10, 50, 100 \right\} \), respectively, while CumSum identifies 9932, 9924 and 9908 outliers for the same realizations of N. The CumSum algorithm exhibits worse performance than the proposed model (it characterizes the majority of data as outliers) as it relies on the difference of the incoming vectors from the distribution of the mean values as calculated in each partition. The adoption of a dataset that corresponds to a very dynamic environment (as depicted by the synthetic trace), where values change continuously, negatively affects the performance of the CumSum. The proposed model is positively affected by the increased N: the number of vectors characterized as outliers is limited (compared to the total number of vectors) when \(N \rightarrow 100\).
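For reference, a minimal two-sided version of the CUSUM test of Page [30], of the kind adopted as the baseline here, can be sketched as below. The allowance k and the decision threshold h are hypothetical defaults, since the text does not report the exact parameterization used in the experiments.

```python
def cusum_outliers(stream, mu, k=0.5, h=5.0):
    """Two-sided CUSUM (Page, 1954): accumulate positive and negative
    deviations from the target mean `mu`, minus an allowance `k`; flag the
    index and restart whenever either accumulator exceeds the threshold `h`."""
    s_hi = s_lo = 0.0
    flagged = []
    for i, x in enumerate(stream):
        s_hi = max(0.0, s_hi + (x - mu) - k)
        s_lo = max(0.0, s_lo - (x - mu) - k)
        if s_hi > h or s_lo > h:
            flagged.append(i)
            s_hi = s_lo = 0.0  # restart the accumulators after an alarm
    return flagged
```

On a trace whose values change continuously around a moving mean, the accumulated deviations cross the threshold almost constantly, which is consistent with CumSum flagging the vast majority of the synthetic vectors above.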
Results when communication problems are present
N     \(\lambda =10\%\)              \(\lambda =50\%\)
      \(\gamma \)     \(\delta \)      \(\gamma \)     \(\delta \)
10    88.14    71.70      90.27    75.29
50    97.94    61.67      97.95    68.97
100   99.00    70.41      99.21    73.07
In order to compare our model with a centralized approach, we perform a set of experiments to deliver the time required for concluding the decision process for a single vector. Recall that the proposed model manages the incoming vectors at the time they are reported in a node. A centralized approach should wait to collect all the reported vectors, identify and eliminate the outliers and, finally, separate the data into a number of partitions. For centralized approaches, we focus on the sum of the time required for the last two steps (outlier detection and data separation). It becomes obvious that our model is not affected by the time required for the first step, as the incoming vectors are processed immediately. Starting with the data separation techniques, in [13] the authors compare three schemes. The first scheme is proposed by the authors themselves: it firstly partitions the data along the feature space and applies a parallel block coordinate descent algorithm for distributed computation; it then partitions the data along the sample space and adopts a novel matrix decomposition and combination approach for distributed processing. The authors evaluate the model on four datasets; two of them are adopted for classification purposes (D1, D2) and two for regression (D3, D4). The average time requirements (we provide the results depicting the time required per vector) are: (1) for D1: 1.61 s; (2) for D2: 0.0059 s; (3) for D3: 0.00059 s; (4) for D4: 0.00047 s. In addition, the authors provide results for two more models, i.e., the Parallelizing Support Vector Machines on Distributed Computers (PSVM) model [46] and the Consensus-Based Distributed Support Vector Machines (CBDSVM) model [11]. The results for the PSVM are: (1) for D1: 0.0046 s; (2) for D2: 0.0139 s; (3) for D3: 0.0051 s; (4) for D4: 0.0044 s.
The CBDSVM results are as follows: (1) for D1: 0.00064 s; (2) for D2: 0.0162 s; (3) for D3: 0.0027 s; (4) for D4: 0.0025 s. To the time required for data separation, we should add the time devoted to the preprocessing and outlier detection steps. In [16], the authors provide a comparison of various outlier detection techniques. In these results, the average required time per vector fluctuates from 0.000023 to 0.00109 s. In our model, the total processing time per vector is 0.0036 s for the synthetic trace and 0.000075 s for the real trace. We observe that, when we adopt the real trace, the proposed model outperforms all the aforementioned techniques.
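The per-vector timing figures above can, in principle, be obtained with a simple wall-clock measurement around the decision routine. The sketch below assumes a generic process_fn standing in for the model's per-vector processing; it is an illustration of the measurement methodology, not the original benchmarking code.

```python
import time

def mean_time_per_vector(process_fn, vectors):
    """Average wall-clock processing time per incoming vector: run the
    decision routine over the whole trace and divide the elapsed time by
    the number of vectors, mirroring the per-vector figures reported."""
    start = time.perf_counter()
    for v in vectors:
        process_fn(v)
    elapsed = time.perf_counter() - start
    return elapsed / len(vectors)
```

Averaging over the full trace smooths out per-call timer noise, which matters at the sub-millisecond scales reported here.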
6 Conclusions and future work
Current ICT applications involve huge volumes of data produced by numerous devices. Data are reported through streams and stored in multiple locations. To facilitate parallel management, a number of data partitions are available where processing tasks are realized. Our effort aims to secure the quality of data stored in each partition through the management of data accuracy. We propose the use of a probabilistic and an uncertainty management mechanism that decides if the incoming data ‘fit’ the available partitions. The uncertainty is managed through a fuzzy logic controller that derives the final decision related to the allocation of data. The proposed mechanism checks the similarity of the incoming data with the available partitions and, if the accuracy is secured, it decides the appropriate partition where the data will be allocated. Our mechanism is distributed; thus, the incoming data may be transferred to the correct partition. The proposed approach is characterized by simplicity while being capable of identifying the correct partition for placing the data. The aim is to reduce the dispersion of the data and, thus, to increase the accuracy and data quality. Our future research plans involve the definition of an intelligent scheme for selecting the most significant variables in the incoming data to reduce the dimensions and enhance the efficiency of the approach. In addition, the cost of the allocations will be studied, especially when data should be transferred to a partition other than the one where they are initially reported. Our plans also involve the study of the trade-off between the quality of data and the cost of the allocation in a remote partition.
Acknowledgements
This work is funded by the EU/H2020 Marie Skłodowska-Curie Action Individual Fellowship (MSCA-IF-2016) under the Project INNOVATE (Grant No. 745829).
References
 1. Acuna E, Rodriguez C (2005) An empirical study of the effect of outliers on the misclassification error rate. Trans Knowl Data Eng 17:1–21
 2. Alferes J, Poirier P, Lamaire-Chad C, Sharma AK, Mikkelsen PS, Vanrolleghem PA (2013) Data quality assurance in monitoring of wastewater quality: univariate on-line and off-line methods. In: Proceedings of the 11th IWA conference on instrumentation control and automation, pp 18–20
 3. Alipour J, Ahmadi M (2017) Dimensions and assessment methods of data quality in health information systems. Acta Med Mediterr 33:313–320
 4. Arapoglou R, Kolomvatsos K, Hadjiefthymiades S (2010) Buyer agent decision process based on automatic fuzzy rules generation methods. In: Proceedings of the 2010 IEEE world congress on computational intelligence (WCCI 2010), FUZZ-IEEE, July 18th–23rd, Barcelona, pp 856–863
 5. Batini C, Rula A, Scannapieco M, Viscusi G (2015) From data quality to big data quality. J Database Manag 26(1):60–82
 6. Cai L, Zhu Y (2015) The challenges of data quality and data quality assessment in the big data era. Data Sci J 14(2):1–10
 7. Cong G, Fan W, Geerts F, Jia X, Ma S (2007) Improving data quality: consistency and accuracy. In: Proceedings of the VLDB, Vienna, Austria, pp 1–12
 8. Das TK, Kumar PM (2013) Big data analytics: a framework for unstructured data analysis. Int J Eng Technol 5(1):153–156
 9. Do CB, Batzoglou S (2008) What is the expectation maximization algorithm? Nat Biotechnol 26(8):897–899
 10. Eurostat (2007) Handbook on data quality assessment methods and tools. European Commission, Luxembourg
 11. Forero P, Cano A, Giannakis G (2010) Consensus-based distributed support vector machines. JMLR 11:1663–1707
 12. Gao J, Xie C, Tao C (2016) Big data validation and quality assurance: issues, challenges and needs. In: Proceedings of the IEEE symposium on service-oriented system engineering (SOSE). https://doi.org/10.1109/SOSE.2016.63
 13. Guo H, Zhang J (2016) A distributed and scalable machine learning approach for big data. In: Proceedings of the 25th international joint conference on artificial intelligence, New York
 14. Han J, Kamber M, Pei J (2012) Data mining: concepts and techniques, 3rd edn. Morgan Kaufmann Publishers, Burlington
 15. Halkidi M, Vazirgiannis M (2001) Clustering validity assessment: finding the optimal partitioning of a dataset. In: Proceedings of the IEEE international conference on data mining, San Jose, USA
 16. Hasani Z (2017) Robust anomaly detection algorithms for real-time big data: comparison of algorithms. In: Proceedings of the 6th Mediterranean conference on embedded computing (MECO)
 17. Immonen A, Paakkonen P, Ovaska E (2015) Evaluating the quality of social media data in big data architecture. IEEE Access 3:2028–2043
 18. Karjee J, Jamadagni HS (2011) Data accuracy model for distributed clustering algorithm based on spatial data correlation in wireless sensor networks. Networking and internet architecture. arXiv:1108.2644
 19. Last M, Kandel A (2001) Automated detection of outliers in real-world data. In: Proceedings of the 2nd international conference on intelligent technologies
 20. Loshin D (2011) Monitoring data quality performance using data quality metrics. Informatica, The Data Integration Company, white paper
 21. Majewska J (2015) Identification of multivariate outliers: problems and challenges of visualization methods. Studia Ekonomiczne, Zeszyty Naukowe, Uniwersytet Ekonomiczny w Katowicach, No 247
 22. Management Group on Statistical Cooperation (2014) Report of the sixteenth meeting. European Commission, Eurostat, Doc. MGSC/2014/14
 23. Mendel JM (2007) Type-2 fuzzy sets and systems: an overview. IEEE Comput Intell Mag 2(2):20–29
 24. Mendel JM (2001) Uncertain rule-based fuzzy logic systems: introduction and new directions. Prentice-Hall, Upper Saddle River
 25. Merino J, Caballero I, Rivas B, Serrano M, Piattini M (2016) A data quality in use model for big data. Future Gener Comput Syst 63:123–130
 26. Mishra S, Suman AC (2016) An efficient method of partitioning high volumes of multidimensional data for parallel clustering algorithms. Int J Eng Res Appl 6(8):67–71
 27. Mohammed AO, Talab SA (2015) Enhanced extraction clinical data technique to improve data quality in clinical data warehouse. Int J Database Theory Appl 8(3):333–342
 28. Navathe S, Ceri S, Wiederhold G, Dou J (1984) Vertical partitioning of algorithms for database design. ACM Trans Database Syst 9:680–710
 29. Nelson RR, Todd PA, Wixom BH (2005) Antecedents of information and system quality: an empirical examination within the context of data warehousing. J Manag Inf Syst 21(4):199–235
 30. Page ES (1954) Continuous inspection schemes. Biometrika 41(1/2):100–115
 31. Pipino LL, Lee YW, Wang RY (2002) Data quality assessment. Commun ACM 45(4):211–218
 32. Rao D, Gudivada VN, Raghavan VV (2015) Data quality issues in big data. In: Proceedings of the IEEE international conference on big data, Santa Clara, CA, USA
 33. Rosemary Tate A, Kalra D, Boggon R, Beloff N, Puri S, Williams T (2014) Data quality in European primary care research databases: report of a workshop held in London, September 2013. In: Proceedings of the IEEE-EMBS international conference on biomedical and health informatics (BHI), Valencia, Spain
 34. Sacca D, Wiederhold G (1985) Database partitioning in a cluster of processors. ACM Trans Database Syst 10:29–56
 35. Salloum S, He Y, Huang JZ, Zhang X, Emara T (2018) A random sample partition data model for big data analysis. arXiv:1712.04146
 36. Shi W, Cao J, Zhang Q, Li Y, Xu L (2016) Edge computing: vision and challenges. IEEE Internet Things 3(5):637–646
 37. Sidi F, Panahy PHS, Affendey LS, Jabar MA, Ibrahim H, Mustapha A (2012) Data quality: a survey of data quality dimensions. In: Proceedings of the international conference on information retrieval and knowledge management (CAMP), Kuala Lumpur, Malaysia, pp 300–304
 38. Son LH (2015) DPFCM: a novel distributed picture fuzzy clustering method on picture fuzzy sets. Expert Syst Appl 42(1):51–66
 39. Truong H, Karan M (2018) Analytics of performance and data quality for mobile edge cloud applications. In: Proceedings of the IEEE international conference on cloud computing, workshop: cloud and the edge, San Francisco, USA
 40. Umar A, Karabatis G, Ness L, Horowitz B, Elmagardmid A (1999) Enterprise data quality. Inf Syst Front 1(3):279–301
 41. Urbano F, Basille M, Cagnacci F (2014) Data quality: detection and management of outliers. In: Spatial database for GPS wildlife tracking data: a practical guide to creating a data management system with PostgreSQL/PostGIS and R. Springer
 42. Van den Berghe S, Van Gaeveren K (2017) Data quality assessment and improvement: a Vrije Universiteit Brussel case study. Procedia Comput Sci 106:32–38
 43. Wigan MR, Clarke R (2013) Big data's big unintended consequences. IEEE Comput 46(6):46–53
 44. Wu D (2012) On the fundamental differences between interval type-2 and type-1 fuzzy logic controllers. IEEE Trans Fuzzy Syst 20(5):832–848
 45. Wagner C (2013) Juzzy: a Java based toolkit for type-2 fuzzy logic. In: Proceedings of the IEEE symposium on advances in type-2 fuzzy logic systems, Singapore
 46. Zhu K, Wang H, Bai H, Li J, Qiu Z, Cui H, Chang E (2008) Parallelizing support vector machines on distributed computers. In: Proceedings of NIPS 20
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.