Similarity-aware data aggregation using fuzzy c-means approach for wireless sensor networks
For resource-constrained IoT systems, data collection is one of the fundamental operations to reduce the energy dissipation of sensor nodes and improve the network lifetime. However, an anomaly or deviation will exert a great influence on the quality of data collected, especially for a data aggregation scheme. By taking into account data-aware clustering and detection of anomalous events, a similarity-aware data aggregation using a fuzzy c-means approach for wireless sensor networks is proposed. Firstly, by using a fuzzy c-means approach, the clustering process can be performed to organize sensors into clusters based on data similarity. Next, an effective support degree function is defined for further outlier diagnosis. Afterwards, the appropriate weight of valid data can be obtained by taking advantage of the probability distribution characteristics of normal samples within a certain period. Finally, the aggregation result in the cluster can be estimated. Practical database-based simulations have confirmed that the proposed data aggregation method can achieve better performance than traditional methods in terms of data outlier detection accuracy and relative recovery error.
Keywords: Fuzzy c-means, Data similarity, Aggregation, Wireless sensor networks
Abbreviations
QoS: Quality of service
RRE: Relative recovery error
SDAF: Similarity-aware data aggregation using fuzzy c-means approach
WSNs: Wireless sensor networks
Wireless sensor networks (WSNs) are typically composed of many small, low-cost sensor nodes with resource constraints such as low memory capacity, limited computational capability, low communication bandwidth, and limited power. This type of network is characterized by low cost, wide distribution, small volume, and flexible self-organization. With its rapid development, it has been successfully applied in the consumer electronics market and is increasingly used in target tracking, intelligent transportation, health prognosis, industrial automation, and other fields. However, due to the imperfect nature of WSNs, sensor nodes need to be deployed densely to compensate for the quality of the data collected [2, 3]. For process-monitoring applications, high-frequency sensing and transmission of readings produce a large number of redundant samples, which wastes node energy and bandwidth and shortens the network lifetime. Therefore, exploiting the spatiotemporal correlation of readings between sensor nodes and developing efficient data redundancy reduction to save sensor energy are urgent problems.
Data aggregation is an effective method for solving the above problems. The basic idea is to aggregate the samples of multiple sensors that share a certain degree of redundancy rather than transmit raw data. Some nodes act as aggregators that eliminate redundant data received from other sensor nodes while preserving the desired data accuracy. In practical applications, monitoring indicators such as temperature, humidity, flow rate, or pressure change smoothly and steadily in the majority of cases. Once a sudden event occurs, the surrounding sensor nodes are generally able to detect the situation and obtain the readings synchronously. Consequently, samples from individual nodes that deviate strongly from the rest may have a large impact on the overall fusion result and degrade the quality of the data collected. In this paper, we focus on the spatio-temporal correlation of readings in cluster-based WSNs. In particular, we address data-aware clustering and the detection of anomalous events, and we use a fuzzy c-means approach to organize sensors into clusters based on data similarity.
This study originates from the need to detect spatial outliers by exploiting the spatial correlations among neighboring sensor readings, which yields more accurate fusion results. Our approach uses the spatial-temporal correlations of sensor samples to detect outliers locally. The main contributions are as follows:
- We propose a novel similarity-aware data aggregation method using a fuzzy c-means approach for wireless sensor networks.
- We provide a theoretical analysis to optimize cluster formation.
- We conduct extensive simulations to demonstrate the performance of the algorithms. Simulation results show that the proposed method achieves better performance than traditional methods in terms of data outlier detection accuracy and relative recovery error.
3 Related work
Traditional data aggregation methods can be classified into two major categories: random theory-based and artificial intelligence-based approaches [7, 8]. The former includes the weighted average method, the least squares method, Bayesian estimation, D-S evidence theory, and so on. The latter uses artificial neural networks, fuzzy reasoning, or rough sets to eliminate anomalous data.
Izadi et al. presented a fuzzy-based data fusion approach for WSNs to mitigate redundant data and reduce energy consumption. The authors used a fuzzy logic controller to obtain a confidence factor, after which the true value is distinguished and transmitted to the cluster head (CH) for multi-sensor data fusion. Fu proposed a double-CH model for secure and accurate data fusion, in which each cluster maintains dual CHs according to a reputation evaluation. All CHs perform data fusion and transmit the results to the base station (BS), which computes a dissimilarity coefficient from the fusion results. If the dissimilarity coefficient exceeds the threshold, the CH is put on a blacklist and CH selection is rotated immediately. Xiang et al. proposed a data aggregation method based on compressive sensing theory. In particular, they adopted diffusion wavelets to make the raw sensor data sparse, reducing both the communication overhead and the computational complexity.
Furthermore, several strategies have been proposed to mitigate the energy hole problem. Sun et al. proposed a data aggregation method for wireless sensor networks using artificial neural networks. A data fusion tree is established to reduce packet flow, and its leaf nodes can be updated dynamically. Aikaraki et al. introduced a joint design of data aggregation and routing, and presented a grid-based routing and aggregator selection scheme to achieve low energy dissipation and low latency without sacrificing quality. By investigating data fusion under communication constraints between the fusion center and each sensor, Xu et al. presented a data fusion mechanism for target tracking in wireless sensor networks based on quantized innovations and Kalman filtering. By adding some delay, all the data collected by a relay node can be fused at one time so as to reduce energy consumption. Aiming to ensure data quality, Li et al. proposed various quality of service (QoS) metrics for the data aggregation process, including lifetime, data delay, and retransmission rate, and discussed in detail how to guarantee these metrics.
Moreover, data outliers have a significant impact on the correctness of data fusion results and the efficiency of IoT systems. To ensure correct fusion results, outliers caused by software defects, occasional communication failures, low battery, or hardware malfunction should be excluded from the aggregation. In practice, most monitored targets and external events occur randomly and unexpectedly, so readings that are outliers because of anomalous events should be identified exactly. Krishnamachari et al. proposed a distributed algorithm for fault-tolerant event region detection in wireless sensor networks, which can determine whether a node is abnormal. In addition, by exploiting the anomaly probability from adjacent nodes, messages of only a few bits are sufficient to achieve fault-tolerant localization when events occur. Tan et al. presented a prediction model of data flow based on linear autoregressive analysis and further proposed a real-time detection algorithm for outlier identification and compression processing. Fernandes et al. proposed an autonomous profile-based anomaly detection system using principal component analysis and flow analysis to mitigate the impact of false data injection. By making inferences from end-to-end measurements collected by relay nodes, Zheng et al. proposed a trust-assisted framework for detecting and localizing network anomalies in a hierarchical sensor network, which also offers a flexible tradeoff between inference accuracy and probing overhead. Hu et al. presented outlier detection methods based on a neural network for WSNs, which use historical data to train the neural network and then check whether the actual measured value falls into the prediction interval in order to identify data anomalies.
4 Network model and cluster formation
4.1 Network model
We consider a cluster-based architecture for a wireless sensor network, where all sensor nodes monitor the given condition and periodically send their collected data to their CH. Most studies show that clustering is an efficient topology control method in WSNs for improving the scalability and lifetime of the whole system. By dividing the network, sensor nodes are grouped into different clusters based on certain rules, and each cluster has a cluster head. The CH is responsible for managing the cluster and receiving the set of collected data from its member nodes during a certain period. Also, to improve the efficiency of data fusion, the CH should be able to perform statistical detection based on the sensor readings. It can detect spatial outliers that deviate from normal data, thus ensuring the accuracy of data fusion.
1. At each period, the sensor nodes acquire the monitoring readings at a fixed sampling rate with m measures.
2. The original attribute information collected from the sensor nodes can be fuzzified into a set of membership functions.
3. In each cluster, the member nodes collect data in a periodic manner. Subsequently, all member nodes will send their data to the appropriate CH for data aggregation at the end of a round.
4.2 Cluster formation algorithm
In this section, we discuss cluster formation based on data similarity using the fuzzy c-means approach. Compared with other topologies, a cluster-based network topology is considered more effective for aggregating data packets separately. In addition, most existing clustering-based data aggregation techniques are dedicated to an event-driven data model. Many hierarchical cluster formation algorithms focus on the distance between nodes, residual energy, geographic coverage, and so forth. In contrast, the main purpose of our proposed method is to clean and improve the collected data and provide the best information to end users. From a statistical point of view, the data sensed in the same time slot exhibit spatial-temporal correlation within an adjacent monitoring region. If the monitored indicators of the physical objects in the region do not fluctuate greatly, the data collected by sensor nodes in close geographical proximity will deviate only minimally. Therefore, the cluster formation algorithm can exploit the spatial-temporal correlation of the environmental data and partition adjacent sensor nodes with similar data instances into one cluster, separate from the nodes in other clusters.
The fuzzy c-means (FCM) algorithm was proposed by Bezdek and has been used in cluster analysis, pattern recognition, image processing, and so forth. FCM is an unsupervised clustering method that uses fuzzy theory to divide a set of data points into a set of fuzzy clusters according to certain partitioning criteria. Consider a WSN that consists of N sensor nodes randomly distributed over an area of S × S meters. Using the sensors' respective geographical locations and their initially collected data, the BS computes the cluster centers and allocates sensor nodes to the clusters by applying the FCM algorithm.
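The clustering step run at the BS can be illustrated with a minimal sketch of the standard FCM iteration (alternating center and membership updates). The feature layout `[x, y, reading]`, the fuzzifier `m = 2`, the stopping tolerance, and the function names `fuzzy_c_means` and `hard_assignment` are assumptions made for illustration; the text does not specify the exact feature vector or parameters used by the authors.

```python
import math
import random


def fuzzy_c_means(points, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Standard fuzzy c-means; returns (centers, memberships).

    points: list of equal-length feature vectors, e.g. [x, y, reading]
            (this layout is an assumption, not taken from the paper).
    c:      number of clusters.
    m:      fuzzifier, must be > 1 (m = 2 is an assumed default).
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])

    # Random initial memberships; each row is normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    for _ in range(max_iter):
        # Center update: mean of the points weighted by u_ij ** m.
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            sw = sum(w)
            centers.append(
                [sum(w[i] * points[i][d] for i in range(n)) / sw for d in range(dim)]
            )

        # Membership update from the relative distances to every center.
        shift = 0.0
        for i in range(n):
            dist = [max(math.dist(points[i], centers[j]), 1e-12) for j in range(c)]
            new_row = [
                1.0 / sum((dist[j] / dist[k]) ** (2.0 / (m - 1.0)) for k in range(c))
                for j in range(c)
            ]
            shift = max(shift, max(abs(a - b) for a, b in zip(new_row, u[i])))
            u[i] = new_row
        if shift < tol:  # memberships have stabilised
            break
    return centers, u


def hard_assignment(u):
    """Assign each node to the cluster in which its membership is highest."""
    return [max(range(len(row)), key=row.__getitem__) for row in u]
```

In practice the position and reading features would usually be scaled to comparable ranges before clustering, since the Euclidean distance otherwise lets the larger-valued features dominate.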
5 Data aggregation
5.1 Data outlier detection
Outliers, often referred to as anomalies or deviations, can even mislead systems into unsafe conditions. The reliability and accuracy of the data collected by WSNs influence the performance of the whole system. Therefore, data outliers should be detected and isolated in time to ensure the validity of the aggregation result and the efficiency of fusion. In a clustered WSN, it is impossible for the CH to determine directly the validity of the data sent by its members. However, the relationship between the readings of sensor nodes within a certain physical range or cluster can be an effective means of identifying outliers through credibility tests across the group. In this sub-section, an effective support degree function is introduced for further outlier diagnosis; it is based on a standard statistical distribution model and makes use of the measures between neighboring nodes.
As mentioned above, due to the spatial-temporal correlation in the adjacent monitoring region, the measurements of nodes si and sj in the same sampling period will show relatively small differences. Hence, the support degree from node sj to si can be expressed as the consistency between the samples Xi and Xj.
Since Ti depends on the amount of support received from the other member nodes or the nearest local neighbors, it indicates how normal a reading is compared with the majority of sensor readings. When a sensor sends abnormal data due to noise errors or malicious attacks, its readings will clearly deviate from the measures of the other sensors, and its comprehensive support will therefore be very small. The exception is the case in which a large proportion of the intra-cluster nodes fail simultaneously, but the probability of that case is very low and can be neglected.
Suppose that the comprehensive support of data Xi is Ti. If Ti ≥ ζ, Xi is determined to be normal data; otherwise, it is regarded as an outlier. Here, the parameter ζ is the availability threshold. When Ti is less than ζ, the corresponding reading Xi is processed to mitigate its influence on the aggregation result.
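The thresholding rule can be sketched as follows. The support function itself is not reproduced in the text above, so this sketch assumes a Gaussian-shaped pairwise support, exp(-(Xi - Xj)^2 / (2σ^2)), and takes Ti as the mean support a reading receives from the other cluster members. The names `support_matrix`, `comprehensive_support`, and `split_by_threshold`, and the default values of `sigma` and `zeta`, are illustrative assumptions rather than the authors' definitions.

```python
import math


def support_matrix(readings, sigma=1.0):
    """Pairwise support between readings.

    Assumed form: Gaussian in the difference |x_i - x_j|, so identical
    readings support each other with value 1 and distant ones with ~0.
    """
    n = len(readings)
    return [
        [math.exp(-((readings[i] - readings[j]) ** 2) / (2.0 * sigma ** 2))
         for j in range(n)]
        for i in range(n)
    ]


def comprehensive_support(readings, sigma=1.0):
    """T_i: mean support reading i receives from the other n - 1 nodes (n >= 2)."""
    r = support_matrix(readings, sigma)
    n = len(readings)
    return [sum(r[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)]


def split_by_threshold(readings, zeta=0.5, sigma=1.0):
    """Return (normal_indices, outlier_indices) using the rule T_i >= zeta."""
    t = comprehensive_support(readings, sigma)
    normal = [i for i, ti in enumerate(t) if ti >= zeta]
    outliers = [i for i, ti in enumerate(t) if ti < zeta]
    return normal, outliers
```

With this form, `sigma` controls how quickly support decays with the difference between two readings, so it should be chosen relative to the expected measurement noise, and `zeta` trades detection accuracy against false alarms.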
5.2 Data aggregation strategy
In this section, we present the data aggregation strategy used to ensure the accuracy of the aggregation result. Before the aggregation process, all data collected from member nodes are sent to the cluster head, which conducts outlier detection in a centralized manner. If the received data are valid, they are passed to the data aggregation process; otherwise, they are rejected immediately. Two types of memory buffers can therefore be embedded in the CH, with corresponding parameters a and b counting the number of normal and outlier data uploaded by each member node. Under certain conditions, the probability distribution of normal and outlier data approximates the posterior probability distribution of a binomial model, which obeys the beta distribution. Therefore, the characteristics of the beta distribution can be employed to evaluate data validity.
The revised mathematical expectation can be defined as the weight value in the process of data aggregation, and wi = Ei(χ) is allocated to each member node.
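As a hedged illustration of the weighting step, the sketch below uses the mean of a Beta(a + 1, b + 1) distribution, (a + 1) / (a + b + 2), as the weight of a node with a normal and b outlier uploads. This corresponds to a uniform prior and is an assumption; the exact "revised" expectation used by the authors is not given in the text above. The function names `beta_weight` and `aggregate` and the dictionary-based inputs are likewise hypothetical.

```python
def beta_weight(a, b):
    """Mean of Beta(a + 1, b + 1).

    a: number of normal readings uploaded by the node so far.
    b: number of outlier readings uploaded by the node so far.
    Assumes a uniform prior; this is a stand-in for the paper's
    revised expectation, which is not reproduced here.
    """
    return (a + 1.0) / (a + b + 2.0)


def aggregate(readings, counts, valid):
    """Weighted mean of the readings that passed the outlier test.

    readings: {node_id: value} for the current round.
    counts:   {node_id: (a, b)} history counters kept by the CH.
    valid:    iterable of node ids whose current reading is normal.
    """
    valid = list(valid)
    if not valid:
        raise ValueError("no valid readings to aggregate")
    weights = {i: beta_weight(*counts[i]) for i in valid}
    total = sum(weights.values())
    return sum(weights[i] * readings[i] for i in valid) / total
```

A node with a long history of normal data (large a, small b) receives a weight close to 1, while a node that frequently uploads outliers contributes less to the cluster estimate even when its current reading passes the test.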
In this section, practical database-based simulations are conducted to evaluate the performance of our method. The datasets are derived from real sensed data collected from 54 Mica2Dot sensors deployed in the Intel Berkeley Research Lab between February 28 and April 5, 2004. The sensed data include humidity, temperature, light, and voltage values. The full dataset contains about 2.3 million readings and was collected using the TinyDB in-network query processing system, built on the TinyOS platform. In the experiments, we selected temperature measurements from sensor nodes 36 to 43 for the period from March 18, 2004, to March 20, 2004, corresponding to 2000 log rows. We do not take the other features (humidity, light, and voltage) into account. Based on this dataset, we add a given quantity of outliers to simulate the occurrence of events, which makes the data fluctuate to a certain extent.
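The data preparation described above can be sketched as follows. The sketch assumes the whitespace-separated row layout commonly documented for the Intel lab data (date, time, epoch, mote id, temperature, humidity, light, voltage), and it assumes outliers are injected independently per reading with a fixed probability and a fixed offset. The function names `select_temperature` and `inject_outliers`, the `offset` value, and the injection model are all illustrative assumptions and not the authors' procedure.

```python
import random
from datetime import date


def select_temperature(lines, motes=range(36, 44),
                       start=date(2004, 3, 18), end=date(2004, 3, 20)):
    """Keep (date, time, mote, temperature) for the chosen motes and days.

    Assumed row layout: date time epoch moteid temperature humidity light voltage.
    Rows that are incomplete or cannot be parsed are skipped.
    """
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) < 8:
            continue
        try:
            day = date.fromisoformat(parts[0])
            mote = int(parts[3])
            temperature = float(parts[4])
        except ValueError:
            continue
        if mote in motes and start <= day <= end:
            rows.append((parts[0], parts[1], mote, temperature))
    return rows


def inject_outliers(values, probability, offset=15.0, seed=0):
    """Shift each value by +/- offset with the given probability.

    Returns (new_values, labels), where labels[i] is True for injected outliers.
    """
    rng = random.Random(seed)
    new_values, labels = [], []
    for v in values:
        is_outlier = rng.random() < probability
        new_values.append(v + rng.choice((-offset, offset)) if is_outlier else v)
        labels.append(is_outlier)
    return new_values, labels
```

The returned labels give the ground truth needed later to score an outlier detector against the injected events.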
Simulation parameters: 500 × 500 m; number of sensor nodes N; node's communication range; normal path loss model; data packet size; 200 kilobytes per second; CC2420 radio layer; 5%, 10%, 15%.
In the experimental scenarios, outliers are simulated randomly: 100 temperature values are generated and added to the dataset. For the evaluation metrics, the detection accuracy rate is defined as the ratio of detected outliers to all outliers, and the false alarm rate is the ratio of normal data mistakenly detected as outliers.
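The two metrics can be computed as in the following minimal sketch. It assumes that the false alarm rate is normalised by the total number of normal readings, which the definition above implies but does not state explicitly; the function name `detection_metrics` and the boolean-list inputs are illustrative.

```python
def detection_metrics(true_outlier, flagged):
    """Return (detection_accuracy_rate, false_alarm_rate).

    true_outlier: list of booleans, True where a reading really is an outlier.
    flagged:      list of booleans, True where the detector reported an outlier.
    Both lists must have the same length and order.
    """
    if len(true_outlier) != len(flagged):
        raise ValueError("inputs must have the same length")
    detected = sum(1 for t, f in zip(true_outlier, flagged) if t and f)
    false_alarms = sum(1 for t, f in zip(true_outlier, flagged) if not t and f)
    n_outliers = sum(1 for t in true_outlier if t)
    n_normal = len(true_outlier) - n_outliers
    # Guard against empty classes so the function never divides by zero.
    accuracy = detected / n_outliers if n_outliers else 0.0
    false_alarm = false_alarms / n_normal if n_normal else 0.0
    return accuracy, false_alarm
```

Reporting both values together matters because a detector can trivially raise its detection accuracy by flagging more readings, at the cost of a higher false alarm rate.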
The results clearly show that applying the support degree in this way is effective, and that the method remains adaptable across different outlier probabilities. The experimental results show that the RRE curves of both the KPFF and DSADC algorithms fluctuate dramatically. Nevertheless, nearly 90% of the RRE values of similarity-aware data aggregation using fuzzy c-means approach (SDAF) are below those of DSADC. The error of the fusion results obtained by SDAF is smaller than that of the other methods, especially as the outlier probability increases. In the process of data aggregation, outlier samples can be identified effectively by the diagnosis mechanism in SDAF, and the outlier-free readings are further aggregated and transmitted to the CH. Therefore, SDAF reduces the effect of data outliers on the aggregation result and avoids misleading systems into unsafe conditions.
To minimize the energy consumed by redundant data and reduce the cost of transmissions to the sink, data aggregation technology is essential for WSNs. Data anomalies or deviations exert a great influence on the quality of aggregated results. In this paper, we have proposed a similarity-aware data aggregation method using a fuzzy c-means approach in clustered WSNs. By investigating the spatio-temporal correlations of sensor data and the local detection of anomalous events, we presented a cluster formation algorithm based on the fuzzy c-means approach. Then, we defined an effective support degree function for further outlier diagnosis. Finally, based on a statistical analysis of the outlier and outlier-free sensor data, the readings are aggregated. Overall, the simulation results show that the proposed method achieves better performance than traditional methods in terms of data outlier detection accuracy and relative recovery error.
In future work, we plan to study outlier detection with respect to characteristics such as multi-dimensionality, detection mode, architectural structure, and correlation extraction.
The authors thank the anonymous reviewers and editors for their valuable comments and suggestions.
This research was supported in part by the Hubei Provincial Educational Science Program (Grant No. 2018GB073) and the Guangxi Nature Science Fund (Grant No. 2016GXNSFAA380226).
Availability of data and materials
Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.
WR proposed the innovative ideas and theoretical analysis, and XN carried out the experiments and data analysis. HQ also wrote parts of the manuscript. SJ and WH participated in the coordination of the study and reviewed the manuscript. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
- 12. L.Y. Sun, X.X. Huang, W. Cai, Data aggregation of wireless sensor networks using artificial neural networks. Chinese Journal of Sensors and Actuators. 24(1), 122–127 (2011).
- 15. H. Li, H.Y. Yu, Research on data aggregation supporting QoS in wireless sensor networks. Application Research of Computers. 25(1), 64–67 (2008).
- 17. Y.H. Tan, Y.P. Lin, T. Dong, Real-time detection algorithm for anomaly data in sensor networks. Journal of System Simulation. 19(18), 4335–4341 (2007).
- 19. S. Zheng, J.S. Baras, in 8th IEEE Communications Society Conference on Sensor, Mesh and ad hoc Communications and Networks (SECON). Trust-assisted anomaly detection and localization in wireless sensor networks (2011), pp. 386–394.
- 20. S. Hu, G.H. Li, W.W. Lu, Outlier detection methods based on neural network in wireless sensor networks. Computer Science. 41(11), 208–211 (2014).
- 24. X. Wang, Q. Li, N. Xiong, Y. Pan, in International Conference on Wireless Algorithms, Systems, and Applications (WASA 2018). Ant colony optimization-based location-aware routing for wireless sensor networks (2018), pp. 109–120.
- 25. F. Herrera, Genetic fuzzy systems: status, critical considerations and future directions. Int. J. Comput. Intell. Res. 5, 59–67 (2005).
- 26. N. Goyal, M. Dave, A.K. Verma, in Int. Conf. Electron. Commun. Syst. (ICECS). Fuzzy based clustering and aggregation technique for under water wireless sensor networks (2014), pp. 1–5.
- 30. Intel lab data home page. http://db.lcs.mit.edu/labdata/labdata.html. March 20, 2014.
- 31. P. Levis, N. Lee, M. Welsh, D. Culler, in Proc. of the 1st International Conference on Embedded Networked Sensor Systems. TOSSIM: accurate and scalable simulation of entire TinyOS applications (ACM Digital Library, Los Angeles, California, 2003), pp. 126–137.
- 32. H. Harb, A. Makhoul, D. Laiymani, in Proc. of the 10th IEEE International Conference on Wireless and Mobile Computing, Networking and Communications (WiMob). K-means based clustering approach for data aggregation in periodic sensor networks (2014), pp. 434–441.
- 34. Y. Sang, H. Shen, Y. Tan, N. Xiong, in Proc. of International Conference on Information and Communications Security. Efficient protocols for privacy preserving matching against distributed datasets (2006), pp. 210–227.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.