1 Introduction

Wireless sensor networks (WSNs) (IanF et al. 2016) consist of many low-power wireless sensor nodes with limited storage and computing power. These sensor nodes sense and collect useful information in the network. WSNs are now frequently applied in habitat monitoring (Smita and Mrinal 2019), intelligent spaces, medical systems (Muhammad et al. 2020), the smart grid (He et al. 2017) and robotic exploration. Sensor nodes in WSNs are vulnerable to capture, physical tampering, denial of service and other attacks, which poses a series of challenges for foundational research. During data collection, sensor nodes may generate redundant data, and transmitting it consumes extra energy. Data aggregation (Sabrina et al. 2018), a crucial technology in WSNs, is widely used to reduce this energy consumption. Aggregators compute the sum, average, minimum and maximum of the values from their child sensor nodes and send the aggregated result to a higher-level aggregator. By removing redundancy and synthesizing information, aggregation decreases network traffic and energy consumption. However, while sensor nodes communicate with each other during aggregation, anyone with a suitable wireless receiver can detect and intercept their messages. An attacker may use illegal means to communicate with powerful workstations (Yang et al. 2020; Wang et al. 2020a, b, c, d). Illegal interactions and information theft can cause severe harm to the network and can even paralyze it entirely.

In recent years, research on data aggregation and privacy protection has fallen into three categories: hop-by-hop encryption, end-to-end encryption and non-encryption mechanisms. He et al. (2007) proposed the CPDA and SMART privacy protection algorithms based on the hop-by-hop encryption mechanism, which utilize the TAG tree model (Madden et al. 2002) to aggregate the data from sensor nodes. Yao and Wn (2008) proposed the DADPP algorithm to meet the needs of different privacy levels and to protect privacy while obtaining accurate aggregation results. Yang et al. (2008) introduced the SDAP algorithm, which utilizes a probabilistic grouping technique to effectively verify the correctness of aggregated data. Feng et al. (2008) utilized interference numbers to protect the privacy of node data and reduce its communication overhead. He et al. (2009) proposed the iCPDA protocol to address the data integrity issue; it adds data integrity protection while inheriting the data privacy protection capability of the CPDA algorithm. Ozdemir and Yang (2008) proposed the IPHCDA protocol, which provides data privacy and integrity protection by employing a homomorphic encryption algorithm based on elliptic curve encryption together with a MAC mechanism. Guo (2012) improved the CPDA scheme and reduced its computational and communication costs. Based on the iPDA and CPDA algorithms, Bista et al. (2012) proposed the DCIDA algorithm, which introduces complex numbers and utilizes the real part to protect data privacy and the imaginary part to verify data integrity. Guo and Ding (2014) proposed the ILCCPDA algorithm, which reduces data transmission by utilizing the LEACH protocol and simple aggregation methods and checks data integrity by adding a homomorphic message authentication code. More recently, further privacy protection schemes for WSNs (Liu et al. 2020a, b; He et al. 2019) have been put forward. Zhang et al. (2019) proposed an energy-efficient and reliable in-network data aggregation scheme for WSNs. Man et al. (2017) proposed energy-efficient cluster-based privacy data aggregation (E-CPDA), and Fang et al. (2017) proposed a novel energy-efficient secure data aggregation scheme, cluster-based private data aggregation (CSDA). Other studies have proposed data aggregation protocols for the smart grid (Afshin et al. 2019; Kong et al. 2020; Liu et al. 2018), and secure multi-party computation (Wang et al. 2020a, b, c, d) may also be applied to data aggregation in the future. These works reduce traffic, but they need further improvement to meet higher privacy requirements.

In this work, we focus on the data traffic and privacy issues in CPDA and propose a secure and efficient data aggregation privacy protection algorithm (SECPDA). The SECPDA algorithm utilizes the SEP protocol to select cluster head nodes dynamically and utilizes false messages to enhance its privacy protection capability. Experimental evaluation and performance analysis show that the proposed SECPDA algorithm offers lower data traffic and stronger security and privacy protection than the compared algorithms.

The contributions of this article are as follows:

  1. We adopt the SEP protocol to dynamically select cluster head nodes and merge them into the simple addition cluster to reduce the communication overhead of data, and propose a secure and efficient data aggregation privacy protection algorithm (SECPDA).

  2. We aggregate data by means of data slicing and node false messages to further meet the privacy needs of the communication process.

The rest of the paper is structured as follows: Sect. 2 describes the model and background, including the sensor network model, encryption method and clustering method. Section 3 presents the SECPDA algorithm, including the formation of the cluster, the aggregation process and the SECPDA algorithm flow. Section 4 describes the simulation results and performance analysis, including the simulation of the clustering process of sensor nodes, privacy performance analysis and data traffic analysis. Section 5 summarizes the paper and outlines future research.

2 Model and background

2.1 Sensor network model

Generally, the nodes of a sensor network are divided into three categories: base station, cluster head nodes and sensor nodes (He et al. 2007). A sensor node uploads its data to the cluster head node; the cluster head node aggregates the received data with its own data and finally uploads the aggregated result to the base station. The sensor network model is shown in Fig. 1.

Fig. 1 Sensor network model

2.2 Encryption method

The encryption method adopted by SECPDA is the same as that adopted by CPDA: a random key distribution scheme (Laurent and Virgil 2002). First, a key pool containing \(K\) keys is generated, each key having its own identity. Each node then randomly selects \(k\) keys from the key pool and stores them. Each node broadcasts the identities of its keys; if a neighbor node holds a key in common with it, the two nodes share a secure link. Therefore, the probability that any two nodes share a secure link is given by Formula (1):

$$ P_{connect} = 1 - \frac{{((K - k)!)^{2} }}{(K - 2k)!K!}. $$
(1)

If two nodes have no key in common, intermediate nodes form a secure path between them over multiple hops. In this process, the probability that the shared key is overheard by an attacker is given by Formula (2):

$$ P_{overhear} = k{/}K. $$
(2)

Here, \(K\) is the total number of keys in the key pool, \(k\) represents the number of keys randomly selected for each node.
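As a rough numerical check of Formulas (1) and (2), the following Python sketch evaluates both probabilities. It rewrites the factorial ratio in Formula (1) as a ratio of binomial coefficients, \(\binom{K-k}{k} / \binom{K}{k}\), which is algebraically identical. The function names and the sample values of \(K\) and \(k\) are illustrative assumptions and do not come from the paper.

```python
import math


def p_connect(K: int, k: int) -> float:
    """Formula (1): probability that two nodes share at least one key."""
    if not 0 <= k <= K:
        raise ValueError("require 0 <= k <= K")
    # ((K-k)!)^2 / ((K-2k)! K!) equals C(K-k, k) / C(K, k).
    # math.comb returns 0 when k > K-k, so the result is 1 if 2k > K.
    return 1.0 - math.comb(K - k, k) / math.comb(K, k)


def p_overhear(K: int, k: int) -> float:
    """Formula (2): probability that a shared key is overheard."""
    return k / K


# Hypothetical pool size and key-ring size, chosen only for illustration.
K, k = 1000, 50
print(f"P_connect  = {p_connect(K, k):.4f}")
print(f"P_overhear = {p_overhear(K, k):.4f}")
```

Increasing \(k\) raises the chance that two neighbors can set up a secure link, but it also raises \(P_{overhear}\), so the key-ring size is a trade-off between connectivity and exposure.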

2.3 Clustering method

The SEP protocol (Smaragdakis et al. 2004) is utilized for clustering in the proposed SECPDA algorithm. SEP is an improved clustering protocol based on the LEACH protocol (Heinzelman et al. 2000). In SEP, sensor nodes are divided into two categories according to their initial energy: advanced nodes and normal nodes. Different thresholds are set for the two categories so that advanced nodes are selected as cluster heads more frequently. Cluster construction and data transmission repeat in rounds, and a certain proportion of the nodes are advanced nodes. If the proportion of advanced nodes is \(a\), a network with \(n\) nodes contains \(na\) advanced nodes and \((1 - a)n\) normal nodes. An advanced node is assumed to start with \(b\) times more energy than a normal node, that is, \((1 + b)\) times the normal initial energy, which is consistent with the weights in Formulas (3) and (4). In each round, every node generates a random number \(r\) in the range from 0 to 1. If the threshold \(T(n)\) is greater than this random number, the node is selected as a cluster head, and the other nodes join the corresponding cluster according to signal strength, which completes the cluster construction. The probabilities that an advanced node and a normal node are selected as cluster head are \(p_{1}\) and \(p_{2}\), respectively, as given in Formulas (3) and (4):

$$ p_{1} = \frac{p}{1 + ba}(1 + b) $$
(3)
$$ p_{2} = \frac{p}{1 + ba}. $$
(4)

Here, \(p\) represents the proportion of cluster heads among the nodes, and the thresholds for advanced nodes and normal nodes are given in Formulas (5) and (6):

$$ T(n_{1} ) = \left\{ {\begin{array}{*{20}l} {\frac{{p_{1} }}{{1 - p_{1} \left[ {r\,{\text{mod }}\left( {\frac{1}{{p_{1} }}} \right)} \right]}}} \hfill & {{\text{if}}\,n \, \in \, G_{1} } \hfill \\ 0 \hfill & {{\text{otherwise}}} \hfill \\ \end{array} } \right. $$
(5)
$$ T(n_{2} ) = \left\{ {\begin{array}{*{20}l} {\frac{{p_{2} }}{{1 - p_{2} \left[ {r{\text{ mod }}\left( {\frac{1}{{p_{2} }}} \right)} \right]}} \, } \hfill & {{\text{if}}\,n \, \in \, G_{2} } \hfill \\ 0 \hfill & {{\text{otherwise}}} \hfill \\ \end{array} } \right.. $$
(6)

Here, \(r\) is the current round number (not the random number drawn by each node), and \(G_{1}\) and \(G_{2}\) are the sets of advanced and normal nodes, respectively, that have not been elected cluster head in recent rounds. In the transmission phase, each node sends its collected data to the cluster head, which aggregates the data of all nodes and then sends the aggregated result to the sink node. After a period of stable operation, the network proceeds to the next round of election.
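The election rule in Formulas (3)–(6) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the node representation (a dictionary with `id`, `advanced` and `eligible` fields), the function names and the way eligibility is tracked are all assumptions made for the example. The variable `round_no` plays the role of \(r\) in Formulas (5) and (6), and `eligible` stands for membership in \(G_{1}\) or \(G_{2}\).

```python
import random


def sep_probabilities(p: float, a: float, b: float) -> tuple[float, float]:
    """Formulas (3) and (4): election probabilities (advanced, normal)."""
    p_adv = p * (1 + b) / (1 + a * b)
    p_nrm = p / (1 + a * b)
    return p_adv, p_nrm


def threshold(p_i: float, round_no: int, eligible: bool) -> float:
    """Formulas (5) and (6): threshold for one node in the given round."""
    if not eligible:  # node is not in G_1 / G_2
        return 0.0
    return p_i / (1 - p_i * (round_no % (1.0 / p_i)))


def elect_cluster_heads(nodes, p, a, b, round_no, rng):
    """Return the ids of the nodes whose random draw falls below T."""
    p_adv, p_nrm = sep_probabilities(p, a, b)
    heads = []
    for node in nodes:
        p_i = p_adv if node["advanced"] else p_nrm
        if rng.random() < threshold(p_i, round_no, node["eligible"]):
            heads.append(node["id"])
    return heads


# Hypothetical 10-node network with one advanced node (a = 0.1, b = 1).
nodes = [{"id": i, "advanced": i == 0, "eligible": True} for i in range(10)]
print(elect_cluster_heads(nodes, p=0.1, a=0.1, b=1.0, round_no=0,
                          rng=random.Random(42)))
```

Note that \(a p_{1} + (1 - a) p_{2} = p\), so weighting the two node classes differently leaves the expected fraction of cluster heads per round unchanged.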

3 Secure and efficient privacy-preserving data aggregation algorithm SECPDA

The symbols used in this section are described in Table 1.

Table 1 Symbol description

3.1 The formation of the cluster

We utilize the SEP protocol to select the cluster head nodes. Suppose the network contains 1 base station node and 10 sensor nodes. When the base station sends a data request, each node sends its address to the base station, and the base station numbers the nodes 1–10. Cluster head nodes are selected according to the thresholds for advanced and normal nodes, and the other nodes decide which cluster to join based on signal strength. The specific process is shown in Fig. 2.

Fig. 2 The formation of the cluster

3.2 Aggregation process

Suppose a cluster has three nodes \(A\), \(B\) and \(C\), where \(A\) is the cluster head node, and let \(a\), \(b\) and \(c\) be the private data of the respective nodes. Node \(A\) divides \(a\) into \(a_{1}\), \(a_{2}\) and \(a_{3}\) such that \(a = a_{1} + a_{2} + a_{3}\). While slicing, node \(A\) also generates false information: \(a_{1}\) corresponds to \(a_{1}^{\prime }\), \(a_{2}\) to \(a_{2}^{\prime }\), and \(a_{3}\) to \(a_{3}^{\prime }\). The privacy slice \(a_{1}\) is retained by node \(A\).

Node \(B\) divides \(b\) into \(b_{1}\), \(b_{2}\) and \(b_{3}\) such that \(b = b_{1} + b_{2} + b_{3}\). While slicing, node \(B\) also generates false information: \(b_{1}\) corresponds to \(b_{1}^{\prime }\), \(b_{2}\) to \(b_{2}^{\prime }\), and \(b_{3}\) to \(b_{3}^{\prime }\). Node \(B\) sends the privacy slice \(b_{1}\) and the false information \(b_{1}^{\prime }\), \(b_{2}^{\prime }\) and \(b_{3}^{\prime }\) to the cluster head node \(A\). Node \(C\) proceeds in the same way as node \(B\).

Nodes \(A\), \(B\) and \(C\) encrypt the privacy slice information and the corresponding false information, then send them to other nodes randomly.

Specifically, node \(A\) encrypts \(a_{2}\) and \(a_{2}^{\prime }\) and sends them to node \(B\), and encrypts \(a_{3}\) and \(a_{3}^{\prime }\) and sends them to node \(C\), as shown in Formula (7):

$$ \left\{ {\begin{array}{*{20}l} {Enc(a_{2} ,K_{AB} ),Enc(a_{2}^{\prime } ,K_{AB} );} \hfill \\ {Enc(a_{3} ,K_{AC} ),Enc(a_{3}^{\prime } ,K_{AC} );} \hfill \\ \end{array} } \right.. $$
(7)

Similarly, the node \(B\) encrypts \(b_{2}\) and \(b_{2}^{\prime }\), then sends them to the node \(A\), encrypts \(b_{3}\) and \(b_{3}^{\prime }\), then sends them to the node \(C\). The process is demonstrated in Formula (8):

$$ \left\{ {\begin{array}{*{20}l} {Enc(b_{2} ,K_{BA} ),Enc(b_{2}^{\prime } ,K_{BA} );} \hfill \\ {Enc(b_{3} ,K_{BC} ),Enc(b_{3}^{\prime } ,K_{BC} );} \hfill \\ \end{array} } \right.. $$
(8)

Similarly, the node \(C\) encrypts \(c_{2}\) and \(c_{2}^{\prime }\), then sends them to the node \(A\), encrypts \(c_{3}\) and \(c_{3}^{\prime }\), then sends them to the node \(B\). The process is demonstrated in Formula (9):

$$ \left\{ {\begin{array}{*{20}l} {Enc(c_{2} ,K_{CA} ),Enc(c_{2}^{\prime } ,K_{CA} );} \hfill \\ {Enc(c_{3} ,K_{CB} ),Enc(c_{3}^{\prime } ,K_{CB} );} \hfill \\ \end{array} } \right.. $$
(9)

Nodes \(A\), \(B\) and \(C\) now decrypt the received data with the shared secret keys and compute \(F_{A}\), \(F_{B}\) and \(F_{C}\), as shown in Formulas (10)–(12):

$$ \left\{ {\begin{array}{*{20}l} {Dnc(b_{2} ,K_{BA} ),Dnc(b_{2}^{\prime } ,K_{BA} );} \hfill \\ {Dnc(c_{2} ,K_{CA} ),Dnc(c_{2}^{\prime } ,K_{CA} );} \hfill \\ {F_{A} = b_{2} + b_{2}^{\prime } + c_{2} + c_{2}^{\prime } ;} \hfill \\ \end{array} } \right. $$
(10)
$$ \left\{ {\begin{array}{*{20}l} {Dnc(a_{2} ,K_{AB} ),Dnc(a_{2}^{\prime } ,K_{AB} );} \hfill \\ {Dnc(c_{3} ,K_{CB} ),Dnc(c_{3}^{\prime } ,K_{CB} );} \hfill \\ {F_{B} = a_{2} + a_{2}^{\prime } + c_{3} + c_{3}^{\prime } ;} \hfill \\ \end{array} } \right. $$
(11)
$$ \left\{ {\begin{array}{*{20}l} {Dnc(a_{3} ,K_{AC} ),Dnc(a_{3}^{\prime } ,K_{AC} );} \hfill \\ {Dnc(b_{3} ,K_{BC} ),Dnc(b_{3}^{\prime } ,K_{BC} );} \hfill \\ {F_{C} = a_{3} + a_{3}^{\prime } + b_{3} + b_{3}^{\prime } ;} \hfill \\ \end{array} } \right.. $$
(12)

Nodes \(B\) and \(C\) send \(F_{B}\) and \(F_{C}\) to node \(A\). Here, \(F_{B}\) comprises \(a_{2}\), \(a_{2}^{\prime }\), \(c_{3}\) and \(c_{3}^{\prime }\); similarly, \(F_{C}\) comprises \(a_{3}\), \(a_{3}^{\prime }\), \(b_{3}\) and \(b_{3}^{\prime }\), and \(F_{A}\) comprises \(b_{2}\), \(b_{2}^{\prime }\), \(c_{2}\) and \(c_{2}^{\prime }\). In the initial stage, nodes \(B\) and \(C\) sent their first slices and all of their false information to the cluster head node \(A\), so \(A\) knows \(b_{1}\), \(c_{1}\), \(a_{1}^{\prime }\), \(a_{2}^{\prime }\), \(a_{3}^{\prime }\), \(b_{1}^{\prime }\), \(b_{2}^{\prime }\), \(b_{3}^{\prime }\), \(c_{1}^{\prime }\), \(c_{2}^{\prime }\) and \(c_{3}^{\prime }\), as well as its own first slice \(a_{1}\). Finally, \(A\) obtains the aggregate of \(a_{1}\), \(a_{2}\), \(a_{3}\), \(b_{1}\), \(b_{2}\), \(b_{3}\), \(c_{1}\), \(c_{2}\), \(c_{3}\), \(a_{1}^{\prime }\), \(a_{2}^{\prime }\), \(a_{3}^{\prime }\), \(b_{1}^{\prime }\), \(b_{2}^{\prime }\), \(b_{3}^{\prime }\), \(c_{1}^{\prime }\), \(c_{2}^{\prime }\) and \(c_{3}^{\prime }\). If we set \(S_{1}\) to be the sum of \(a\), \(b\) and \(c\), then, because \(A\) knows every false value, it can remove the false values from this aggregate and obtain \(S_{1}\) without knowing \(b\) and \(c\) individually. Figures 3, 4 and 5 illustrate the aggregation process of all nodes.
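The three-node exchange described above can be traced with a short Python sketch. Several points are assumptions made for illustration only: encryption and decryption are omitted (values are passed in the clear), slices and false values are random integers in an arbitrary range, and the cluster head recovers \(S_{1}\) by subtracting the routed false values, which is one straightforward reading of how the false information is removed. The routing table mirrors Formulas (7)–(9).

```python
import random


def slice_value(value: int, parts: int, rng: random.Random) -> list[int]:
    """Split an integer into `parts` random slices that sum to `value`."""
    slices = [rng.randint(-1000, 1000) for _ in range(parts - 1)]
    slices.append(value - sum(slices))
    return slices


def aggregate_cluster(a: int, b: int, c: int, seed: int = 0) -> int:
    """Return S_1 = a + b + c as computed by cluster head A."""
    rng = random.Random(seed)
    data = {"A": a, "B": b, "C": c}
    slices = {n: slice_value(v, 3, rng) for n, v in data.items()}
    # Three false values per node, one for each slice (assumed range).
    false = {n: [rng.randint(-1000, 1000) for _ in range(3)] for n in data}

    # Receivers of (slice 2, slice 3) for each sender, as in Formulas (7)-(9).
    routes = {"A": ("B", "C"), "B": ("A", "C"), "C": ("A", "B")}
    F = {"A": 0, "B": 0, "C": 0}
    for sender, (to2, to3) in routes.items():
        F[to2] += slices[sender][1] + false[sender][1]
        F[to3] += slices[sender][2] + false[sender][2]

    # A holds a_1 and receives b_1, c_1 plus every node's false values.
    first_slices = sum(slices[n][0] for n in data)
    routed_false = sum(false[n][1] + false[n][2] for n in data)

    # F_A + F_B + F_C contains slices 2 and 3 mixed with the false values.
    return first_slices + F["A"] + F["B"] + F["C"] - routed_false


print(aggregate_cluster(5, 7, 11))
```

In this sketch node \(A\) never sees \(b_{2}\), \(b_{3}\), \(c_{2}\) or \(c_{3}\) on their own; it only sees them inside the mixed sums, which is the property the slicing step is meant to provide.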

Fig. 3 Privacy data slicing

Fig. 4 Encrypts and sends

Fig. 5 Data aggregation

3.3 The SECPDA algorithm flow

In SECPDA algorithm, node clustering, node data information processing and data aggregation algorithm flow are as follows:

figure a
figure b
figure c

4 Simulation and performance analysis

In this work, we introduced a secure and efficient privacy-preserving data aggregation algorithm. To simulate the clustering aggregation process of sensor nodes, we ran our algorithm in MATLAB. We suppose that 100 sensor nodes are deployed randomly in a 100 × 100 area, with the base station node located at the center (50, 50). The proportion of advanced nodes is set to 0.1. Figure 6 shows the random deployment of the 100 sensor nodes in the 100 × 100 area. Figure 7 shows the result of the cluster head election. Figure 8 shows the result of clustering by distance matrix. Figure 9 shows the cluster heads forwarding their aggregation results to the base station.
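A comparable setup can be sketched in Python. This is not the MATLAB code used for the experiments; it only reproduces the stated parameters (100 nodes, a 100 × 100 area, an advanced-node proportion of 0.1) and assigns each remaining node to its nearest cluster head by Euclidean distance, used here as a stand-in for signal strength. The function names, the seed and the example head ids are assumptions.

```python
import math
import random


def deploy(n=100, side=100.0, advanced_ratio=0.1, seed=1):
    """Place n nodes uniformly at random in a side x side square."""
    rng = random.Random(seed)
    n_adv = round(n * advanced_ratio)
    return [
        {"id": i,
         "x": rng.uniform(0, side),
         "y": rng.uniform(0, side),
         "advanced": i < n_adv}  # first n_adv nodes are marked advanced
        for i in range(n)
    ]


def assign_to_nearest_head(nodes, head_ids):
    """Map each cluster head id to the ids of the nodes closest to it."""
    heads = [nd for nd in nodes if nd["id"] in head_ids]
    clusters = {h["id"]: [] for h in heads}
    for nd in nodes:
        if nd["id"] in clusters:
            continue  # cluster heads do not join another cluster
        nearest = min(
            heads,
            key=lambda h: math.dist((nd["x"], nd["y"]), (h["x"], h["y"])),
        )
        clusters[nearest["id"]].append(nd["id"])
    return clusters


nodes = deploy()
clusters = assign_to_nearest_head(nodes, head_ids={0, 17, 42})  # hypothetical heads
for head, members in clusters.items():
    print(head, len(members))
```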

Fig. 6 Sensor node distribution

Fig. 7 Cluster head node election

Fig. 8 Within the cluster aggregation

Fig. 9 Outside the cluster aggregation

4.1 Communication overhead

We analyze the communication overhead of the CPDA, ILCCPDA and SECPDA algorithms in turn. In the CPDA algorithm, the nodes of a cluster need to broadcast their own seeds within the cluster. Suppose a cluster has \(n\) nodes. Then \(n\) nodes broadcast seeds, each node sends encrypted data to its \(n - 1\) neighbor nodes, and finally the \(n\) sensor nodes send their aggregation data to the cluster head node. In the ILCCPDA algorithm, each of the \(n\) nodes sends values to two nodes, and the other nodes in the cluster then send their aggregation values to the cluster head node. In the SECPDA algorithm, each node randomly selects two nodes and sends each of them one privacy slice together with the corresponding false information, so the \(n\) nodes emit \(4n\) data items. Finally, the \(n\) sensor nodes send their aggregation data to the cluster head node. We verified this analysis experimentally. As can be seen from Fig. 10, in the whole network, the SECPDA algorithm has less communication overhead than the CPDA algorithm and is not significantly different from the ILCCPDA algorithm. Both SECPDA and ILCCPDA use private data slicing. As the number of clusters increases, their communication overhead grows only slightly, whereas that of the CPDA algorithm increases greatly.
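The per-cluster counts in this paragraph can be tallied in a few lines of Python. The totals below are one reading of the description, not figures reported in the paper: CPDA is taken as \(n\) seed broadcasts plus \(n(n - 1)\) encrypted messages plus \(n\) aggregation reports, ILCCPDA as \(2n\) slice messages plus \(n - 1\) reports from the non-head nodes, and SECPDA as \(4n\) slice and false-data items plus \(n\) reports. The function names are hypothetical.

```python
def cpda_messages(n: int) -> int:
    """n seed broadcasts + n(n-1) encrypted messages + n reports."""
    return n + n * (n - 1) + n


def ilccpda_messages(n: int) -> int:
    """Each node sends to two nodes; the n-1 non-head nodes then report."""
    return 2 * n + (n - 1)


def secpda_messages(n: int) -> int:
    """4n slice and false-data items, then n aggregation reports."""
    return 4 * n + n


for n in (5, 10, 20, 40):
    print(n, cpda_messages(n), ilccpda_messages(n), secpda_messages(n))
```

Under this reading CPDA grows quadratically with cluster size while ILCCPDA and SECPDA grow linearly, which matches the qualitative trend described for Fig. 10.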

Fig. 10 Communication overhead comparison

The communication overhead of the whole network comprises the overhead arising from the network topology, the overhead within clusters and the overhead between clusters. Figure 11 shows a simulation of the communication overhead in the whole network. As can be seen from Fig. 11, the whole-network communication overhead of the SECPDA and ILCCPDA algorithms is lower than that of the CPDA algorithm.

Fig. 11 Communication overhead of the entire network

4.2 Privacy performance analysis

In the CPDA algorithm, sensor nodes exchange messages within the same cluster, so private data may be leaked to neighbor nodes. For a cluster of size \(m\), a node sends \(m - 1\) encrypted messages to the other \(m - 1\) members of the cluster. The node's data can be cracked only if the attacker obtains all \(m - 1\) keys; otherwise, the private data cannot be exposed. The average probability that the data of the nodes in a cluster are cracked is given by Formula (13):

$$ P_{1} (q) = \sum\limits_{{k = m_{c} }}^{{d_{\max } }} P (m = k)\left( {1 - \left( {1 - q^{k - 1} } \right)^{k} } \right). $$
(13)

Here, \(m_{c}\) is the minimum number of nodes in the cluster, and \(d_{\max }\) is the maximum number of nodes in the cluster, and \(q\) is the probability that the node link is cracked.

In ILCCPDA algorithm, if an eavesdropper wants to steal data from node s, they must know the two slices of data from node s and the information from the neighbor node. Therefore, the eavesdropper must break the link between node s and the neighbor node that gets the slice information of node s, as well as the link between the neighbor node and the cluster node. The average probability of data of all nodes in the cluster being cracked can be obtained as shown in Formula (14):

$$ P_{2} (q) = q^{2} \sum\limits_{k = 0}^{n - 1} {P(in = k)q^{k} } . $$
(14)

Here, \(q^{{2}}\) represents the probability of information being stolen from two neighboring nodes, and \(\sum\nolimits_{k = 0}^{n - 1} {P(in = k)q^{k} }\) represents the probability of all transmitted information being stolen.

In the proposed SECPDA algorithm, each node in the cluster randomly selects two neighbor nodes and sends them encrypted privacy slices and false information. Each node sends only two encrypted messages, whereas the number of encrypted messages a node receives is uncertain. The attacker needs to crack both the slice information sent by the node and the information received by the node. Therefore, the average probability that the data of the nodes in a cluster are cracked is given by Formula (15):

$$ P_{{3}} (q) = C_{2}^{1} qC_{2}^{1} q\sum\limits_{k = 0}^{n - 1} {P(in = k)C_{2}^{1} q^{k} } . $$
(15)

Here, \(q\) is the probability that a node link is cracked, \(C_{2}^{1} qC_{2}^{1} q\) is the probability that the sent messages are cracked, and \(\sum\nolimits_{k = 0}^{n - 1} {P(in = k)C_{2}^{1} q^{k} }\) is the probability that the received messages are cracked. The factor \(C_{2}^{1}\) accounts for whether a cracked item is slice information or a false message. \(P(in = k)\) is the probability that \(k\) nodes send information to node \(s\), as given in Formula (16):

$$ P(in = k) = C_{n - 1}^{k} \left( {\frac{1}{n - 1}} \right)^{k} \left( {\frac{n - 2}{{n - 1}}} \right)^{n - 1 - k} . $$
(16)
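Formulas (14)–(16) can be evaluated numerically with the Python sketch below. It is an illustration under stated assumptions: the cluster size `n` is a free parameter not fixed by the text, the function names are hypothetical, and each \(C_{2}^{1}\) in Formula (15) is taken literally as the binomial coefficient 2. If the authors intend that factor to enter differently (for example as a one-in-two chance of telling a slice from a false message), the absolute values would change, so the output should not be read as reproducing Fig. 12. Formula (13) is left out because it depends on the cluster-size distribution \(P(m = k)\), which is not specified here.

```python
import math


def p_in(k: int, n: int) -> float:
    """Formula (16): probability that k of the other n-1 nodes pick node s."""
    s = 1.0 / (n - 1)
    return math.comb(n - 1, k) * s**k * (1 - s) ** (n - 1 - k)


def p2_ilccpda(q: float, n: int) -> float:
    """Formula (14), evaluated as written."""
    return q**2 * sum(p_in(k, n) * q**k for k in range(n))


def p3_secpda(q: float, n: int) -> float:
    """Formula (15), with C_2^1 read literally as math.comb(2, 1) = 2."""
    c = math.comb(2, 1)  # assumption about the intended meaning of C_2^1
    received = sum(p_in(k, n) * c * q**k for k in range(n))
    return (c * q) * (c * q) * received


n = 10  # hypothetical cluster size
for q in (0.02, 0.05, 0.1):
    print(f"q={q:.2f}  P2={p2_ilccpda(q, n):.6f}  P3={p3_secpda(q, n):.6f}")
```

The terms \(P(in = k)\) form a binomial distribution over \(k = 0, \ldots, n - 1\) and therefore sum to 1, which gives a quick sanity check on Formula (16).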

Figure 12 compares the probability of private data being stolen under the CPDA, ILCCPDA and SECPDA algorithms for different values of the link cracking probability. In Fig. 12, the privacy protection capability of the SECPDA algorithm is higher than that of the CPDA and ILCCPDA algorithms.

Fig. 12 Privacy comparison

Security proof

The attack model is as follows.

Suppose that an adversary (Liu et al. 2020a, b; Wang et al. 2020a, b, c, d) wants to steal the private data of node A. The adversary then needs to obtain all slices (a1, a2 and a3) of the private data a, which is harder than obtaining unsliced private data. Moreover, when attacking the private data, the adversary cannot escape the interference of the false information, because the false information and the privacy slices are both encrypted under the random key distribution scheme. Even if the data sent by node A to other nodes are intercepted, the adversary still has to determine whether each eavesdropped item is a slice of the private data or false data.

An adversary may also try to impersonate a node and exchange slice data with the attacked nodes. Because a slice is only part of the private data, it carries little meaning on its own, and the node's false information adds further interference, which increases the difficulty of interception.

We can use the probability formula to analyze the privacy protection capability. In Formula (15), when q = 0.02, \(P_{{3}} (q)\) is approximately 0, so the probability of the data being intercepted is very low.

5 Conclusion

We propose a new data aggregation privacy protection algorithm called SECPDA, based on the CPDA algorithm. Our algorithm adopts the SEP protocol for cluster head election and node aggregation, which greatly reduces communication overhead. Privacy protection is improved by adding false information as interference. Theoretical analysis and simulation experiments show that the privacy performance of the SECPDA algorithm is better than that of the CPDA and ILCCPDA algorithms, and that data traffic is also effectively reduced. In future work, we will investigate how to ensure the integrity of data during transmission.