Introduction

With the rapid development of information technology, many companies, in addition to governments, now hold large amounts of citizens’ personal information. As a powerful data analysis tool, data mining can identify and extract implicit, previously unknown, and potentially useful knowledge and rules from large volumes of incomplete and noisy data. Data mining has made great contributions to scientific research, business decision-making, medical research, and other fields [1,2,3]. At the same time, it raises the problem of privacy disclosure, which has attracted increasing attention from industry and society. For example, mining medical case records with data mining and machine learning can reveal sensitive information, such as patients’ diseases. Privacy protection technology aims to mitigate the privacy threats caused by data analysis; its main purpose is to allow data to be analyzed without disclosing private information. In recent years, many privacy protection technologies have emerged, such as k-anonymity, l-diversity, and t-closeness. K-anonymity [4] makes each individual’s record in an anonymized dataset indistinguishable from at least \(k-1\) other records. However, k-anonymity cannot prevent attribute disclosure, so attackers can obtain sensitive information through background knowledge attacks and consistency attacks [5, 6]. The idea of l-diversity [7] is that the values of sensitive attributes should be diverse, so that a user’s sensitive information cannot be inferred from background knowledge. In a real dataset, however, attribute values are likely to be skewed or semantically similar, and l-diversity only guarantees diversity without recognizing semantic similarity among attribute values. Therefore, l-diversity is vulnerable to similarity attacks [6]. T-closeness [8] requires that the distribution of a sensitive attribute within each equivalence class be close to its distribution in the whole dataset, with the distance not exceeding a threshold t. Because t-closeness can prevent attribute disclosure but not identity disclosure, it may need to be combined with k-anonymity and l-diversity in practice. Moreover, t-closeness incurs a large information loss and is more difficult to generalize. In addition, there are some new attack models, such as the combined attack [9] and the foreground knowledge attack [10]. These new attack models pose a severe challenge to the effectiveness of the above methods.

Differential privacy (DP) is a widely recognized, rigorous privacy protection technology, first proposed by Dwork [11]. It prevents malicious adversaries from inferring a user’s sensitive information even if they know the published results. DP has become the de facto privacy standard around the world in recent years, with the U.S. Census Bureau using it in their Longitudinal Employer-Household Dynamics Program in 2008 [12], and the technology company Apple implementing DP in their latest operating systems and applications [13]. Applying DP to a machine learning model can protect the training data from model inversion attacks when the model parameters are released.

Classification technology plays a key role in data prediction and analysis, and the decision tree (DT) is a typical representative of classification models. DT is a non-parametric supervised learning method that makes no assumptions about the distribution of the underlying data [14]. However, the transparency of DT can be exploited by attackers to steal personal information. Suppose two trees are trained on two adjacent datasets, which differ in at most one record. Adversaries can then obtain sensitive personal information from the database by comparing the counting results of the two trees. To solve the problem of privacy disclosure in DT, scholars apply DP to the construction of DT to realize privacy protection.

Although scholars have put forward many differentially private decision tree (DPDT) models, some problems still need to be overcome:

  • When inner nodes use the Laplace mechanism to realize privacy protection, no matter what split criterion is adopted, there must be a lot of fine-grained counting queries, which will inevitably lead to the accumulation of noise.

  • In the process of privacy budget allocation, none of the DPDT models takes into account the importance of leaf nodes in the final classification prediction. When the difference between the numbers of samples in the largest category and the second-largest category of a leaf node is small, adding more noise will directly lead to wrong classification results. Conversely, when the difference is large, the leaf node can bear more noise, which improves the privacy protection ability of DPDT. Therefore, a separate privacy budget allocation strategy should be formulated for each leaf node.

  • Many DPDT models can only deal with discrete attributes, and the others directly use the exponential mechanism to select partition points when discretizing continuous attributes. The exponential mechanism needs to traverse all potential partition points of each continuous attribute, which is inefficient and consumes privacy budget.

  • Bootstrap sampling makes the training sets overlap, so the total privacy budget needs to be evenly divided among the base classifiers. The smaller the privacy budget, the greater the noise, and the worse the classification performance of the base classifiers.

  • In the ensemble learning models based on DPDT, the voting strategy is usually majority voting or weighting by the accuracy of the base classifiers. To improve the privacy protection ability, DPDT has to introduce noise in the process of constructing nodes, which weakens the classification ability of some base classifiers. Therefore, it is necessary to set an appropriate weight for each base classifier to obtain an ensemble model with strong classification ability.

To overcome the above issues, we propose a DPDT model called DPtree. In addition, to improve the classification ability of DPtree, we propose an ensemble learning model based on DPtree, called En-DPtree. The main contributions of this article are as follows:

  • The exponential mechanism is utilized in inner nodes, which can obtain the split attribute only by one calculation without multiple counting queries, so as to avoid noise accumulation.

  • We formulate an adaptive privacy budget allocation strategy for each leaf node according to its sample category distribution, which can not only ensure that the classification result of each leaf node is not distorted, but also improve the privacy protection ability of DPtree.

  • The Fayyad theorem is applied to quickly locate the best partition points of continuous attributes by comparing the adjacent boundary points of different categories, which can greatly improve the computational efficiency and does not occupy privacy budget.

  • To economize on privacy budget, we use the sampling without replacement method to obtain training sets. Therefore, the privacy budget of each DPtree is equal to the total privacy budget.

  • We design a multi-population quantum genetic algorithm (MPQGA) to search for an appropriate weight for each base classifier, so as to improve the classification ability of En-DPtree. In MPQGA, the individuals of each population evolve toward the optimal solution of their own population, which reduces the possibility of the algorithm finally falling into a local optimum. In addition, we design immigration operators and elite groups to avoid premature convergence of each population.

The rest of this paper is organized as follows. The section “Related work” presents previous works related to DPDT. The section “Differential privacy” introduces the contents of DP. The section “Analysis on MaxTree algorithm” shows the analysis on MaxTree. The section “Differentially private decision tree” describes the specific implementation steps of DPtree and En-DPtree. The section “The ensemble learning based on DPtree” shows the experimental process and results. The section “Time complexity analysis” shows the time complexity analysis of DPtree and En-DPtree. The section “Conclusion” presents the conclusion.

Related work

The representative DPDT schemes are SuLQ-based ID3, DiffP-C4.5, and DiffGen. In 2005, Blum et al. first introduced DP into DT based on the SuLQ framework and proposed SuLQ-based ID3 [15]. This algorithm achieves privacy protection by adding Laplace noise to the query results in ID3. However, adding Laplace noise to each count used in the information gain leads to a large amount of accumulated noise and wastes privacy budget. To solve these problems, Friedman and Schuster proposed PINQ-based ID3 [16]. This algorithm runs ID3 on multiple mutually exclusive data subsets, so it uses the privacy budget more effectively. However, PINQ-based ID3 still needs to add Laplace noise in the calculation of information gain, so it cannot significantly reduce the noise. Friedman and Schuster also proposed DiffP-ID3, which is based on the exponential mechanism [17]. Since the exponential mechanism can evaluate attributes through a single query, DiffP-ID3 reduces the noise introduced. Because ID3 can only deal with discrete data, Friedman and Schuster further proposed DiffP-C4.5 [17], which handles continuous data based on the exponential mechanism. Different from the above methods, DiffGen [18] uses a classification tree to divide all records in the dataset into leaf nodes from top to bottom, and then adds Laplace noise to the counts in leaf nodes. DiffGen improves classification accuracy, and each classification attribute corresponds to a classification tree. When the dimension of the sample attributes is very large, this leads to inefficient selection based on the exponential mechanism and may exhaust the privacy budget. In addition, there are many other DPDT models. During the construction of DT, the number of instances in each layer usually decreases. If each layer is given the same privacy budget, the signal-to-noise ratio of DT inevitably becomes unbalanced. Therefore, Liu et al. [19] designed a budget allocation strategy in which less noise is added at larger depths to balance the true counts against the noise. To reduce the influence of the noise introduced by the Laplace mechanism, Wu et al. [20] designed up–down and bottom–up approaches to reduce the number of nodes in a random DT. However, no matter which strategy is used to build DPDT, noise is unavoidably introduced during node generation, resulting in weak and unstable classification ability. How to protect privacy while also improving the classification ability of DT remains a difficult problem.

Fortunately, in most cases ensemble learning can significantly improve the generalization ability of base learners by training multiple learners and combining their results. Therefore, scholars use ensemble learning to improve the classification performance of DPDT. Random forest under DP was first proposed by Jagannathan et al. [21]. In this model, the base classifier is an ID3-based DPDT. However, they found that obtaining reasonable privacy protection came at a great loss of prediction accuracy. Therefore, they proposed an improved ensemble learning model based on random DT, which we call En-RDT. Fletcher and Islam [22] proposed a DP decision forest that takes advantage of a theorem on the local sensitivity of the Gini index. In addition, they used the theory of signal-to-noise ratios to automatically tune the model parameters [23]. However, the split features are randomly selected, which might degrade the trees. Patil and Singh [24] designed a random forest algorithm satisfying DP, called DiffPRF. DiffPRF first uses entropy to discretize continuous attributes, and then uses DiffP-ID3 as the base learner to generate a random forest. Subsequently, Yin et al. [25] proposed DiffP-RFs based on DiffPRF. DiffP-RFs does not need to preprocess the dataset; it extends the monotonic privacy budget allocation strategy and removes the limitation that DiffPRF can only deal with discrete attributes. Compared with random forests, boosting with DP has rarely been studied. Privacy-preserving boosting usually uses distributed training: each party uses its own data to train a DP boosting classifier, and the classifiers are finally aggregated through a trusted third party without the parties sharing their datasets [26,27,28]. This method allows users to define privacy levels according to their own needs, but the training data of each party are often limited. Li et al. [29] proposed a gradient boosting method based on DP. They filter the data based on gradients and use geometric leaf clipping to ensure a smaller sensitivity bound. Shen [30] proposed DP-AdaBoost using single-layer ID3. When adding noise, the algorithm does not use the counting function directly but also takes the weight of each record into account. However, the algorithm ignores the influence of tree depth on classification ability. Jia and Qiu [31] proposed DP-AdaBoost based on CART, which can handle continuous features.

Differential privacy

DP is a strict privacy definition against the individual privacy leakage that guarantees the outcome of a calculation to be insensitive to any particular record in the dataset. In the following, we introduce the definition of DP and two important theorems.

Definition 1

(Differential privacy [11]) We say a randomized computation F provides \(\epsilon \)-DP if, for any adjacent datasets \(D_{1}\) and \(D_{2}\) with symmetric difference \(|D_{1}\mathrm{\Delta } D_{2}|=1\), and any set of possible outcomes \(S\subseteq Range(F)\),

$$\begin{aligned} Pr[F(D_{1})\in S]\le e^{\epsilon }\cdot Pr[F(D_{2})\in S]. \end{aligned}$$
(1)

The parameter \(\epsilon \) is called the privacy budget. The smaller \(\epsilon \) is, the stronger the privacy protection.

Definition 2

(Sensitivity [32]) Given an arbitrary function \(f:D\rightarrow R^{d}\), the sensitivity of f is defined as

$$\begin{aligned} \Delta f=\max _{D_{1},D_{2}}||f(D_{1})-f(D_{2})||_{1}, \end{aligned}$$
(2)

where \(D_{1}\) and \(D_{2}\) differ in one record and d is the dimension of the function f.

Theorem 1

(Laplace mechanism [32]) Given an arbitrary function \(f:D\rightarrow R^{d}\), for an arbitrary domain D, the function F provides \(\epsilon \)-DP if F satisfies

$$\begin{aligned} F(D)=f(D)+\left( Lap\left( \frac{\Delta f}{\epsilon }\right) \right) ^{d}, \end{aligned}$$
(3)

where the noise \(Lap\left( \frac{\Delta f}{\epsilon }\right) \) is drawn from a Laplace distribution, and d is the dimension of the function f.
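As a rough illustration of Theorem 1, the sketch below perturbs a vector of counts with Laplace noise of scale \(\Delta f/\epsilon \). The function names `laplace_noise` and `laplace_mechanism` are hypothetical and are not taken from the algorithms discussed here. The sampler relies on the fact that the difference of two independent exponential variables with the same mean follows a zero-mean Laplace distribution.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a zero-mean Laplace distribution Lap(scale)."""
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and scale `scale`.
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def laplace_mechanism(true_values, sensitivity, epsilon, rng=None):
    """Return f(D) + Lap(sensitivity / epsilon)^d for a d-dimensional f(D)."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    rng = random.Random() if rng is None else rng
    scale = sensitivity / epsilon
    # Independent noise is added to every coordinate of the answer.
    return [value + laplace_noise(scale, rng) for value in true_values]


# Example: noisy class counts of a node (a counting query has sensitivity 1).
noisy_counts = laplace_mechanism([10, 2], sensitivity=1.0, epsilon=0.5,
                                 rng=random.Random(0))
```

A smaller \(\epsilon \) enlarges the noise scale, which is the trade-off between privacy and accuracy that the budget allocation strategies in later sections try to manage.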

Theorem 2

(Exponential mechanism [33]) Let F be a randomized mechanism whose input is a dataset D and whose output is an entity object \(r\in Range\). Let \(q(D,r)\) be a score function that assigns each output r a score, and let \(\Delta q\) be the sensitivity of the score function. Then, the mechanism F maintains \(\epsilon \)-DP if F satisfies

$$\begin{aligned} F(D,q)=\left\{ \textrm{return}\ r \mathrm{\ with \ probability} \propto \exp \left( \frac{\epsilon q(D,r)}{2\Delta q}\right) \right\} . \end{aligned}$$
(4)
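The selection rule of Theorem 2 can be sketched as follows. This is a minimal illustration rather than the implementation used in any of the cited models; the function names are hypothetical, and the scores are assumed to be precomputed values of \(q(D,r)\) for each candidate r.

```python
import math
import random


def exponential_mechanism_probs(scores, sensitivity, epsilon):
    """Selection probabilities proportional to exp(eps * q / (2 * delta_q))."""
    exponents = [epsilon * s / (2.0 * sensitivity) for s in scores]
    # Subtracting the maximum avoids overflow and leaves the
    # normalized probabilities unchanged.
    shift = max(exponents)
    weights = [math.exp(e - shift) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def exponential_mechanism(candidates, scores, sensitivity, epsilon, rng=None):
    """Sample one candidate according to the exponential mechanism."""
    rng = random.Random() if rng is None else rng
    probs = exponential_mechanism_probs(scores, sensitivity, epsilon)
    return rng.choices(candidates, weights=probs, k=1)[0]


# Example: choose a split attribute from hypothetical quality scores.
chosen = exponential_mechanism(["age", "income", "city"], [0.30, 0.25, 0.05],
                               sensitivity=1.0, epsilon=1.0,
                               rng=random.Random(0))
```

Only one draw is made per selection, so a single budget \(\epsilon \) covers the whole choice regardless of how many candidates are scored.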

Analysis on MaxTree algorithm

The Maxtree algorithm [19] proposed by Liu et al. is shown in Algorithm 1. When constructing DT, it adds noise to the relevant counting results through the Laplace mechanism to realize privacy protection. Generally, as tree depth increases, the node counts decrease accordingly. If all nodes are allocated the same privacy budget, the signal-to-noise ratio of nodes at different depths becomes unbalanced. To overcome this defect, Maxtree allocates more privacy budget to deeper nodes.

Specifically, assuming that only k attributes are used to build branch nodes, \(\frac{1}{k-i}\) shares of privacy budget are allocated to each inner node of the ith layer and 1 share to each leaf node. Then, the total privacy budget shares of DT are

$$\begin{aligned} s_{t}=\frac{1}{k}+\frac{1}{k-1}+\cdots +\frac{1}{k-(k-1)}+1. \end{aligned}$$
(5)

Suppose \(\epsilon \) denotes the total privacy budget of DT. Then Maxtree allocates \(\frac{\epsilon }{s_{t}}*\frac{1}{k-i}\) privacy budget to each inner node in the ith layer. At this point, \(k-i\) attributes need to be evaluated through the attribute evaluation function. Therefore, the privacy budget available for evaluating each attribute is \(\epsilon _{i}=\frac{\epsilon }{s_{t}}*\frac{1}{(k-i)^2}\).
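To make the share arithmetic concrete, the following sketch computes \(s_{t}\) from Eq. (5) together with the per-node, per-attribute, and leaf budgets. The function name `maxtree_budget` is hypothetical, and the layer index is assumed to run over \(i=0,\ldots ,k-1\), which is inferred from the terms of Eq. (5) rather than stated explicitly.

```python
def maxtree_budget(epsilon, k):
    """Split a total budget `epsilon` following the Maxtree share scheme.

    Assumes inner layers are indexed i = 0, ..., k-1 (inferred from Eq. (5)).
    Returns (s_t, per_node, per_attribute, leaf).
    """
    if k < 1 or epsilon <= 0:
        raise ValueError("k must be >= 1 and epsilon must be positive")
    # Total shares: 1/(k-i) for every inner layer plus 1 share for leaves.
    s_t = sum(1.0 / (k - i) for i in range(k)) + 1.0
    unit = epsilon / s_t  # budget corresponding to one share
    per_node = [unit / (k - i) for i in range(k)]
    # Each inner node still divides its budget among k-i attributes.
    per_attribute = [unit / (k - i) ** 2 for i in range(k)]
    leaf = unit
    return s_t, per_node, per_attribute, leaf


s_t, per_node, per_attribute, leaf = maxtree_budget(epsilon=1.0, k=3)
```

For \(k=3\), the shares are \(\frac{1}{3}+\frac{1}{2}+1+1=\frac{17}{6}\), and the per-attribute budget in the first layer is only one ninth of a share, which illustrates how quickly the second division shrinks the budget of each query.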

In internal nodes, the use of the Laplace mechanism must involve counting queries. The more counting queries there are, the more finely the privacy budget is divided, and the greater the noise added to each query. Maxtree selects the maximum label vote rather than information gain in the attribute evaluation process, which reduces the number of queries to some extent. However, after allocating privacy budget to each layer, Maxtree still needs to divide the budget again among the features to be evaluated. Therefore, the use of the Laplace mechanism in internal nodes cannot fundamentally avoid this second division of the privacy budget.

In leaf nodes, the query datasets do not intersect, so the privacy budget allocated to leaf nodes does not need to be divided again. Because leaf nodes are the key to classification, Maxtree allocates the most privacy budget for each leaf node, that is, 1 share. The larger the privacy budget, the less noise generated, and the weaker the privacy protection capability. Therefore, the privacy budget strategy in leaf nodes of Maxtree leads to weak privacy protection ability.

figure a

On the whole, there are several difficulties in constructing DPDT. First, the counting queries of attributes in internal nodes need to subdivide privacy budget, resulting in high noise and low accuracy of DT. Second, assigning the same share of privacy budget to all leaf nodes only takes into account the reduction of noise, but leads to weak privacy protection ability.

Differentially private decision tree

In this section, we design an improved DPDT algorithm called DPtree. DPtree uses the Fayyad theorem to quickly discretize continuous attributes. Besides, DPtree uses the exponential mechanism and the Laplace mechanism to construct inner nodes and leaf nodes, respectively. Each inner node has the same privacy budget, and each leaf node adaptively adjusts its privacy budget according to its sample category distribution. The generation process of DPtree can not only ensure that the classification results are not distorted, but also improve the privacy protection ability. The key to constructing DPDT lies in the generation of the tree structure and the allocation of privacy budget. In the section “Generation of tree structure”, we introduce some important problems involved in the generation of the tree structure. In the section “Privacy budget allocation strategy”, we introduce a practical privacy budget allocation strategy. The privacy analysis is given in the section “Privacy analyze”.

Generation of tree structure

In the process of growing DT, two kinds of nodes are involved, namely inner nodes and leaf nodes. According to the analysis of Maxtree, if inner nodes use the Laplace mechanism, fine-grained queries are inevitable no matter which rule is used to evaluate attributes. Fine-grained queries lead to accumulation of noise and ultimately affect classification accuracy. Therefore, we choose the exponential mechanism in inner nodes, with the information gain rate as the quality function. The exponential mechanism does not need multiple counting queries; it needs only one calculation to select the split attribute. As a result, the exponential mechanism economizes on privacy budget and overcomes the inherent defects of the Laplace mechanism in inner nodes. In leaf nodes, we still use the Laplace mechanism to compare the number of samples in each category.

In addition, continuous attributes are usually involved in the construction of DT. C4.5 algorithm [34] has a great advantage over ID3 algorithm, that is, it can deal with both discrete attributes and continuous attributes. C4.5 algorithm utilizes dichotomy to discretize continuous attributes. The specific process is as follows: assume that there are m different values of continuous attribute A in dataset D, and sort them in ascending order to get \(\{a_{1}, a_{2},\ldots , a_{m}\}\). Calculate the middle points \(t=(a_{i}+a_{i+1})/2\) of every two adjacent elements, and t denotes a potential partition point (there are \(m-1\) potential partition points in total). Among the \(m-1\) potential partition points, the point with the largest information gain rate is the optimal partition point of attribute A. Obviously, C4.5 algorithm needs to traverse all potential partition points of each continuous attribute, which will lead to low efficiency. How to quickly locate the best partition point of continuous attributes has become an urgent problem to be solved.

We use the Fayyad theorem [35] to quickly locate the best partition point of each continuous attribute. According to the Fayyad theorem, the best partition point always appears at a boundary between two adjacent instances of different categories. Therefore, there is no need to compare every threshold point; it suffices to compare the boundary points between adjacent instances of different categories to discretize continuous attributes. Suppose the instance attribute values after ascending sorting are \(\{a_{1}, a_{2},\ldots , a_{9}\}\), in which the first three samples belong to category \(c_{1}\), the middle three belong to category \(c_{2}\), and the last three belong to category \(c_{3}\). According to the Fayyad theorem, only the boundary points at \(a_{3}\) and \(a_{6}\) (that is, the cuts between \(a_{3}\) and \(a_{4}\) and between \(a_{6}\) and \(a_{7}\)) are potential partition points, and the one with the largest information gain rate is the best partition point of attribute A. If we use the dichotomy, there are eight potential partition points. Therefore, the continuous attribute discretization method based on the Fayyad theorem can greatly improve the efficiency.
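The difference between the two candidate sets can be sketched as follows. The function names are hypothetical, and representing each boundary by the midpoint of the two adjacent values is an assumption that follows the C4.5 convention described above; the text itself refers to the boundaries by the value on their left.

```python
def midpoint_candidates(values):
    """All m-1 midpoints between adjacent distinct values (C4.5 dichotomy)."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def boundary_candidates(values, labels):
    """Midpoints only where the class changes between adjacent values."""
    # Collect the set of class labels observed at each distinct value.
    labels_at = {}
    for value, label in zip(values, labels):
        labels_at.setdefault(value, set()).add(label)
    distinct = sorted(labels_at)
    cuts = []
    for a, b in zip(distinct, distinct[1:]):
        left, right = labels_at[a], labels_at[b]
        # No boundary only if both values carry the same single class.
        if len(left) > 1 or len(right) > 1 or left != right:
            cuts.append((a + b) / 2)
    return cuts


# The nine-sample example from the text, with a_i = i.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
labels = ["c1"] * 3 + ["c2"] * 3 + ["c3"] * 3
all_cuts = midpoint_candidates(values)               # eight candidates
fayyad_cuts = boundary_candidates(values, labels)    # two candidates
```

Only the two boundary cuts then need their information gain rate evaluated, instead of all eight midpoints.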

During the construction of DPtree, it is necessary to determine whether the current node is a leaf node or an inner node. If all the samples of the current node belong to the same category, or the set of candidate attributes is empty, or the current node reaches the maximum depth, the current node is a leaf node. In addition, Laplace noise is added to the number of samples contained in the current node, that is, \(N_{node}=NoisyCount_{\epsilon _{0}}(node)\). If \(N_{node}\) is smaller than the threshold \(\tau \), the current node is a leaf node. Otherwise, the current node is an inner node. For a leaf node, the Laplace mechanism is used to add noise to the sample count of each category in the node, and the node is labeled according to \(argmax_{c}(n_{c}+Lap(1/\epsilon _{l}))\). For an inner node, the exponential mechanism is used to select the split attribute, that is, \(\overline{A}=ExpMech_{\epsilon _{0}}(\mathcal {A},q)\). The score function q of the exponential mechanism is the information gain rate. The resulting DPtree model is shown in Algorithm 2, and an overview of DPtree is shown in Fig. 1. DPtree discretizes continuous attributes based on the Fayyad theorem, and adopts the exponential mechanism and the Laplace mechanism in internal nodes and leaf nodes, respectively.

Fig. 1 An overview of DPtree

figure b

Privacy budget allocation strategy

For inner nodes, selecting split attributes with the exponential mechanism has little correlation with the number of instances, so each inner node is allocated an equal privacy budget \(\epsilon _{0}\). In addition, since the classification of leaf nodes directly affects the final classification results, a larger privacy budget of \(2\epsilon _{0}\) is allocated to leaf nodes, so that less noise is introduced and the classification results are not distorted. From the above analysis of Maxtree, it can be seen that allocating the same large privacy budget to every leaf node only considers noise reduction and leads to weak privacy protection ability. Therefore, we should formulate a unique privacy budget allocation strategy for each leaf node.

When there is a large difference between the numbers of samples in the largest category and the second-largest category of a leaf node, the classification result will not be distorted even if large noise is added. In this case, less privacy budget can be allocated to the leaf node, which improves the privacy protection ability. Suppose there are two categories of data in the leaf nodes \(n_{a}\) and \(n_{b}\). In the leaf node \(n_{a}\), ten samples belong to the category \(c_{1}\) and two samples belong to the category \(c_{2}\). In the leaf node \(n_{b}\), seven samples belong to the category \(c_{1}\) and five samples belong to the category \(c_{2}\). Suppose that the Laplace noise added to the counts of \(c_{1}\) and \(c_{2}\) in a leaf node is \(L_{1}\) and \(L_{2}\), respectively. The classification result of \(n_{a}\) is not distorted when \(10+L_{1}>2+L_{2}\), that is, when \(L_{2}<8+L_{1}\). Likewise, the classification result of \(n_{b}\) is not distorted when \(L_{2}<2+L_{1}\). The leaf node \(n_{a}\) can therefore tolerate more noise than the leaf node \(n_{b}\). Accordingly, we formulate an adaptive privacy budget allocation strategy for each leaf node according to its sample category distribution. Let \(\alpha \) be the ratio of the sample count of the second-largest category to that of the largest category; then the privacy budget of this leaf node is \(\epsilon _{l}=\alpha *2\epsilon _{0}\). On the one hand, when the difference between the two counts is very small, the privacy budget is almost equal to \(2\epsilon _{0}\), which ensures that the classification result is not distorted. On the other hand, the greater the difference between the two counts, the lower the privacy budget and the stronger the privacy protection ability. Therefore, this adaptive privacy allocation scheme can not only ensure that the classification result of each leaf node is not distorted, but also improve the privacy protection ability.
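The adaptive rule \(\epsilon _{l}=\alpha *2\epsilon _{0}\) can be sketched for the two example leaves as follows. The function name `leaf_budget` and the optional `min_alpha` floor are hypothetical; the floor is an assumption added here because a leaf whose second-largest count is zero would otherwise receive \(\epsilon _{l}=0\), a case the text does not discuss.

```python
def leaf_budget(class_counts, epsilon0, min_alpha=0.0):
    """Adaptive leaf budget eps_l = alpha * 2 * eps_0.

    alpha is the ratio of the second-largest class count to the largest.
    `min_alpha` is a hypothetical floor (not part of the described scheme);
    with the default 0.0 the formula is applied exactly as stated.
    """
    ordered = sorted(class_counts, reverse=True)
    largest = ordered[0]
    second = ordered[1] if len(ordered) > 1 else 0
    # An empty leaf has no meaningful ratio; treat it as alpha = 1 (assumption).
    alpha = second / largest if largest > 0 else 1.0
    alpha = max(alpha, min_alpha)
    return alpha * 2 * epsilon0


eps0 = 1.0
budget_a = leaf_budget([10, 2], eps0)  # leaf n_a: alpha = 2/10
budget_b = leaf_budget([7, 5], eps0)   # leaf n_b: alpha = 5/7
```

With \(\epsilon _{0}=1\), leaf \(n_{a}\) receives \(0.4\) and leaf \(n_{b}\) receives \(10/7\), so the leaf with the smaller margin is given the larger budget and hence less noise.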

Privacy analyze

DP has two important properties, namely sequential composition and parallel composition, which play an important role in privacy budget allocation. Suppose each randomized mechanism \(F_{i}\) is applied to a subset of database D and provides \(\epsilon _{i}\)-DP. When the subsets overlap, sequential composition implies that the whole mechanism provides \(\sum _{i}\epsilon _{i}\)-DP [36]. When the subsets are disjoint, parallel composition implies that the whole mechanism provides \(\max _{i}\epsilon _{i}\)-DP [36].

Theorem 3

DPtree satisfies \(\epsilon \)-DP.

Proof

Assume adjacent datasets D and \(D^{'}\) with symmetric difference \(|D\mathrm{\Delta } D^{'}|=1\), and let F(D) and \(F(D^{'})\) denote the outputs of the randomized algorithm on D and \(D^{'}\), respectively. Each attribute \(A_{x}\) has \(|r_{x}|\) division methods, and we can get

$$\begin{aligned}{} & {} \frac{prob(F(D)=r_{i})}{prob(F(D^{'})=r_{i})}= \frac{\prod _{i=1}^{|r_{x}|}p(r_{i})E(D,r_{i})}{\prod _{i=1}^{|r_{x}|}p(r_{i})E(D^{'},r_{i})} \nonumber \\{} & {} \quad =\prod _{i=1}^{|r_{x}|} \frac{\left( \exp \left( \frac{\epsilon _{0}q(D,r_{i})}{2|r_{x}|\Delta q}\right) \right) ^{2}}{\left( \exp \left( \frac{\epsilon _{0}q(D^{'},r_{i})}{2|r_{x}|\Delta q}\right) \right) ^{2}} \cdot \frac{\sum _{r_{i}\in Range}\exp \left( \frac{\epsilon _{0}q(D^{'},r_{i})}{2|r_{x}|\Delta q}\right) }{\sum _{r_{i}\in Range}\exp \left( \frac{\epsilon _{0}q(D,r_{i})}{2|r_{x}|\Delta q}\right) }\nonumber \\{} & {} \quad =\prod _{i=1}^{|r_{x}|}\left( \exp \left( \frac{\epsilon _{0}(q(D,r_{i})-q(D^{'},r_{i}))}{2|r_{x}|\Delta q}\right) \right) ^{2}\nonumber \\{} & {} \qquad \cdot \frac{\sum _{r_{i}\in Range}\exp \left( \frac{\epsilon _{0}q(D^{'},r_{i})}{2|r_{x}|\Delta q}\right) }{\sum _{r_{i}\in Range}\exp \left( \frac{\epsilon _{0}q(D,r_{i})}{2|r_{x}|\Delta q}\right) }\nonumber \\{} & {} \quad \le \prod _{i=1}^{|r_{x}|}\exp \left( \frac{\epsilon _{0}}{2|r_{x}|}\right) \cdot \exp \left( \frac{\epsilon _{0}}{2|r_{x}|}\right) \cdot \frac{\sum _{r_{i}\in Range}\exp \left( \frac{\epsilon _{0}q(D,r_{i})}{2|r_{x}|\Delta q}\right) }{\sum _{r_{i}\in Range}\exp \left( \frac{\epsilon _{0}q(D,r_{i})}{2|r_{x}|\Delta q}\right) }\nonumber \\ {}{} & {} \quad =\prod _{i=1}^{|r_{x}|}\exp \left( \frac{\epsilon _{0}}{|r_{x}|}\right) \nonumber \\{} & {} \quad =\exp (\epsilon _{0}). \end{aligned}$$
(6)

The degree of differential privacy protection for inner nodes is

$$\begin{aligned} \prod _{i=1}^{d}\exp (\epsilon _{0})=\exp (d\epsilon _{0}). \end{aligned}$$
(7)

Therefore, the total privacy budget of inner nodes is \(d\epsilon _{0}\).

Add the Laplace noise to class counting function \(Count(D)\), and the output after disturbance is denoted as \(NoisyCount(D)=Count(D)+Lap(1/\epsilon _{l})\). We can get

$$\begin{aligned} \begin{aligned}&\frac{prob(NoisyCount(D)\in S )}{prob(NoisyCount(D^{'})\in S)}\\&=\frac{\exp (-\epsilon _{l}|Count(D)-NoisyCount(D)|)}{\exp (-\epsilon _{l}|Count(D)-NoisyCount(D^{'})|)}\\&=\exp (\epsilon _{l}(|Count(D)-NoisyCount(D^{'})|-|Count(D)\\&-NoisyCount(D)|))\\&\le \exp (\epsilon _{l}|NoisyCount(D)-NoisyCount(D^{'})|)\\&\le \exp (2\epsilon _{0}). \end{aligned}\nonumber \\ \end{aligned}$$
(8)

Therefore, the total privacy budget of leaf nodes is \(2\epsilon _{0}\).

Besides, each node of DPtree needs to be judged as an inner node or a leaf node, and each such test consumes a privacy budget of \(\epsilon _{0}\), giving \((d+1)\epsilon _{0}\) over the d inner levels and the leaf level. The total privacy budget of DPtree is therefore \(d\epsilon _{0}+2\epsilon _{0}+(d+1)\epsilon _{0}=(2d+3)\epsilon _{0}\). Hence, with \(\epsilon =(2d+3)\epsilon _{0}\), DPtree satisfies \(\epsilon \)-DP. \(\square \)
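Summing the three terms at the end of the proof gives \(d\epsilon _{0}+2\epsilon _{0}+(d+1)\epsilon _{0}=(2d+3)\epsilon _{0}\), so a given total budget fixes the per-step budget. The short sketch below performs this conversion; the function name is hypothetical, and d is assumed to denote the number of inner-node levels, as suggested by Eq. (7).

```python
def per_step_budget(total_epsilon, depth):
    """Solve (2d + 3) * eps_0 = eps for eps_0.

    `depth` is assumed to be d, the number of inner-node levels in Eq. (7).
    """
    if total_epsilon <= 0 or depth < 0:
        raise ValueError("total_epsilon must be positive and depth >= 0")
    return total_epsilon / (2 * depth + 3)


eps0 = per_step_budget(total_epsilon=1.3, depth=5)
# Recombine the three parts of the proof to check the total.
total = 5 * eps0 + 2 * eps0 + (5 + 1) * eps0
```

For example, a total budget of \(1.3\) with \(d=5\) leaves \(\epsilon _{0}=0.1\) for each split selection and each node test, and up to \(0.2\) for a leaf.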

The ensemble learning based on DPtree

DP improves the security of DT, but it reduces accuracy. By combining multiple DT models into an ensemble, a decision forest reduces, to a certain extent, the impact of the large variance of individual DT models on generalization performance. Moreover, since a greedy strategy is used in the construction of each DT, the decision forest can achieve better performance through integration. Therefore, we use the bagging technology to build the ensemble model based on DPtree.

Generally, the bagging technology [37] uses bootstrap sampling to obtain a set of training sets \(\{D_{1}, D_{2},\ldots , D_{M}\}\); each DT is trained on one of these training sets, and test samples are then predicted by voting. Assume the total privacy budget is \(\epsilon \). Because the bootstrap training sets overlap, the privacy budget allocated to each DT is only \(\epsilon /M\). To economize on privacy budget, we instead use sampling without replacement to obtain the training sets \(\{D_{1}, D_{2},\ldots , D_{M}\}\). Therefore, the privacy budget of each DT is \(\epsilon \). Furthermore, weighted voting schemes usually set the weights manually or use the confidence of the base learners as the weights, which is usually not optimal. The weights of the base classifiers play a great role in the final classification results, so it is necessary to obtain a better set of weights by searching the weight space. Therefore, we need to design an intelligent optimization algorithm to generate a better set of weights for the ensemble model.
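One way to realize sampling without replacement is to shuffle the records once and deal them into M subsets, as in the sketch below. The function name is hypothetical, and the choice of a full partition into nearly equal subsets is an assumption, since the subset sizes are not specified in the text.

```python
import random


def disjoint_training_sets(records, num_trees, rng=None):
    """Partition `records` into `num_trees` disjoint training sets.

    Assumption: every record is used exactly once and the subsets are
    as equal in size as possible.
    """
    if num_trees < 1:
        raise ValueError("num_trees must be at least 1")
    rng = random.Random() if rng is None else rng
    shuffled = list(records)
    rng.shuffle(shuffled)
    # Dealing every num_trees-th record to a subset keeps the subsets disjoint.
    return [shuffled[i::num_trees] for i in range(num_trees)]


training_sets = disjoint_training_sets(range(10), num_trees=3,
                                       rng=random.Random(0))
```

Because no record appears in more than one subset, the trees are trained on disjoint data, which is the condition under which parallel composition lets each tree use the full budget \(\epsilon \).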

The genetic algorithm (GA) [38] is good at solving global optimization problems, and it can escape local optima to find the global optimum. However, when the selection, crossover, and mutation operators are not appropriate, GA requires many iterations and suffers from slow convergence, premature convergence, and other defects. In recent years, using quantum theory to further improve intelligent optimization algorithms has become a very popular research direction [39, 40]. The quantum genetic algorithm (QGA) [41] is a fusion algorithm that introduces quantum computing concepts into GA: it uses quantum states to encode chromosomes and uses the quantum rotation gate to perform genetic operations. Compared with ordinary GA, QGA can maintain population diversity because of the quantum superposition state, and it simplifies the calculation and reduces the number of operation steps by using the quantum rotation gate. With the help of QGA, we can generate appropriate weights for each DT, so as to further improve the performance of the ensemble model.

In the evolution process of QGA, the individuals in each iteration are updated toward the current optimal solution. If the current optimal solution is only a local optimum, the algorithm may eventually become trapped in it. Therefore, we design a multi-population strategy. The individuals of each population evolve toward the optimal solution of their own population, which reduces the possibility that the algorithm finally falls into a local optimum. In addition, an immigration operation is designed to exchange the individuals with the largest and the smallest fitness between populations, which helps avoid premature convergence of each population.
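One possible reading of the immigration operation is sketched below in Python. The ring ordering (population p receives from population p - 1) and the function name `immigrate` are assumptions for illustration; the source says only that the fittest and least fit individuals are exchanged between populations.

```python
import copy


def immigrate(populations, fitness):
    """Replace the least fit individual of each population with a copy of
    the fittest individual of the previous population (ring order, assumed)."""
    # Pick all migrants first so later replacements do not affect the choice.
    best = [max(pop, key=fitness) for pop in populations]
    for p, pop in enumerate(populations):
        migrant = copy.deepcopy(best[p - 1])  # p = 0 receives from the last population
        worst_idx = min(range(len(pop)), key=lambda i: fitness(pop[i]))
        pop[worst_idx] = migrant
    return populations


# Toy example: individuals are plain numbers and fitness is the value itself.
populations = [[0.1, 0.5], [0.2, 0.9], [0.3, 0.4]]
populations = immigrate(populations, fitness=lambda ind: ind)
```

In the toy run, each population loses its weakest member and gains the strongest member of its neighbor, which injects diversity without discarding any population's own best individual.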

Fig. 2 An overview of En-DPtree

When we use MPQGA to optimize the M weights of base classifiers, each chromosome is coded by qubits as follows:

$$\begin{aligned} q_{j}^{t}=\begin{bmatrix} \alpha _{11}^{t} &{}\cdots &{}\alpha _{1k}^{t} &{} \alpha _{21}^{t} &{}\cdots &{}\alpha _{2k}^{t} &{}\cdots &{} \alpha _{M1}^{t} &{} \cdots &{} \alpha _{Mk}^{t} \\ \beta _{11}^{t} &{}\cdots &{}\beta _{1k}^{t} &{} \beta _{21}^{t} &{}\cdots &{}\beta _{2k}^{t} &{}\cdots &{} \beta _{M1}^{t} &{} \cdots &{} \beta _{Mk}^{t} \end{bmatrix}, \end{aligned}$$
(9)

where \( q_{j}^{t}\) is the jth chromosome in the population of the tth generation, k represents the number of qubits coding each gene, and M is the number of genes on the chromosome which is equal to the number of weights to be optimized. The qubit code of each individual in the population is initialized to

$$\begin{aligned} \begin{bmatrix}\alpha \\ \beta \end{bmatrix}=\begin{bmatrix} \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \end{bmatrix}, \end{aligned}$$
(10)

which means that all possible states expressed by each chromosome are equally probable. Besides, the prediction output of each DPtree \(T_{i}\ (i=1,2,\ldots ,M)\) on sample x is expressed as a C-dimensional vector \((T_{i}^{1}(x), T_{i}^{2}(x),\ldots , T_{i}^{C}(x))\), where \(T_{i}^{c}(x)\) represents the output of \(T_{i}\) on category c. Then, the fitness function in the MPQGA is as follows:

$$\begin{aligned} f(w)=\frac{1}{E}, \end{aligned}$$
(11)

where

$$\begin{aligned} \begin{aligned}&E =\frac{1}{|D_{tr}|}\sum _{(x,y)\in D_{tr}} \mathbb {1}(T(x)\ne y),\\&T(x)=C_{\arg \max _{c}\sum _{i=1}^{M}w_{i}T_{i}^{c}(x)}. \end{aligned} \end{aligned}$$
(12)

E and T(x) are the error and the predicted label of the ensemble model, respectively. The ensemble model based on MPQGA is shown in Algorithm 3, and an overview of it is shown in Fig. 2.
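Equations (9) to (12) can be tied together in a short Python sketch: initialize a qubit chromosome, observe it to obtain bits, decode the bits into weights, and score the weights with f(w) = 1/E. The observation rule (a qubit collapses to 1 with probability \(\beta ^{2}\)) is the common QGA convention. The binary-fraction decoding and the handling of E = 0 are assumptions added here, and all function names are hypothetical.

```python
import math
import random


def init_chromosome(m, k):
    """m genes with k qubits each; every (alpha, beta) pair is set to
    (1/sqrt(2), 1/sqrt(2)) as in Eq. (10)."""
    a = 1 / math.sqrt(2)
    return [[(a, a) for _ in range(k)] for _ in range(m)]


def observe(chromosome, rng):
    """Collapse each qubit to a bit: 1 with probability beta^2, else 0."""
    return [[1 if rng.random() < beta ** 2 else 0 for _, beta in gene]
            for gene in chromosome]


def decode_weights(bits):
    """Assumed decoding: read each gene's k bits as a fraction in [0, 1]."""
    return [int("".join(map(str, gene)), 2) / (2 ** len(gene) - 1)
            for gene in bits]


def ensemble_predict(weights, tree_outputs):
    """T(x) of Eq. (12): index c maximizing sum_i w_i * T_i^c(x).

    tree_outputs[i][c] holds T_i^c(x); ties go to the lowest class index.
    """
    n_classes = len(tree_outputs[0])
    scores = [sum(w * out[c] for w, out in zip(weights, tree_outputs))
              for c in range(n_classes)]
    return max(range(n_classes), key=scores.__getitem__)


def fitness(weights, outputs_per_sample, labels):
    """f(w) = 1 / E of Eq. (11); returns inf when E = 0 (added assumption)."""
    wrong = sum(ensemble_predict(weights, outs) != y
                for outs, y in zip(outputs_per_sample, labels))
    error = wrong / len(labels)
    return math.inf if error == 0 else 1 / error


# Toy data: 3 trees, 2 classes, one-hot votes for 3 samples.
outputs = [
    [[1, 0], [0, 1], [0, 1]],  # only tree 0 votes for the true class 0
    [[0, 1], [1, 0], [1, 0]],  # only tree 0 votes for the true class 1
    [[1, 0], [1, 0], [1, 0]],  # all trees agree on the true class 0
]
labels = [0, 1, 0]

rng = random.Random(0)
chromosome = init_chromosome(m=3, k=4)
bits = observe(chromosome, rng)
weights = decode_weights(bits)
uniform_fitness = fitness([1, 1, 1], outputs, labels)
skewed_fitness = fitness([1, 0, 0], outputs, labels)
```

In this toy data, uniform weights misclassify two of the three samples, while putting all weight on the first tree classifies every sample correctly. This is the kind of gap the weight search is meant to exploit.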

Theorem 4

En-DPtree satisfies \(\epsilon \)-DP.

Proof

We have proven that DPtree satisfies \(\epsilon \)-DP. Each private tree depends only on its own training subset, and different trees are built on disjoint data subsets. According to the parallel composition property of DP, the privacy budget consumed by all the base learners together is still \(\epsilon \). Moreover, because DP is immune to post-processing, the voting step of the ensemble model does not weaken the privacy guarantee. Therefore, En-DPtree satisfies \(\epsilon \)-DP. \(\square \)
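The budget arithmetic behind this argument can be written out using the standard sequential and parallel composition properties of DP. The symbols \(\epsilon _{i}\) (budget of the ith tree) and \(\epsilon _{\text {total}}\) are introduced here only for illustration:

$$\begin{aligned} \begin{aligned}&\text {overlapping training sets (sequential composition):}\quad \epsilon _{\text {total}}=\sum _{i=1}^{M}\epsilon _{i}=M\cdot \frac{\epsilon }{M}=\epsilon ,\\&\text {disjoint training sets (parallel composition):}\quad \epsilon _{\text {total}}=\max _{1\le i\le M}\epsilon _{i}=\epsilon . \end{aligned} \end{aligned}$$

For a hypothetical setting with \(M=10\) and \(\epsilon =1\), each tree trained on overlapping bootstrap samples could spend only 0.1, whereas each tree trained on a disjoint subset can spend the full budget of 1 while the ensemble still satisfies \(\epsilon \)-DP.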

Algorithm 3 The ensemble model based on MPQGA

Time complexity analysis

Suppose the maximal depth of DPtree is d, the number of attributes is \(|\mathcal {A}|\), and the number of samples is |D|. Selecting the optimal splitting attribute on a branch node has time complexity \(O(|\mathcal {A}|\cdot |D|)\). Besides, the time complexity of privacy budget allocation is O(1). Therefore, the time complexity of DPtree is \(O(d\cdot |\mathcal {A}|\cdot |D|)\). In En-DPtree, M DPtree models need to be constructed, so the time complexity of En-DPtree is \(O(M\cdot d\cdot |\mathcal {A}|\cdot |D|)\).

Performance evaluation

To evaluate the performance of the proposed En-DPtree, we select four datasets from the UCI Machine Learning Repository [42], as shown in Table 1. We use accuracy, F1, and micro-F1 to evaluate the classification ability of the classifiers. In the following, we first analyze the impact of the privacy budget allocation strategy on classification performance by comparing our proposed DPtree and En-DPtree with Maxtree [19], Maxforest [19], En-RDT [21], DiffP-RFs [25], DP-AdaBoost, and AdaBoost. Second, we test the effect of the ensemble learning method based on MPQGA. Finally, we examine how the classification accuracy of En-DPtree changes with the number of iterations of MPQGA.
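For reference, the three metrics can be computed as in the following Python sketch. It uses the standard definitions; presumably F1 applies to the binary tasks and micro-F1 to the multi-class ones, although the section does not say so explicitly. The toy labels are made up for illustration and are not experimental data.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_binary(y_true, y_pred, positive=1):
    """F1 = 2TP / (2TP + FP + FN) for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP, FN over all classes, then apply F1."""
    pairs = list(zip(y_true, y_pred))
    tp = fp = fn = 0
    for c in set(y_true) | set(y_pred):
        tp += sum(t == c and p == c for t, p in pairs)
        fp += sum(t != c and p == c for t, p in pairs)
        fn += sum(t == c and p != c for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up labels, for illustration only.
binary_f1 = f1_binary([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
multi_true = [0, 1, 2, 2, 1, 0]
multi_pred = [0, 2, 2, 2, 1, 1]
multi_acc = accuracy(multi_true, multi_pred)
multi_micro = micro_f1(multi_true, multi_pred)
```

Note that for single-label multi-class prediction, micro-F1 coincides with accuracy, because every misclassified sample counts once as a false positive and once as a false negative.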

Table 1 Specific information about datasets

Comparison with other competitive classifiers

The privacy budget is a measure of the privacy protection capability. The smaller the privacy budget, the stronger the privacy protection capability of the algorithm, but the classification ability is usually reduced because a large amount of noise is introduced. Figures 3 and 4 show the experimental results of DPtree, En-DPtree, Maxtree, Maxforest, En-RDT, DiffP-RFs, DP-AdaBoost, and AdaBoost on the datasets in Table 1 under privacy budgets from 0.1 to 1.

Fig. 3 Impact of privacy budget \(\epsilon \) on classification accuracy on datasets

Fig. 4 Impact of privacy budget \(\epsilon \) on F1 or micro-F1 on datasets

In the binary classification tasks, the classification algorithms perform better on Mushroom than on Adult. A possible reason is that all attributes of Mushroom are discrete, while Adult has both discrete and continuous attributes. Each algorithm needs to discretize the continuous attributes of Adult before classification. The discretization process may lose underlying information or extract features insufficiently, which reduces classification ability. In addition, if an algorithm consumes privacy budget while discretizing continuous attributes, less budget remains for selecting split attributes, which lowers the classification accuracy of the algorithm.

No differential privacy is involved in the AdaBoost algorithm, so its classification capability does not change with the privacy budget. It can be seen from Figs. 3 and 4 that as the privacy budget increases, the classification ability of each classifier based on differential privacy increases overall. This is because the noise in each classifier decreases when the privacy budget increases. However, a larger privacy budget also weakens the classifier’s privacy protection. Therefore, in practical applications, we usually need to sacrifice some classification accuracy to ensure the security of the classification model. As the privacy budget increases, the classification ability of En-RDT improves rapidly. Compared with other algorithms, the classification ability of En-RDT is relatively high on Mushroom and Nursery but relatively low on Adult and Car. The reason for this instability may be that every DT of En-RDT is constructed by randomly selecting split attributes. Besides, Maxtree and DPtree have relatively poor classification ability. The main reason is that they are single classifiers, and the introduction of DP adds a lot of noise to the node generation process of the trees, which degrades their results. The classification ability of Maxforest and En-DPtree is greatly improved relative to Maxtree and DPtree. Ensemble learning combines the learning results of multiple base learners through an appropriate voting scheme, so it can usually improve on the classification ability of a single base learner. The larger the privacy budget \(\epsilon \), the closer En-DPtree’s classification capability is to that of AdaBoost. On the whole, En-DPtree has the strongest classification ability among the compared models based on differential privacy, which shows that our improvements to Maxtree are effective.

Effect of the weighted voting scheme based on MPQGA

To examine the necessity of the weighted voting scheme based on MPQGA, we perform experiments with DPtree, DPtree-1, DPtree-2, and En-DPtree on the above datasets. DPtree-1 and DPtree-2 denote ensemble variants whose voting schemes are based on majority voting and the original QGA, respectively.

The classification results are shown in Figs. 5 and 6. The classification ability of DPtree-1, DPtree-2, and En-DPtree is always better than that of DPtree, so it is necessary to design an ensemble learning process for DPtree. Moreover, En-DPtree has the strongest classification ability, and DPtree-2 performs better than DPtree-1 on these datasets. QGA can search for appropriate weights for the base classifiers through its evolution process, so DPtree-2 obtains better classification ability than DPtree-1. Besides, compared with QGA, which has only one population in the evolutionary process, MPQGA evolves multiple populations. As a result, MPQGA has stronger global search capability and helps En-DPtree obtain the strongest classification ability. These results indicate that the improvements we designed in En-DPtree are effective.

Fig. 5 Impact of the voting scheme on classification accuracy on datasets

Fig. 6 Impact of the voting scheme on F1 or micro-F1 on datasets

Effect of the number of iterations in MPQGA

To illustrate the effect of the number of iterations of MPQGA on finding better weights, we apply En-DPtree to the above datasets, as shown in Figs. 7 and 8. The horizontal axis is the number of iterations of MPQGA, and the vertical axis is the classification accuracy, F1, or micro-F1 of En-DPtree. It can be seen that as the number of iterations increases, the classification ability of En-DPtree shows an overall upward trend. In addition, the classification ability of En-DPtree fluctuates greatly when the number of iterations is low. A possible reason is that MPQGA cannot find a stable evolution point in the early stage of iteration and still needs a series of gene changes to reach a stable state. In general, MPQGA can find suitable weights after a sufficient number of iterations. Therefore, using MPQGA in En-DPtree can search for appropriate weights for the base classifiers and thereby improve the classification ability of En-DPtree.

Fig. 7 Impact of MPQGA iterations on classification accuracy on datasets

Fig. 8 Impact of MPQGA iterations on F1 or micro-F1 on datasets

Conclusion

By analyzing the problems existing in Maxtree, we propose a new privacy budget allocation strategy and introduce quantum ensemble learning to form the En-DPtree scheme. Compared with existing works, we introduce the Fayyad theorem to quickly locate the best partition points of continuous attributes. In addition, we design an adaptive privacy budget allocation strategy for each leaf node according to its sample category distribution, which not only ensures that the classification result of each leaf node is not distorted but also improves the privacy protection ability. Moreover, we design MPQGA to generate a better set of weights for the ensemble model and thereby improve the classification performance of En-DPtree. Finally, we carry out several experiments on the datasets. The experimental results show that the classification ability of En-DPtree is usually superior to that of the other compared classifiers. Besides, En-DPtree can obtain suitable weights after a certain number of genetic iterations.