1 Introduction

Frequent itemset mining can extract valuable knowledge from massive data, but mining sensitive data may disclose private information about individuals. For example, analyzing search logs reveals users' page-click behavior, from which their private interests can be inferred. It is therefore necessary to introduce a privacy protection mechanism into frequent itemset mining.

Differential privacy [1, 2] is a privacy protection technology that adds noise to query requests or analysis results. Its guarantee does not depend on the attacker's background knowledge, and it ensures that adding or removing one transaction has little effect on the query results.

Research on frequent itemset mining under differential privacy has made considerable progress. Bhaskar et al. [3] applied the Laplace mechanism (LM) to compute noisy supports of all possible frequent itemsets and then published the top-k frequent itemsets with the highest noisy supports. Zeng et al. [4] analyzed the effect of transaction length on global sensitivity and proposed transaction truncation together with a heuristic method. Zhang et al. [5] adopted the exponential mechanism (EM) to select the top-k frequent itemsets and, to improve the availability of the noisy supports, proposed a consistency-constraint technique.

An effective differentially private frequent itemset mining algorithm should first guarantee a given level of privacy and then improve the availability of the frequent itemsets as far as possible. According to the sensitivity for counting occurrences (SCO), the Laplace noise is proportional to the maximum transaction length, so shortening long transactions is the key issue for a transaction database. Truncation reduces noisy errors, but it also discards items and thereby introduces truncation errors. The challenge is therefore how to balance noisy errors against truncation errors. The main contributions of this paper are as follows.

  1. In order to improve privacy protection of frequent itemsets, we propose the algorithm FI-DPTT, which perturbs the real supports of the top-k frequent itemsets with Laplace noise.

  2. In order to improve the availability of frequent itemsets under differential privacy, we propose a quality function for EM that balances noisy errors and truncation errors; it draws on the idea of the median to find the optimal transaction length.

2 Preliminaries

2.1 Differential Privacy

Definition 1 (Neighboring Databases).

Two transaction databases D1 and D2 are neighboring databases if and only if one can be obtained from the other by adding or removing one transaction, that is, \( |\text{D}_{1} - \text{D}_{2} |\; = \;1 \).

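Definition 2

(ε-Differential Privacy [1]). Let \( \mathcal{A} \) be a privacy protection algorithm. \( \mathcal{A} \) satisfies ε-differential privacy if and only if, for any pair of neighboring databases D1 and D2 and any output O of \( \mathcal{A} \), we have:

$$ \Pr [\mathcal{A}(\text{D}_{1} )\; = \;\text{O}]\; \le \;e^{\varepsilon } \times \Pr [\mathcal{A}(\text{D}_{2} )\; = \;\text{O}] $$
(1)

In the above definition, \( \Pr [\mathcal{A}(\text{D})\; = \;\text{O}] \) denotes the probability that \( \mathcal{A} \) outputs O on database D, and ε is called the privacy budget, which controls the strength of privacy protection. A smaller ε gives stricter privacy protection, and a larger ε gives weaker protection.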

2.2 Noisy Mechanism

Definition 3

(Global Sensitivity [1]). Given a query function Q with numerical outputs O, the global sensitivity of Q is ΔQ:

$$ \Delta \text{Q}\;\text{ = }\;\text{max}_{{\text{D}_{1} ,\;\text{D}_{2} }} \;\left| {\text{Q}(\text{D}_{1} )\; - \;\text{Q(D}_{2} \text{)}} \right| $$
(2)

Here D1 and D2 range over all pairs of neighboring databases, and ΔQ denotes the maximum distance between Q(D1) and Q(D2). The global sensitivity depends only on the query Q, not on any particular transaction database.

Definition 4

(Sensitivity for Counting Occurrences (SCO) [6]). Given a transaction database D whose longest transaction has length lmax, and a query Q = {p1, p2, …, pn} that counts the occurrences in D of each itemset pi whose length lies in the range I = [Qmin, Qmax], the global sensitivity is ΔQ = ΔI × lmax, where ΔI = Qmax − Qmin + 1.

By Definition 4, the SCO is proportional to the maximum transaction length. Even a single long transaction therefore forces us to add a large amount of Laplace noise to the frequent itemsets.
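The effect of a single long transaction can be made concrete with a small calculation. The sketch below only evaluates the formula of Definition 4; the function name `sco_sensitivity` and the transaction lengths are illustrative assumptions, not part of the original paper.

```python
def sco_sensitivity(l_max: int, q_min: int, q_max: int) -> int:
    """Global sensitivity from Definition 4: delta_Q = delta_I * l_max."""
    if l_max < 0 or q_min < 1 or q_max < q_min:
        raise ValueError("expected l_max >= 0 and 1 <= q_min <= q_max")
    delta_i = q_max - q_min + 1  # number of itemset lengths covered by the query
    return delta_i * l_max


# Illustrative transaction lengths: four short records and one long one.
lengths = [3, 4, 2, 5, 60]

# The sensitivity is driven by the longest record alone.
before = sco_sensitivity(max(lengths), q_min=1, q_max=3)  # 3 * 60
# If every record were truncated to at most 5 items, it would drop sharply.
after = sco_sensitivity(5, q_min=1, q_max=3)              # 3 * 5
print(before, after)
```

With these made-up lengths, one outlier record raises the sensitivity by a factor of twelve, which is the motivation for truncation in Section 3.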

Definition 5

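(Laplace Mechanism (LM) [7]). Given a query Q(D) → O, if the output of algorithm \( \mathcal{A} \) satisfies Eq. (3), then \( \mathcal{A} \) enforces ε-differential privacy.

$$ \mathcal{A}(\text{D})\; = \;\text{Q}(\text{D})\; + \;\left\langle {\text{Lap}_{1} (\Delta \text{Q}/\varepsilon ),\; \ldots ,\;\text{Lap}_{n} (\Delta \text{Q}/\varepsilon )} \right\rangle $$
(3)

The Lapi(ΔQ/ε) (1 ≤ i ≤ n) are mutually independent Laplace noise variables with scale parameter ΔQ/ε, so the noise is proportional to ΔQ and inversely proportional to ε. The idea is to add Laplace noise to the real output values for privacy protection.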
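As a minimal sketch of Definition 5, the following code adds independent Laplace noise with scale ΔQ/ε to each query answer. The helper names `laplace_noise` and `laplace_mechanism` and the sample supports are assumptions made for illustration. The sampler relies on the fact that the difference of two independent exponential variables with the same mean follows a Laplace distribution.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a zero-centred Laplace distribution with the given scale.

    The difference of two independent exponential variables with mean `scale`
    is Laplace-distributed with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(true_values, sensitivity, epsilon, rng=None):
    """Add independent Lap(sensitivity / epsilon) noise to each query answer."""
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    if rng is None:
        rng = random.Random()
    scale = sensitivity / epsilon
    return [value + laplace_noise(scale, rng) for value in true_values]


# Illustrative true supports; the seed only makes the run repeatable.
supports = [120, 95, 80]
noisy = laplace_mechanism(supports, sensitivity=5, epsilon=1.0, rng=random.Random(0))
print(noisy)
```

Halving ε or doubling the sensitivity doubles the noise scale, which matches the proportionality stated in the definition.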

Definition 6

(Exponential Mechanism (EM) [8]). Given a quality function u(p, D), if algorithm \( \mathcal{A} \) satisfies Eq. (4), then \( \mathcal{A} \) enforces ε-differential privacy.

$$ \mathcal{A}(\text{D},\;u)\; = \;\left\{ {p:\;\Pr [p \in \text{O}] \propto \exp \left( {\frac{\varepsilon \times u(p,\;\text{D})}{2 \times \Delta u}} \right)} \right\} $$
(4)

Here Δu denotes the global sensitivity of the quality function u(p, D), and p denotes an item selected from the output domain O. The key point is how to design the quality function u(p, D). A larger \( \exp \left( {\frac{\varepsilon \times u(p,\;D)}{2 \times \Delta u}} \right) \) gives p a higher probability of being selected as the output.
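The selection rule of Definition 6 can be sketched as weighted sampling, where each candidate p receives weight exp(ε × u(p, D) / (2 × Δu)). The function name `exponential_mechanism`, its parameters and the toy quality scores below are illustrative assumptions. Subtracting the maximum score before exponentiating is a numerical safeguard that leaves the selection probabilities unchanged.

```python
import math
import random


def exponential_mechanism(candidates, quality, epsilon, delta_u, rng=None):
    """Pick one candidate with probability proportional to
    exp(epsilon * quality(p) / (2 * delta_u))."""
    if rng is None:
        rng = random.Random()
    scores = [quality(p) for p in candidates]
    top = max(scores)
    # Shifting all scores by the maximum rescales every weight by the same
    # factor, so the probabilities are unchanged while exp() cannot overflow.
    weights = [math.exp(epsilon * (s - top) / (2.0 * delta_u)) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Toy quality scores (hypothetical): higher is better.
toy_quality = {"x": 1.0, "y": 3.0, "z": 2.0}
picked = exponential_mechanism(
    list(toy_quality), toy_quality.get, epsilon=1.0, delta_u=1.0, rng=random.Random(7)
)
print(picked)
```

With a small ε the three candidates are chosen almost uniformly; as ε grows, the choice concentrates on the candidate with the highest quality.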

2.3 Availability Analysis

Definition 7

(False Negative Rate (FNR) [5]). Let TPk(D) be the top-k frequent itemsets in the original database D and TPk(Dt) the top-k frequent itemsets obtained from the truncated database Dt. FNR measures the fraction of the real top-k frequent itemsets that are in TPk(D) but not in TPk(Dt). A smaller FNR means higher data accuracy.

$$ \text{FNR} = \;\frac{{|\text{TP}_{\text{k}} (\text{D})\; \cup \;\text{TP}_{\text{k}} (\text{D}_{\text{t}} ) - \;\text{TP}_{\text{k}} (\text{D}_{\text{t}} )|}}{\text{k}} $$
(5)
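Eq. (5) reduces to counting the real top-k itemsets that are missing from the published result, since |TPk(D) ∪ TPk(Dt) − TPk(Dt)| equals |TPk(D) − TPk(Dt)|. A minimal sketch follows, in which the function name and the representation of itemsets as `frozenset` objects are assumptions made for illustration.

```python
def false_negative_rate(true_topk, published_topk):
    """Fraction of the true top-k itemsets that are missing from the published top-k."""
    true_set = set(true_topk)
    published = set(published_topk)
    k = len(true_set)
    if k == 0:
        raise ValueError("true_topk must not be empty")
    # Itemsets in TP_k(D) that do not appear in TP_k(D_t).
    return len(true_set - published) / k


# Illustrative itemsets over the items a, b, c, d.
true_topk = [frozenset("a"), frozenset("b"), frozenset("ab"), frozenset("c")]
published_topk = [frozenset("a"), frozenset("b"), frozenset("d"), frozenset("bc")]
print(false_negative_rate(true_topk, published_topk))  # 2 of 4 missing -> 0.5
```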

Definition 8

(Average Relative Error (ARE) [5]). ARE measures the error introduced by adding Laplace noise to the top-k frequent itemsets of database D. Here TC(pi, TPk(D)) denotes the real support of the frequent itemset pi in database D, and NC(pi, TPk(Dt)) denotes the noisy support of pi. If pi is not in TPk(Dt), we set NC(pi, TPk(Dt)) = 0. A smaller ARE means higher data accuracy.

$$ \text{ARE}\; = \;\frac{1}{\text{k}}\sum\limits_{{\text{p}_{\text{i}} \; \in \;\text{TP}_{\text{k}} (\text{D})}} {\frac{{|\text{TC}(\text{p}_{\text{i}} ,\;\text{TP}_{\text{k}} (\text{D}))\; - \;\text{NC}(\text{p}_{\text{i}} ,\;\text{TP}_{\text{k}} (\text{D}_{\text{t}} ))|}}{{\text{TC}(\text{p}_{\text{i}} ,\;\text{TP}_{\text{k}} (\text{D}))}}} $$
(6)
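Eq. (6) can be evaluated directly once the real and noisy supports are available. In the sketch below, the function name and the use of dictionaries keyed by itemset are illustrative assumptions. An itemset absent from the published result is given a noisy support of 0, as Definition 8 prescribes.

```python
def average_relative_error(true_supports, noisy_supports):
    """Average relative error over the true top-k itemsets.

    true_supports:  dict mapping each true top-k itemset to its real support (> 0).
    noisy_supports: dict mapping each published itemset to its noisy support.
    """
    k = len(true_supports)
    if k == 0:
        raise ValueError("true_supports must not be empty")
    total = 0.0
    for itemset, tc in true_supports.items():
        nc = noisy_supports.get(itemset, 0.0)  # missing itemsets count as 0
        total += abs(tc - nc) / tc
    return total / k


# Illustrative supports: "b" was lost, so its relative error is 1.
true_supports = {"a": 100, "b": 50}
noisy_supports = {"a": 90.0}
print(average_relative_error(true_supports, noisy_supports))
```

In this made-up example the two relative errors are 0.1 and 1.0, so a single lost itemset dominates the average.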

3 Proposed Algorithm

3.1 Idea of Transaction Truncation

We first define the optimal transaction length. The total error is the sum of the noisy error and the truncation error. Suppose we truncate the original transaction database D into a transaction database Dt. If the total error of the frequent itemsets generated from Dt under ε-differential privacy is smaller than that of any other truncated database, then the longest transaction length in Dt is the optimal transaction length of D.

3.2 Algorithm Description

figure d

To reduce the truncation errors introduced when a transaction database is truncated, the Apriori method is performed first to obtain the candidate frequent 1-itemsets and their supports, and the items of each transaction are then sorted in descending order of support to obtain the database \( \text{D}' \) (Step 1). The privacy budget ε (Step 2) is split evenly between ε1 (Step 3) and ε2 (Step 5). The database \( \text{D}' \) is truncated into Dt according to lopt (Step 4).
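Steps 1 and 4 can be sketched as follows. This is not the authors' implementation: the function name `reorder_and_truncate` is hypothetical, single-item supports are counted directly rather than through a full Apriori pass, and ties between equally frequent items are broken by item name only to make the output deterministic.

```python
from collections import Counter


def reorder_and_truncate(transactions, max_len):
    """Sort each transaction by descending item support, then keep the first max_len items."""
    # Support of a single item = number of transactions that contain it.
    support = Counter(item for t in transactions for item in set(t))
    truncated = []
    for t in transactions:
        # Most frequent items come first, so truncation drops the rarest ones.
        ordered = sorted(set(t), key=lambda item: (-support[item], item))
        truncated.append(ordered[:max_len])
    return truncated


# Illustrative database: supports are a=4, b=2, c=2, d=1.
db = [["d", "c", "b", "a"], ["b", "a"], ["c", "a"], ["a"]]
print(reorder_and_truncate(db, max_len=2))
# [['a', 'b'], ['a', 'b'], ['a', 'c'], ['a']]
```

Because the rarest items are at the tail of each reordered transaction, truncation removes the items that are least likely to belong to a frequent itemset, which is why the reordering limits truncation errors.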

3.3 Interpretation of Important Processes

For the algorithm FI-DPTT, two important procedures are interpreted as follows.

figure e

Procedure SelectOptLen draws on the property of the median [9], which describes the central tendency of the transaction records and is rarely influenced by extreme values. We scan the database \( \text{D}' \) to obtain the length of each transaction and then adopt EM to select lopt. The quality function is \( \text{u}(\text{t},\;\text{D}')\; = \;\frac{{\text{count}_{\text{t}} }}{{|\text{rank}(\text{t})\; - \;\text{SCALE} \times |\text{D}'||}} \); if rank(t) = SCALE × |\( \text{D}' \)|, we set u(t, \( \text{D}' \)) = 2 × countt. Here countt denotes the support of the last item in the current transaction record, and rank(t) denotes the position of t when the transactions of \( \text{D}' \) are ranked in ascending order. Adding or removing one transaction record from \( \text{D}' \) changes u(t, \( \text{D}' \)) by at most one, so the global sensitivity of the quality function is Δu(t, \( \text{D}' \)) = 1.
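The following is one possible reading of SelectOptLen, given as a hedged sketch rather than the authors' procedure. It assumes that rank(t) is the 1-based position of t when the transactions are sorted by length (the text only says "ascending order"), that each transaction is already sorted by descending item support, and that lopt is the length of the transaction chosen by EM. The function name `select_opt_len` and its parameters are hypothetical.

```python
import math
import random
from collections import Counter


def select_opt_len(transactions, scale, epsilon, rng=None):
    """Choose a truncation length via the exponential mechanism (illustrative sketch)."""
    if rng is None:
        rng = random.Random()
    support = Counter(item for t in transactions for item in set(t))
    # Assumption: rank transactions by length, shortest first, ranks start at 1.
    ranked = sorted((t for t in transactions if t), key=len)
    target = scale * len(ranked)
    scores = []
    for rank, t in enumerate(ranked, start=1):
        count_t = support[t[-1]]  # support of the last item in the record
        gap = abs(rank - target)
        if math.isclose(gap, 0.0, abs_tol=1e-9):
            scores.append(2.0 * count_t)  # special case stated in the text
        else:
            scores.append(count_t / gap)
    delta_u = 1.0  # sensitivity of the quality function as stated in the text
    top = max(scores)
    weights = [math.exp(epsilon * (s - top) / (2.0 * delta_u)) for s in scores]
    chosen = rng.choices(ranked, weights=weights, k=1)[0]
    return len(chosen)


# Illustrative database with items already sorted by descending support.
db = [["a"], ["a", "b"], ["a", "b", "c"], ["a", "b", "c", "d"]]
print(select_opt_len(db, scale=0.5, epsilon=1.0, rng=random.Random(3)))
```

The quality score peaks for transactions whose rank is close to SCALE × |D′|, so SCALE acts as a quantile: with SCALE = 0.5 the mechanism favors lengths near the median, and a larger SCALE shifts the preference toward longer transactions.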

figure f

Procedure Perturb-Frequency generates the frequent itemsets and adds Laplace noise to their real supports. Let c(pi) be the real support of a frequent itemset pi and ct(pi) its support after Laplace noise is added. lopt is the global sensitivity of the frequent itemsets in the database Dt.

4 Experimental Evaluation

4.1 Experimental Setting

This section compares the data availability of the FI-DPTT algorithm with that of DP-topkP [5]. The experimental environment is an Intel Core i5-2410M CPU at 2.30 GHz with 4 GB of memory running Windows 7, and the datasets are PUMSB-STAR, RETAIL and KOSARAK [10] (Table 1). FNR and ARE are used as the evaluation metrics. Each experiment is repeated five times and the average is reported.

Table 1. Description of three datasets.

4.2 Experimental Result Analysis

We fix k = 100 and ε = 1.0 to analyze the impact of SCALE on availability. As Fig. 1 shows, SCALE = 0.85 gives the best availability. A smaller SCALE increases truncation errors and reduces noisy errors, so the total error tends to increase, and vice versa. Furthermore, truncation errors affect availability more than noisy errors do. We fix SCALE = 0.85 in the follow-up experiments.

Fig. 1. The relationship between SCALE and availability

We fix k = 100 to analyze the impact of ε on availability in Fig. 2. When ε < 1, FI-DPTT achieves a lower FNR than DP-topkP because we give priority to reducing truncation errors, and FNR is related only to truncation errors. FI-DPTT also achieves a lower ARE than DP-topkP. Since ARE is related to both truncation errors and noisy errors, it is larger than FNR. These results show that FI-DPTT provides better availability than DP-topkP.

Fig. 2. The relationship between ε and availability when ε changes

In Fig. 3, we fix ε = 1 to analyze the impact of k on availability. As k increases, the availability of both algorithms decreases, because a larger k leads to a smaller threshold λ, which increases both truncation errors and noisy errors.

Fig. 3. The relationship between k and availability when k changes in the dataset KOSARAK

5 Conclusion

Long transactions in a transaction database reduce the availability of frequent itemsets under differential privacy. The algorithm FI-DPTT combines the exponential mechanism with the Laplace mechanism. To improve the availability of frequent itemsets under differential privacy, a quality function for the exponential mechanism is designed to balance truncation errors and noisy errors, and Laplace noise is then added to the real supports of the frequent itemsets. In the experiments, the proposed algorithm achieves better data availability than DP-topkP while providing ε-differential privacy.