1 Introduction

Comprehensive efforts have been made to establish an intelligent urban environment by developing sophisticated digital twins. Digital twins encapsulate the diverse functionalities of urban landscapes and accurately represent the behavioral patterns of their inhabitants. By creating a virtual counterpart to a city from data gathered in the real world, a myriad of urban-related events and processes, including disaster mitigation, transportation management, and pandemic response, are simulated within the digital realm. Subsequently, the insights gleaned from these simulations can be fed back into the physical world to facilitate well-informed decision-making and foster more sustainable urban development.

However, the construction of intricate digital twins and the realization of a truly smart city require the collection of detailed information about the behaviors and characteristics of individuals within the physical world. For instance, to model the specific attributes that people possess and the actions they undertake, a variety of personal attributes, such as age, gender, occupation, and income levels, must be gathered and analyzed [47]. Consequently, the protection of privacy has emerged as a critical concern that must be addressed in order to implement digital twin technology and achieve the vision of a smart city [28].

During the collection of information pertaining to an individual’s attributes and behaviors in relation to their environment, it is crucial to ensure that the privacy of each person is adequately protected. Additionally, as humans are inherently social beings who interact with one another, the development of a human-centric digital twin requires careful consideration of the interactions between individuals and the associated information about each person [61]. Moreover, accounting for the potential measurement noise and missing values that may arise from sensing errors is essential when addressing individual privacy [55, 60].

Regrettably, existing privacy-preserving data mining solutions have neglected to consider the impact of measurement noise and missing values, which has led to low accuracy in data analysis. Furthermore, the lack of consideration for human interactions has resulted in increased privacy leakage beyond anticipated levels. This chapter aims to address three primary concerns: the loss of accuracy due to missing data, the loss of accuracy caused by observation noise, and the heightened privacy leakage that results from human interaction. These challenges are particularly pronounced in the context of a smart city environment. The content of this chapter is grounded in the author’s previous publications [52, 54, 55, 60]. Several other issues concerning LDP for smart cities are addressed in the author’s other articles [56, 57, 59].

In this chapter, local differential privacy (LDP) [12] serves as the principal metric for evaluating privacy. LDP is a highly significant privacy-preserving technique that has been widely adopted to protect user data while enabling meaningful analysis. As a variant of differential privacy [11], LDP offers robust privacy guarantees for individual data points by introducing randomness directly at the data source, prior to any data being shared with an aggregator or analyst. Several prominent examples of LDP in action can be found in industry applications. For instance, Apple leverages LDP in its data collection processes to ensure user information remains private and secure [3]. Similarly, Google employs LDP in its RAPPOR (Randomized Aggregatable Privacy-Preserving Ordinal Response) project, which collects anonymized statistics from user browsers while preserving privacy [13]. The definition of LDP will be detailed in Sect. 2.3.

1.1 Purpose of This Research

The goal is to safely obtain people’s attributes and behaviors under LDP and to analyze them with high statistical accuracy to realize smart cities. Advanced smart cities do not need each individual’s detailed personal information; global (aggregate) statistical information is sufficient. Privacy-protection laws have been enacted in many countries, such as Japan’s Personal Information Protection Law and Europe’s General Data Protection Regulation (GDPR). Therefore, it is necessary not to violate people’s privacy. On the other hand, it is impossible to achieve privacy protection that is 100% safe. LDP can control the amount of privacy leakage by adjusting the value of \(\epsilon \), which represents privacy loss. This \(\epsilon \) value can be specified by system administrators or by individuals. Within this limit, personal data are collected from people and statistically analyzed. Research on LDP has been actively conducted in the past decade, but as mentioned in Sect. 1, several challenges remain. The main goal of this chapter is to solve the following challenges.

  • Treating measurement noise under LDP

  • Treating missing values under LDP

  • Treating human-to-human interactions under LDP.

Whereas the first two challenges have an effect on the accuracy of statistical analysis, the third is related to privacy leakage.

1.2 Structure of This Chapter

Section 2 commences with a presentation of motivational examples that emphasize the necessity of personal information for realizing advanced smart cities while concurrently underscoring the importance of privacy protection. This section also acknowledges that personal information is frequently gathered from sensors integrated into IoT systems and smartphones, and that this may result in inaccurate or missing data. Lastly, this section introduces the privacy protection metric utilized throughout the chapter, which is LDP.

Section 3 explores the treatment of observational error in an LDP context. Although privacy-preserving data mining has been investigated extensively over the past decade, limited attention has been devoted to error in data values. LDP can be achieved by adding privacy noise to a target value that should be protected. However, if the target value already contains measurement error, the amount of privacy noise to add can be reduced. This section proposes a novel privacy model called true-value-based differential privacy (TDP). This model applies traditional differential privacy to the “true value”, which is not known by the data owner or anonymizer, but not to the “measured value” that contains error. By leveraging TDP, our solution reduces the amount of noise to be added by LDP techniques by approximately 20%. Consequently, the error of generated histograms is reduced by 40.4% and 29.6% on average.

Section 4 discusses the processing of missing values in an LDP setting. Privacy-preserving data mining techniques are valuable for analyzing diverse types of information, such as COVID-19-related patient data. Nonetheless, collecting substantial amounts of sensitive personal information poses a challenge. Moreover, this information may contain missing values, and this fact is not considered in existing methods that ensure data privacy while collecting personal information. Neglecting missing values diminishes the accuracy of data analysis. In this section, we propose a method for privacy-preserving data collection that accounts for various types of missing values. Patient data are anonymized and transmitted to a data collection server. The data collection server generates a generative model and a contingency table suitable for multi-attribute analysis based on expectation-maximization and Gaussian copula methods. We conduct experiments on synthetic and real data, including COVID-19-related data. The results are 50–80% more accurate than those of existing methods that do not consider missing values.

Section 5 examines the management of human interactions in an LDP environment. Under LDP, a privacy budget is allocated to each user. Each time a user’s data are collected, some of the user’s privacy budget is consumed, and their privacy is protected by ensuring that the remaining privacy budget is greater than or equal to zero. Organizations and previous studies assume that an individual’s data are entirely unrelated to other individuals’ data. However, this assumption is invalid in situations where data for an interaction between two or more users are collected from those users. In such cases, each user’s privacy is inadequately protected because their privacy budget is, in fact, overspent. In this study, we clarify the problem of LDP for person-to-person interactions. We propose a mechanism that satisfies LDP in a person-to-person interaction scenario. Mathematical analysis and experimental results demonstrate that the proposed mechanism maintains higher data utility while ensuring LDP than do existing methods.

2 Background

2.1 Motivating Examples

At present, IoT devices are capable of collecting and estimating various kinds of attribute information about individuals, including location, heart rate, health status, age, and movement patterns [70]. By leveraging this attribute data, individuals can access a wide range of services, such as recommender systems for smart cities. Additionally, the data collector can function as a data anonymizer, anonymizing the acquired data and transmitting it to the data receiver (refer to Fig. 1).

Two types of attribute data are considered in this context: the first comprises numerical attributes, such as heart rate measured in beats per minute, whereas the second encompasses categorical attributes, such as disease names (e.g., COVID-19).

The gathered attribute data often contain sensing errors, as accurately sensing and estimating the attributes of individuals can be challenging. In the most unfavorable circumstances, attribute data may not be collectable at all. Missing data can be approximated using techniques like multiple imputation or predictions based on regression models [81]. However, these estimated values tend to exhibit a significant degree of error.

Fig. 1
Data receiver collects user data from people and/or sensing platforms under LDP (network diagram: the data collector/anonymizer receives data from the data owner, CCTVs, and other sensors and sends anonymized data to a server)

In recent years, generative AI technologies such as ChatGPT and Stable Diffusion have undergone rapid advancements. While the majority of training data for these models are publicly sourced from the web, it is anticipated that future generative AI models will increasingly engage in the direct collection and training of data from individuals. The methods proposed in this chapter are particularly well-suited for these emerging scenarios.

2.2 Attack Model

We assume an honest-but-curious adversary. That is, the adversary follows the protocol and rules of the system but attempts to learn as much as possible about individual users from the available data. They do not actively manipulate or tamper with the data but try to exploit the information they can access within the system’s constraints.

Furthermore, each anonymized datum may contain original sensing error or intentionally added noise; therefore, the attacker cannot accurately estimate people’s true data but can estimate the probability distribution of the data.

2.3 Local Differential Privacy (LDP)

In technical terms, LDP is defined as \(\epsilon \)-LDP, where parameter \(\epsilon \) represents a privacy budget. There are several relaxation concepts related to \(\epsilon \)-LDP, such as \((\epsilon ,\delta )\)-LDP and Rényi differential privacy [38]. Although the concepts discussed in this chapter can be applied to these relaxations of LDP, we focus on \(\epsilon \)-LDP to simplify the discussion. \(\epsilon \)-LDP is defined as follows.

Definition 1 (\(\epsilon \)-LDP)

Let X represent the domain of a user’s data, and let Y be an arbitrary set. A randomized mechanism M provides \(\epsilon \)-LDP if and only if for any \(x,x'\in X\) and any \(y\in Y\),

$$\begin{aligned} P(M(x)=y) \le e^\epsilon P(M(x')=y). \end{aligned}$$
(1)

Several techniques have been proposed for achieving LDP. One of the most commonly used techniques is the Laplace mechanism [11]. To introduce the Laplace mechanism, we first define the concept of global sensitivity.

Definition 2 (Global sensitivity)

For a function \(f:X\rightarrow Y\), the global sensitivity of f is defined as follows.

$$\begin{aligned} \Delta f = \max _{x,x'\in X} |f(x)-f(x')|. \end{aligned}$$
(2)

Theorem 1 (Laplace mechanism[11])

Let \(\Delta f\) be the global sensitivity of a function \(f:X\rightarrow Y\) and let \(\mathcal {L}(v)\) represent the Laplace distribution with a mean of zero and scale parameter v. The following mechanism M ensures \(\epsilon \)-LDP.

$$\begin{aligned} M(x) = f(x) + \mathcal {L}(\frac{\Delta f}{\epsilon }). \end{aligned}$$
(3)

In the context of LDP, the magnitude of privacy safeguarding is modulated by the parameter \(\epsilon \). Deliberation on the appropriate selection of this value is beyond the purview of the present discourse; however, strategies such as the automatic determination predicated on the uniqueness of each attribute value [39] may be employed.
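To make the mechanism concrete, the following minimal sketch (in Python, assuming NumPy is available) releases a single bounded numerical value under \(\epsilon \)-LDP using the Laplace mechanism of Theorem 1. The function name and the heart-rate domain are illustrative assumptions, not taken from the chapter.

```python
import numpy as np

def laplace_mechanism(x, lo, hi, epsilon, rng=None):
    """Release a single bounded value under epsilon-LDP (Theorem 1).

    The identity query f(x) = x on the domain [lo, hi] has global
    sensitivity delta_f = hi - lo (Definition 2), so Laplace noise with
    scale delta_f / epsilon suffices.
    """
    rng = rng or np.random.default_rng()
    delta_f = hi - lo                              # global sensitivity
    x = float(np.clip(x, lo, hi))                  # clamp so the sensitivity bound applies
    noise = rng.laplace(loc=0.0, scale=delta_f / epsilon)
    return x + noise

# Example: a heart-rate reading in [40, 200] bpm reported with epsilon = 1.
print(laplace_mechanism(72.0, lo=40.0, hi=200.0, epsilon=1.0))
```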

3 Measurement Noise Under LDP

3.1 Introduction

To realize smart cities, the collection and analysis of personal data through IoT devices is indispensable. However, it is crucial to consider the noise present in data measured by IoT devices, and the protection of privacy is imperative. In this section, we propose methodologies suitable for this scenario. An original value without error is referred to as a “true” value; the data owner or anonymizer may not know these values. Conversely, sensed values that may contain errors are denoted as “measured” values. Existing studies on differential privacy consider only measured values, not true values. Our study investigates whether additional noise must be introduced to protect privacy if the target value already contains error. This research proposes a new privacy model that safeguards the true value rather than the measured value. Because the data owner may not know the true value, the true data are assumed to follow a specific probability distribution, such as the normal distribution. This probability distribution is based on the data owner’s or anonymizer’s knowledge or on the theory of errors [67]. The distinction between the traditional approach to differential privacy and the proposed true-value-based differential privacy (TDP) is illustrated in Fig. 2. Under TDP, the amount of noise added to the measured value can be reduced.

We assume that the anonymizer can estimate the distribution of measurement error to some degree. TDP can then be achieved even if the anonymizer’s estimate is inaccurate, as long as the anonymizer does not overestimate the magnitude of the sensing error. The relationship between the anonymizer’s knowledge of the error distribution and TDP is presented in Table 1. Consequently, if the anonymizer is uncertain about the error distribution, TDP can be guaranteed by conservatively (under)estimating the amount of error. If the amount of error is assumed to be zero, the outcome coincides with traditional differential privacy. Thus, TDP can decrease the amount of noise introduced relative to traditional differential privacy while still achieving the privacy protection level specified by \(\epsilon \).

If we possess no information about the error distribution, the proposed method in this chapter cannot be employed. However, we believe that there are numerous situations where it is feasible to make estimates under the condition that we can underestimate the amount of error.

Fig. 2
Concept of true-value-based differential privacy (TDP). Traditional differential privacy adds LDP noise to the measured value (the true, unknown value plus measurement error). In contrast, TDP adds LDP noise to the true value after considering the measurement error

Table 1 Relationship between the error distribution knowledge and the TDP

The motivation, research gap, and contribution of this study are summarized below.

Motivation: This study aims to estimate the distribution of personal data sensed in IoT environments while protecting user data using differential privacy. We assume that the sensed data contains sensing noise.

Research gap: Existing methods do not take sensing noise into account. As a result, they introduce excessive privacy noise into the sensed data.

Contribution: First, we propose true-value-based differential privacy (TDP), a novel differential privacy concept that considers sensing noise. Second, we propose anonymization algorithms for numerical and categorical data that satisfy TDP. Third, we demonstrate that the proposed algorithms ensure TDP. Fourth, we show that the proposed algorithms can reduce the amount of differential privacy noise using synthetic and real datasets. Fifth, we illustrate that the proposed algorithms can decrease error in the estimated distribution for personal data using the same datasets.

3.2 Models

3.2.1 Assumptions

Anonymizers may not know the true values of an attribute, but they can estimate them; these estimates may contain error. Anonymizers can also estimate the error distribution of numerical attribute values. The normal distribution is used as the error model for numerical attributes, as measurement errors follow the normal distribution in many cases [37]. The normal distribution is characterized by the parameter \(\sigma \), which represents its standard deviation. However, please note that the concept of TDP can be applied to other error models.

The probability of wrong classification \(p_{i \rightarrow j}\) is considered for categorical attributes. This probability represents the case in which the true category ID is i but the sensing process reports category ID j; the anonymizer does not know the true category ID.

In this section, parameters \(\sigma \) and \(p_{i\rightarrow j}\) for all i, j are referred to as “error parameters.”

Three scenarios are assumed.

Scenario I: The anonymizer knows the exact error parameters.

Scenario II: The anonymizer does not know the exact error parameters. The estimated parameters may differ from the actual parameters; however, the anonymizer is not pessimistic about the degree of error. The mathematical definition of this condition for numerical attributes is given in Sect. 3.3.1, and that for categorical attributes in Sect. 3.3.2.

Scenario III: The anonymizer does not know the exact error parameters and has no estimate for them.

In this chapter, we do not focus on Scenario III. As Scenario I is somewhat unrealistic, we generally focus on Scenario II.

3.2.2 Privacy Metric

Suppose that a person has an attribute value, and the person or the anonymizer who collects the attribute value anonymizes it. Let \(\epsilon \) be a positive real number. Then, differential privacy is defined as in Definition 1 (Sect. 2.3).

In this section, it is considered that the value of x may contain sensing error. Therefore, the focus must be placed on the true value of x, which is an unknown value, even for the data owner and the anonymizer. TDP is proposed to handle the privacy of unknown values.

Definition 3 (TDP)

Let x and \(x'\) be true values and let \(\epsilon \) be a positive real number. A measurement function \(\mathcal {M}\) acquires an input x and outputs a measured value. A randomized mechanism \(\mathcal {A}\) satisfies TDP if and only if for any output y, the following inequality holds:

$$\begin{aligned} P(\mathcal {A}(\mathcal {M}(x))=y) \le e^\epsilon P(\mathcal {A}(\mathcal {M}(x'))=y) \quad \text {for all } x, x'. \end{aligned}$$
(4)

Theorem 2

In an anonymized data collection scenario, Definition 1 is the same as Definition 3 when the measured values contain no error.

Proof

When the measured values contain no error, the equations \(x=\mathcal {M}(x)\) and \(x'=\mathcal {M}(x')\) hold. Therefore, in this case, Eqs. 1 and 4 are equivalent. \(\square \)

3.3 True-Value-Based Differential Privacy (TDP)

Table 2 Notation

Existing studies define x and \(x'\) in Definition 1 as measured values. In this section, they are defined as true values. The anonymization mechanisms for both numerical and categorical attributes are described next (Table 2).

3.3.1 Numerical Value Anonymization

The Laplace mechanism (Theorem 1), which adds noise based on the Laplace distribution, can be used for numerical attributes. However, the Laplace mechanism does not take sensing error into consideration. In the traditional approach, normally distributed noise has already been added to the true value as sensing error, and additional noise based on the Laplace mechanism is then added to this noisy value. This traditional approach, which always adds Laplace noise, is referred to as the baseline approach for numerical attributes. The resulting probability density function, which represents the probability of the distance between the final noisy value and the true value, can be calculated by convolving the normal and Laplace distributions.

Let \(\mathcal {N}(x; \sigma ^2)\) and \(\mathcal {L}(x; b)\) represent the probability density functions of the normal distribution with standard deviation \(\sigma \) and the Laplace distribution with scale parameter b, respectively. Without loss of generality, only centered distributions that peak at zero are considered.

A convolution of the normal distribution with a standard deviation of \(\sigma \) and of the Laplace distribution with a scale parameter of b is represented by

$$\begin{aligned} \begin{aligned} &\mathcal {U}(x; \sigma ^2, b) = \mathcal {N}\star \mathcal {L}= \int _{t=-\infty }^\infty \mathcal {N}(t; \sigma ^2)\mathcal {L}(x-t; b) dt\\ &=\frac{e^{\frac{\sigma ^2-2 b x}{2 b^2}} \left( \text {erfc}\left( \frac{\sigma ^2-b x}{\sqrt{2} b \sigma }\right) +e^{\frac{2 x}{b}} \text {erfc}\left( \frac{\sigma ^2 + b x}{\sqrt{2} b \sigma }\right) \right) }{4 b} \end{aligned} \end{aligned}$$
(5)

where erfc is the complementary error function, which is represented by

$$\begin{aligned} \text {erfc}(x)=\frac{2}{\sqrt{\pi }} \int _x^\infty e^{-t^2}dt. \end{aligned}$$
(6)
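As a sanity check on Eq. 5, the following sketch (assuming SciPy is available) evaluates the closed-form convolution \(\mathcal {U}(x; \sigma ^2, b)\) and compares it with a direct numerical convolution of the two densities; the function names are illustrative.

```python
import numpy as np
from scipy import integrate, special, stats

def u_closed_form(x, sigma, b):
    """Closed form of Eq. 5: convolution of N(0, sigma^2) and Laplace(0, b)."""
    pre = np.exp((sigma**2 - 2.0 * b * x) / (2.0 * b**2)) / (4.0 * b)
    t1 = special.erfc((sigma**2 - b * x) / (np.sqrt(2.0) * b * sigma))
    t2 = np.exp(2.0 * x / b) * special.erfc((sigma**2 + b * x) / (np.sqrt(2.0) * b * sigma))
    return pre * (t1 + t2)

def u_numeric(x, sigma, b):
    """Direct numerical convolution of the two densities, for comparison."""
    f = lambda t: stats.norm.pdf(t, scale=sigma) * stats.laplace.pdf(x - t, scale=b)
    return integrate.quad(f, -np.inf, np.inf)[0]

sigma, b = 1.0, 1.0   # sigma = 1 and b = Delta/epsilon = 1, as in Fig. 3
for x in (-2.0, -0.5, 0.0, 1.5):
    print(x, u_closed_form(x, sigma, b), u_numeric(x, sigma, b))
```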

It is noted that for Scenario II, the value of \(\sigma \) can be wrong, as long as it is not pessimistic. Let \(\sigma _t\) and \(\sigma \) represent the true standard deviation and the standard deviation assumed by the anonymizer, respectively. Here, pessimistic means that

$$\begin{aligned} \sigma > \sigma _t. \end{aligned}$$
(7)

Figure 3 plots \(\exp (\epsilon )\), \(1/\exp (\epsilon )\), and the ratio of the probability density function values whose distance is \(\Delta \) for \(\mathcal {N}(x; \sigma ^2)\), \(\mathcal {L}(x; \Delta /\epsilon )\), and \(\mathcal {U}(x; \sigma ^2, \Delta /\epsilon )\), with \(\epsilon \) and \(\sigma \) set to one. The ratio of the probability density function values whose distance is \(\Delta \) with respect to the normal distribution is calculated by

$$\begin{aligned} R_{\mathcal {N}(x; \sigma ^2)} = \frac{\mathcal {N}(x+\Delta /2; \sigma ^2)}{\mathcal {N}(x-\Delta /2; \sigma ^2)}=e^{-\frac{\Delta x}{\sigma ^2}}. \end{aligned}$$
(8)

Equation 8 shows that \(R_{\mathcal {N}(x; \sigma ^2)}\) approaches \(\infty \) when x is close to \(-\infty \). Therefore, even if \(\sigma \) is very large, extra noise needs to be added to achieve \(\epsilon \)-differential privacy.

Similarly, in Fig. 3, \(R_{\mathcal {L}(x; \epsilon , \Delta )}\) and \(R_{\mathcal {U}(x; \sigma ^2, \epsilon , \Delta )}\) are defined as the ratio of the probability density function values whose distance is \(\Delta \) with respect to \(\mathcal {L}(x; \Delta /\epsilon )\) and \(\mathcal {U}(x; \sigma ^2, \Delta /\epsilon )\), respectively.

Fig. 3
Ratio of probability density function values of the normal distribution and Laplace distribution, and the convolution of the two distributions (\(\sigma =\epsilon =\Delta =1\))

The ratio of the probability density function values whose distance is \(\Delta \) should appear between the lines of \(\exp (\epsilon )\) and \(1/\exp (\epsilon )\), according to the definition of \(\epsilon \)-differential privacy. Figure 3 shows that \(R_{\mathcal {L}(x;\epsilon , \Delta )}\) and \(R_{\mathcal {U}(x; \sigma ^2, \epsilon , \Delta )}\) satisfy this condition; therefore, the \(\mathcal {L}(x; \Delta /\epsilon )\) and \(\mathcal {U}(x; \sigma ^2, \Delta /\epsilon )\) mechanisms achieve \(\epsilon \)-differential privacy (here \(\sigma =\Delta =\epsilon = 1\)). Although \(R_{\mathcal {U}(x; \sigma ^2, \Delta /\epsilon )}\) approaches \(\exp (\epsilon )\) (or \(1/\exp (\epsilon )\)) when |x| is large, its convergence to \(\exp (\epsilon )\) (or \(1/\exp (\epsilon )\)) is slower than that of \(R_{\mathcal {L}(x; \Delta /\epsilon )}\). Consequently, the baseline mechanism \(\mathcal {U}\) adds much more noise than is required.

The algorithm proposed in this section is simple but effective; Laplace noise is not added when the calculated Laplace noise is smaller than the predefined threshold w. Thus, the total loss is expected to become smaller (i.e., the ratio of the probability density function values whose distance is \(\Delta \) is expected to approach \(\exp (\epsilon )\) and \(1/\exp (\epsilon )\) faster).

However, the definition of an appropriate value for w is complex. If the threshold w is very large, the resulting value cannot achieve either traditional \(\epsilon \)-differential privacy or TDP. Conversely, the resulting value contains unnecessary noise if the threshold w is very small.

The probability density function of the Laplace noise that is added only when the noise x satisfies \(|x| \ge w\) (and is set to zero otherwise) is represented by

$$\begin{aligned} \widehat{\mathcal {L}}(x; b, w) = {\left\{ \begin{array}{ll} \int _{-w}^w \mathcal {L}(t; b) dt &{} x=0 \\ \frac{e^{-x/b}}{2b} &{} x\ge w \\ \frac{e^{x/b}}{2b} &{} x\le -w \\ 0&{} otherwise. \end{array}\right. } \end{aligned}$$
(9)

Therefore, the probability density function obtained from the original sensing error and the Laplace noise defined in Eq. 9 can be represented by

$$\begin{aligned} \begin{aligned} &\mathcal {V}(x; \sigma ^2, b,w) = \int _{-\infty }^\infty \mathcal {N}(t; \sigma ^2) \widehat{\mathcal {L}}(x-t; b, w)dt \\ &+\mathcal {N}(x; \sigma ^2) \int _{-w}^w\mathcal {L}(t; b) dt \\ &=\frac{e^{-\frac{w+x}{b}-\frac{x^2}{2 \sigma ^2}}}{4 b \sigma } \times \Bigg \{\sigma e^{\frac{1}{2} \left( \frac{2 b w+\sigma ^2}{b^2}+\frac{x^2}{\sigma ^2}\right) } \Big [\text {erfc}\left( \frac{b (w-x)+\sigma ^2}{\sqrt{2} b \sigma }\right) \\ &+e^{\frac{2 x}{b}} \text {erfc}\left( \frac{b (w+x)+\sigma ^2}{\sqrt{2} b \sigma }\right) \Big ]+2 \sqrt{\frac{2}{\pi }} b \left( e^{\frac{w}{b}}-1\right) e^{\frac{x}{b}}\Bigg \} \end{aligned}. \end{aligned}$$
(10)

For the proposed algorithm, the ratio of the probability density function values whose distance is \(\Delta \) is represented by the following:

$$\begin{aligned} R_{\mathcal {V}(x; \sigma ^2, \epsilon , \Delta , w)} = \frac{\mathcal {V}(x+\Delta /2; \sigma ^2, \Delta /\epsilon , w)}{\mathcal {V}(x-\Delta /2; \sigma ^2, \Delta /\epsilon , w)} \end{aligned}$$
(11)

The objective is to find an appropriate value of w such that \(R_\mathcal {V}\) approximates \(\exp (\epsilon )\) but \(R_\mathcal {V}\) does not overestimate \(\exp (\epsilon )\) or \(1/\exp (\epsilon )\).

Fig. 4
\(R_\mathcal {V}\) for various values of w (\(\sigma =\epsilon =\Delta =1\)). If the value of w is too large, the requirement for differential privacy is not met; if the value of w is too small, more noise is added than is necessary

The following theorem is considered (see Fig. 4):

Theorem 3

If w is near \(\infty \), the value of \(R_{\mathcal {V}}\) approaches the value of \(R_{\mathcal {N}}\). If w is near zero, the value of \(R_{\mathcal {V}}\) approaches the value of \(R_{\mathcal {U}}\).

Proof

\(\mathcal {U}(x; \sigma ^2, \Delta /\epsilon )\) (Eq. 5) and \(\mathcal {N}(x; \sigma ^2)\) can be obtained by taking the limit of \(\mathcal {V}(x; \sigma ^2, \Delta /\epsilon ,w)\) (Eq. 10) as w approaches zero and \(\infty \), respectively.\(\square \)

The ratio between \(x + \Delta /2\) and \(x-\Delta /2\) is defined in this study; therefore, the range \(-w-\Delta /2 < x < 0\) can be considered to check whether or not the maximum ratio is greater than \(\exp (\epsilon )\). It is noted that only the range \(x<0\) needs to be checked because \(\mathcal {V}\) is symmetrical with respect to the point \((x,y)=(0,1)\), where y represents the ratio of the probability density function values whose distance is \(\Delta \).

Algorithm 1 describes the method that yields the anonymized value. In Algorithm 1, the value of w is calculated at Lines 1–16. \(\text {erfc}(x)\) can be computed using approximate equations, such as

$$\begin{aligned} \begin{aligned} &\text {erfc}(x) = 1- \text {erf}(x) \approx 1-\sqrt{1-e^{-x^2 \frac{4/\pi +0.147x^2}{1+0.147x^2} }}\\ &(\text {maximum relative error:\,} 1.3\cdot 10^{-4}) \end{aligned} \end{aligned}$$
(12)

when \(x \ge 0\) from [76]. Note that we can obtain an approximate value of \(\text {erfc}(x)\) with \(x<0\) from the property of

$$\begin{aligned} \text {erfc}(x) = 2-\text {erfc}(-x). \end{aligned}$$
(13)

After checking the approximate values, precise values must be calculated. Mathematical tools such as Maxima, a popular free software program, can be employed.
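The following short sketch illustrates the approximation of Eqs. 12 and 13 and compares it with Python’s built-in math.erfc; it is meant only as an illustration of the approximation error, and the function name is illustrative.

```python
import math

def erfc_approx(x):
    """Approximation of erfc from Eqs. 12 and 13 (max. relative error ~1.3e-4)."""
    if x < 0.0:
        return 2.0 - erfc_approx(-x)                       # Eq. 13
    x2 = x * x
    erf_approx = math.sqrt(1.0 - math.exp(-x2 * (4.0 / math.pi + 0.147 * x2)
                                          / (1.0 + 0.147 * x2)))
    return 1.0 - erf_approx                                # Eq. 12

for x in (-1.5, -0.3, 0.0, 0.5, 2.0):
    print(x, erfc_approx(x), math.erfc(x))
```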

Algorithm 1
Proposed randomization mechanism for numerical attributes. Inputs: privacy budget \(\epsilon \), standard deviation \(\sigma \) of the sensing-error distribution, range of possible values \(\Delta \), and measured value \(v_s\). Output: the TDP value
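The following is a minimal sketch of the mechanism just described, not the original Algorithm 1 listing: it evaluates the density \(\mathcal {V}\) numerically (instead of using the closed form in Eq. 10), bisects for the largest threshold w for which \(R_\mathcal {V}\) stays within \([1/e^\epsilon , e^\epsilon ]\), and then adds Laplace noise only when its magnitude is at least w. The bisection assumes that the bound violation is monotone in w, NumPy/SciPy are assumed available, and all names and parameter values are illustrative.

```python
import numpy as np
from scipy import integrate, stats

def v_density(x, sigma, b, w):
    """Density of (released value - true value): sensing noise N(0, sigma^2)
    followed by Laplace noise that is skipped when |l| < w (a numerical
    stand-in for the closed form in Eq. 10)."""
    left = integrate.quad(
        lambda t: stats.norm.pdf(t, scale=sigma) * np.exp(-(x - t) / b) / (2 * b),
        -np.inf, x - w)[0]
    right = integrate.quad(
        lambda t: stats.norm.pdf(t, scale=sigma) * np.exp((x - t) / b) / (2 * b),
        x + w, np.inf)[0]
    skipped_mass = 1.0 - np.exp(-w / b)          # probability of skipping (Eq. 27)
    return left + right + stats.norm.pdf(x, scale=sigma) * skipped_mass

def satisfies_bound(w, sigma, delta, epsilon, grid=40):
    """Check R_V(x) <= exp(epsilon) on a grid over (-w - delta/2, 0) (Eq. 11);
    by the symmetry of V around zero, only this range needs to be checked."""
    b = delta / epsilon
    for x in np.linspace(-w - delta / 2.0, -1e-6, grid):
        num = v_density(x + delta / 2.0, sigma, b, w)
        den = v_density(x - delta / 2.0, sigma, b, w)
        if den <= 0.0 or num / den > np.exp(epsilon):
            return False
    return True

def find_threshold(sigma, delta, epsilon, iters=20):
    """Bisect for the largest skipping threshold w that keeps the ratio within
    bounds; monotonicity of the violation in w is assumed (cf. Fig. 4)."""
    lo, hi = 0.0, 10.0 * delta / epsilon + 5.0 * sigma
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if satisfies_bound(mid, sigma, delta, epsilon) else (lo, mid)
    return lo

def tdp_numerical(measured, sigma, delta, epsilon, rng=None):
    """Release a measured numerical value: add Laplace noise only when it is
    at least w in magnitude; otherwise release the measured value as-is."""
    rng = rng or np.random.default_rng()
    w = find_threshold(sigma, delta, epsilon)
    noise = rng.laplace(scale=delta / epsilon)
    return measured if abs(noise) < w else measured + noise

print(tdp_numerical(measured=72.0, sigma=10.0, delta=100.0, epsilon=4.0))
```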

3.3.2 Categorical Values Anonymization

The randomized response mechanism [75] can be used for categorical attributes. First, a sensed value is categorized into one of the predefined categories. Another category replaces that category with a certain probability, and then the resulting category ID is sent to the data receiver. The randomized response is referred to as the baseline approach for categorical attributes.

The probability of retaining the original category ID is \(p_\alpha \), and the probability of reporting each of the other IDs is \((1-p_\alpha )/(M-1)\), where M is the number of categories. The condition

$$\begin{aligned} \max \left( \frac{p_\alpha }{(1-p_\alpha )/(M-1)}, \frac{(1-p_\alpha )/(M-1)}{p_\alpha }\right) \le e^{\epsilon } \end{aligned}$$
(14)

should hold to satisfy \(\epsilon \)-differential privacy. Therefore, the following is set:

$$\begin{aligned} p_\alpha = e^\epsilon /(M-1+e^\epsilon ). \end{aligned}$$
(15)

Because \(\epsilon > 0\), \(p_\alpha \) is always larger than the probability \((1-p_\alpha )/(M-1)\) assigned to each of the other categories; in particular, \(p_\alpha > 0.5\) when \(M=2\).
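For reference, a minimal sketch of this baseline randomized response (Eqs. 14 and 15) is shown below, assuming NumPy; the function name is illustrative.

```python
import numpy as np

def randomized_response(true_id, M, epsilon, rng=None):
    """Baseline randomized response over M categories (Eqs. 14 and 15):
    keep the ID with probability p_alpha, otherwise report a uniformly
    chosen different ID."""
    rng = rng or np.random.default_rng()
    p_alpha = np.exp(epsilon) / (M - 1 + np.exp(epsilon))   # Eq. 15
    if rng.random() < p_alpha:
        return true_id
    others = [c for c in range(M) if c != true_id]
    return int(rng.choice(others))

print(randomized_response(true_id=2, M=5, epsilon=1.0))
```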

Let \(p_{i\rightarrow j}\) represent the probability that the true category ID \(C_i\) is (mis-)classified to \(C_j\) due to sensing error. It is assumed that the retention probability is greater than any other probability; that is, the following inequality is assumed:

$$\begin{aligned} p_{i\rightarrow i} > \max _{j \ne i} p_{i\rightarrow j}. \end{aligned}$$
(16)

It is assumed that the values of \(p_{i\rightarrow j}\) for all i, j can be estimated. Let

$$\begin{aligned} \boldsymbol{p_i}= \{ p_{i \rightarrow 1}, p_{i\rightarrow 2}, \ldots , p_{i\rightarrow M} \}. \end{aligned}$$
(17)

For Scenario II, these values can be wrong, as long as they are not pessimistic.

Let \(p_{i \rightarrow j, t}\) and \(p_{i \rightarrow j}\) represent the true probability and the probability that the anonymizer assumes, respectively. Here, pessimistic estimation means that

$$\begin{aligned} {\left\{ \begin{array}{ll} &{}p_{i \rightarrow i} < p_{i \rightarrow i, t} \text {\,\,for\,\,any\,\,} i,\\ &{}p_{i \rightarrow j} > p_{i \rightarrow j, t} \text {\,\,for\,\,any\,\,} i, j (i\ne j). \end{array}\right. } \end{aligned}$$
(18)

First, consider the case in which the following condition is satisfied:

$$\begin{aligned} \frac{p_{i \rightarrow j}}{p_{i' \rightarrow j}} \le e^\epsilon \text {\,\,for all\,\,} i,i',j. \end{aligned}$$
(19)

In this case, TDP clearly holds: the randomized mechanism \(\mathcal {A}\) in Definition 3 does not need to do anything, and TDP is satisfied by outputting the measured values as they are.

If Eq. 19 is not satisfied, the following simultaneous equations with respect to \(x_{i\rightarrow j}\) for all i and j are solved:

$$\begin{aligned} \begin{aligned} \boldsymbol{p_i} \cdot \boldsymbol{x_i} &= p_\alpha \,\,\,\,\textrm{for}\,\,\,\, i=1, \ldots , M, \\ \boldsymbol{p_i} \cdot \boldsymbol{x_j} &= \frac{1-p_\alpha }{M-1} \,\,\,\,\textrm{for}\,\,\,\, i,j=1, \ldots , M \,\,\,\,\mathrm {s.t.}\,\,\,\, i\ne j, \end{aligned} \end{aligned}$$
(20)

where

$$\begin{aligned} \boldsymbol{x_i} = \{x_{1\rightarrow i}, x_{2\rightarrow i}, \ldots , x_{M\rightarrow i} \} \end{aligned}$$
(21)

and \(\cdot \) represents the scalar product of two vectors.

The value of \(x_{i \rightarrow i}\) may be greater than one, and the value of \(x_{i \rightarrow j}\) may be less than zero. Therefore, the obtained values are normalized by

$$\begin{aligned} \begin{aligned} x_{i\rightarrow i} &\leftarrow \min (1, x_{i \rightarrow i})\,\,\,\,\textrm{for}\,\,\,\, i=1, \ldots , M, \\ x_{i\rightarrow j} &\leftarrow \max (0, x_{i \rightarrow j})\,\,\,\,\textrm{for}\,\,\,\, i,j=1, \ldots , M \,\,\,\,\mathrm {s.t.}\,\,\,\, i\ne j. \end{aligned} \end{aligned}$$
(22)

Finally, when the measured category ID is \(C_i\), the anonymizer generates the anonymized version \(C_j\) with probability \(x_{i\rightarrow j}\).

Algorithm 2 shows the method that yields the anonymized category ID.

Algorithm 2
Proposed randomization mechanism for categorical attributes. Inputs: privacy budget \(\epsilon \), probabilities \(p_{i \rightarrow j}\) for all i and j, measured category ID, and the set of category IDs K. Outputs: TDP values for Scenarios I and II (using Eqs. 15, 19, 20, and 22)
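As with Algorithm 1, the following is only a sketch of the categorical mechanism described above, not the original listing: it checks Eq. 19, solves the linear system of Eq. 20 with NumPy, and clips the solution as in Eq. 22. Renormalizing the clipped row before sampling is an implementation choice of this sketch, not part of the original text, and all names and toy values are illustrative.

```python
import numpy as np

def tdp_categorical(measured_id, P, epsilon, rng=None):
    """Categorical TDP sketch. P[i, j] is the assumed misclassification
    probability p_{i->j}; each row of P must sum to one."""
    rng = rng or np.random.default_rng()
    M = P.shape[0]
    # Eq. 19: if the sensing error alone already satisfies the epsilon bound,
    # output the measured category unchanged.
    if np.all(P.max(axis=0) <= np.exp(epsilon) * P.min(axis=0)):
        return measured_id
    # Eq. 20: solve P @ X = T, where column j of T equals p_alpha at
    # position j and (1 - p_alpha)/(M - 1) elsewhere.
    p_alpha = np.exp(epsilon) / (M - 1 + np.exp(epsilon))          # Eq. 15
    T = np.full((M, M), (1.0 - p_alpha) / (M - 1))
    np.fill_diagonal(T, p_alpha)
    X = np.linalg.solve(P, T)                                      # X[i, j] = x_{i->j}
    # Eq. 22: clip the solution into [0, 1].
    diag = np.clip(np.diag(X), None, 1.0)
    X = np.clip(X, 0.0, None)
    np.fill_diagonal(X, diag)
    # The clipped row may no longer sum to one; renormalizing is a choice
    # made by this sketch so that sampling is possible.
    probs = X[measured_id] / X[measured_id].sum()
    return int(rng.choice(M, p=probs))

# Toy example with M = 3 categories and a mildly noisy sensor.
P = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
print(tdp_categorical(measured_id=0, P=P, epsilon=1.0))
```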

3.3.3 Proof of Achieving True-Value-Based Differential Privacy

Next, it is proved that the proposed algorithms (for Scenarios I and II) realize TDP.

Numerical Attributes First, Scenario I is considered. Because Algorithm 1 ensures that \(1/\exp (\epsilon ) \le R_{\mathcal {V}(x; \sigma ^2, \epsilon , \Delta ,w)} \le \exp (\epsilon )\) for the true value if \(\sigma \) is correct, it achieves TDP based on Definition 3.

Next, Scenario II is considered. It is assumed that the anonymizer’s knowledge about the sensing error is not correct, but that their assumption about the measurement error is not pessimistic. The concept “pessimistic” is defined in Eq. 7 in relation to numerical attributes.

Let the ratio of the probability density function values whose distance is \(\Delta \) with respect to \(\mathcal {N}(x; \sigma ^2)\) be \(R_{\mathcal {N}(x; \sigma ^2)}\). By differentiating \(R_{\mathcal {N}(x; \sigma ^2)}\) with respect to \(\sigma \), we obtain

$$\begin{aligned} \frac{\partial R_{\mathcal {N}(x; \sigma ^2)}}{\partial \sigma } = \frac{2 \Delta e^{-\frac{\Delta x}{\sigma ^2}}x}{\sigma ^3}. \end{aligned}$$
(23)

When x is less than zero, the derivative of \(R_{\mathcal {N}(x; \sigma ^2)}\) with respect to \(\sigma \) is always negative. Therefore, if \(\sigma \) becomes larger, the value of \(R_{\mathcal {N}(x; \sigma ^2)}\) becomes smaller. It can be concluded that \(R_{\mathcal {V}(x; \sigma ^2, \epsilon , \Delta ,w)}\) also becomes smaller as \(\sigma \) becomes larger, because the proposed probability density function \(\mathcal {V}(x; \sigma ^2, \Delta /\epsilon , w)\) is the convolution of \(\mathcal {N}(x; \sigma ^2)\) with the distribution in Eq. 9, which does not depend on \(\sigma \). Therefore, if the anonymizer’s assumption about the measurement error is not pessimistic (i.e., \(\sigma \le \sigma _t\)), then \(R_{\mathcal {V}(x; \sigma _t^2, \epsilon , \Delta ,w)} \le R_{\mathcal {V}(x; \sigma ^2, \epsilon , \Delta ,w)}\) for \(x\le 0\). If the anonymizer sets the error parameters very conservatively (i.e., sets \(\sigma \) to a value much smaller than \(\sigma _t\)), the amount of noise added by the proposed mechanism is larger than the amount needed. Although the usefulness of the proposed algorithm is lower in this case, the ratio of the anonymization probabilities generated by the proposed mechanism for two neighboring values remains between \(\exp (\epsilon )\) and \(1/\exp (\epsilon )\), with some margin to spare, and the total loss of the proposed mechanism is still less than that of the baseline approach. The case \(x>0\) is symmetric, and \(R_{\mathcal {V}(x; \sigma _t^2, \epsilon , \Delta ,w)} \ge R_{\mathcal {V}(x; \sigma ^2, \epsilon , \Delta ,w)}\) for \(x>0\).

Because Algorithm 1 chooses w such that \(1/\exp (\epsilon ) \le R_{\mathcal {V}(x; \sigma ^2, \epsilon , \Delta ,w)} \le \exp (\epsilon )\) for the assumed \(\sigma ^2\), it follows that \(1/\exp (\epsilon ) \le R_{\mathcal {V}(x; \sigma _t^2, \epsilon , \Delta ,w)} \le \exp (\epsilon )\) for the true \(\sigma _t^2\). Therefore, Definition 3 holds.

Categorical Attributes First, Scenario I is considered. It is assumed that the attacker obtains a category ID \(\gamma \) as the anonymized version of a categorical attribute. Let \(P(v_a = \gamma | v_t = i)\) represent the probability that the anonymized category ID is \(\gamma \) when the true category ID is i. The proposed mechanism ensures that

$$\begin{aligned} P(v_a = \gamma | v_t = i)={\left\{ \begin{array}{ll} \frac{e^\epsilon }{M-1+e^\epsilon } &{} (i=\gamma ) \\ \frac{1-\frac{e^\epsilon }{M-1+e^\epsilon }}{M-1} &{} ({\text {otherwise}}) \end{array}\right. } \end{aligned}$$
(24)

when we ignore the process in Eq. 22. The ratio of the two probabilities in Eq. 24 is \(e^\epsilon \) or \(1/e^\epsilon \). Therefore, Definition 3 holds. Based on the post-processing property of differential privacy, the values resulting from the process of Eq. 22 also satisfy TDP.

Next, Scenario II is considered. It is assumed that the anonymizer’s knowledge about the sensing error is not correct but their assumption about the measurement error is not pessimistic. Let \(x_{i \rightarrow j, t}\) and \(x_{i \rightarrow j}\) represent the disguising probabilities based on the true error parameters and the assumed error parameters, respectively. If the error parameters are not pessimistic, then

$$\begin{aligned} {\left\{ \begin{array}{ll} x_{i \rightarrow j, t} \ge x_{i \rightarrow j} &{} (i=j) \\ x_{i \rightarrow j, t} \le x_{i \rightarrow j} &{} ({\text {otherwise}}.) \end{array}\right. } \end{aligned}$$
(25)

Therefore,

$$\begin{aligned} {\left\{ \begin{array}{ll} P(v_a = \gamma | v_t = i) \le \frac{e^\epsilon }{M-1+e^\epsilon } &{} (i = \gamma ) \\ P(v_a = \gamma | v_t = i) \ge \frac{1-\frac{e^\epsilon }{M-1+e^\epsilon }}{M-1} &{} ({\text {otherwise}}) \end{array}\right. } \end{aligned}$$
(26)

From Eqs. 16 and 26, it is concluded that Definition 3 holds.

3.4 Analysis

3.4.1 Numerical Attributes

The proposed mechanism skips the addition of Laplace noise if the generated Laplace noise l is less than the threshold w. Then, the avoidance (or skipping) ratio can be calculated by

$$\begin{aligned} \begin{aligned} \int _{-w}^w \mathcal {L}(x; \Delta /\epsilon )dx =1-e^{-\epsilon w/\Delta }. \end{aligned} \end{aligned}$$
(27)

Let \(\eta _\mathcal {U}\) and \(\eta _\mathcal {V}\) represent the expected values of the amount of additional Laplace noise with respect to the baseline approach and the proposed mechanism, respectively. The value of \(\eta _\mathcal {U}\) can be calculated by

$$\begin{aligned} \eta _\mathcal {U} = \int _{-\infty }^\infty |x| \cdot \mathcal {L}(x; \Delta /\epsilon )dx = \frac{\Delta }{\epsilon }, \end{aligned}$$
(28)

and the value of \(\eta _\mathcal {V}\) can be calculated by

$$\begin{aligned} \begin{aligned} &\eta _\mathcal {V} = \int _{-\infty }^{-w} -x \mathcal {L}(x; \Delta /\epsilon )dx + \int _{w}^{\infty } x \mathcal {L}(x; \Delta /\epsilon )dx\\ &=e^{-\frac{w \epsilon }{\Delta }}(\frac{\Delta }{\epsilon }+w) \end{aligned} \end{aligned}$$
(29)
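The following Monte Carlo sketch checks Eqs. 27–29 empirically for an illustrative threshold w (chosen arbitrarily here, not derived by the algorithm); NumPy is assumed and all names are illustrative.

```python
import numpy as np

def laplace_noise_stats(delta, epsilon, w, n=200_000, rng=None):
    """Empirical skip ratio and expected magnitude of the noise actually
    added, to compare with Eqs. 27-29."""
    rng = rng or np.random.default_rng(0)
    b = delta / epsilon
    noise = rng.laplace(scale=b, size=n)
    skipped = np.abs(noise) < w
    skip_ratio = skipped.mean()                     # compare with Eq. 27
    eta_v = np.where(skipped, 0.0, np.abs(noise)).mean()   # compare with Eq. 29
    return skip_ratio, eta_v

delta, epsilon, w = 100.0, 4.0, 30.0
skip_ratio, eta_v = laplace_noise_stats(delta, epsilon, w)
print("skip ratio:", skip_ratio, "theory:", 1 - np.exp(-epsilon * w / delta))
print("added noise:", eta_v, "theory:",
      np.exp(-w * epsilon / delta) * (delta / epsilon + w),
      "baseline eta_U:", delta / epsilon)
```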

3.4.2 Categorical Attributes

Let \(\zeta _\mathcal {U}\) and \(\zeta _\mathcal {V}\) represent the probabilities that the true category ID is equal to the anonymized category ID under the baseline approach and the proposed mechanism, respectively. The baseline approach always adds Laplace noise for numerical attributes and always applies the randomized response for categorical attributes, as described in Sects. 3.3.1 and 3.3.2. Assuming that the true category ID is i,

$$\begin{aligned} \zeta _\mathcal {U} = p_{i \rightarrow i} \cdot p_\alpha + \sum _{j \ne i} p_{i \rightarrow j}\cdot \frac{1-p_\alpha }{M-1}, \end{aligned}$$
(30)

and

$$\begin{aligned} \zeta _\mathcal {V}= p_{i \rightarrow i} \cdot x_{i \rightarrow i} + \sum _{j \ne i} p_{i \rightarrow j}\cdot x_{j \rightarrow i}. \end{aligned}$$
(31)

3.5 Evaluation

3.5.1 Utility Metric

The data receiver intends to use the anonymized value for several services. Therefore, the estimated value should be close to the true value. Let N represent the number of people whose attribute values are collected. Let \(v_i\) and \(\widetilde{v_i}\) represent the true value and the anonymized value, respectively, of an attribute of person i.

The utility is defined as follows with respect to numerical attributes:

$$\begin{aligned} U_n = \frac{1}{N}\sum _{i=1}^N \left( 1- \frac{|v_i - \widetilde{v_i}|}{\Delta }\right) , \end{aligned}$$
(32)

whereas the utility is defined as follows with respect to categorical attributes:

$$\begin{aligned} U_c = \frac{1}{N} \sum _{i=1}^N \delta _{v_i, \widetilde{v_i}}, \end{aligned}$$
(33)

where \(\delta _{i,j}\) is the Kronecker delta

$$\begin{aligned} \delta _{i,j}= {\left\{ \begin{array}{ll} 1 &{} (i=j) \\ 0 &{} (i\ne j). \end{array}\right. } \end{aligned}$$
(34)

For both metrics, larger values indicate higher utility.
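A minimal sketch of the two utility metrics in Eqs. 32 and 33 follows; NumPy is assumed and the toy inputs are illustrative.

```python
import numpy as np

def utility_numerical(v_true, v_anon, delta):
    """U_n from Eq. 32: one minus the mean absolute error normalized by Delta."""
    v_true, v_anon = np.asarray(v_true, float), np.asarray(v_anon, float)
    return float(np.mean(1.0 - np.abs(v_true - v_anon) / delta))

def utility_categorical(v_true, v_anon):
    """U_c from Eq. 33: the fraction of records whose category ID is unchanged."""
    return float(np.mean(np.asarray(v_true) == np.asarray(v_anon)))

print(utility_numerical([60, 80, 72], [64, 71, 73], delta=100.0))
print(utility_categorical([0, 1, 2, 1], [0, 1, 1, 1]))
```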

Some methods can estimate statistical values (e.g., averages) or generate cross-tabulations of the collected data. If the goal is to generate cross-tabulations, then a total loss that compares the true cross-tabulation with the generated cross-tabulation should be used. However, in this section, the focus is mainly on the data of a single individual; that is, the aim is not to perform a statistical analysis but to use the attribute value of each person, because IoT-related services such as health monitoring, context-aware recommender systems, and navigation described in Sect. 3.1 need to analyze an individual’s attribute value.

Fig. 5
Reduction rate of the proposed mechanism with respect to noise addition counts and amount of Laplace noise (the results are for \(\Delta = 10\); results for \(\Delta =100\) and \(\Delta =1000\) are almost the same)

3.5.2 Numerical Value Results

\(\Delta \) is set within the range of 10–1,000, \(\epsilon \) within the range of 1–10, and \(\sigma \) within the range from 1/40 of the value of \(\Delta \) to 1/2 of the value of \(\Delta \). We evaluated the number of times the proposed mechanism skipped the addition of Laplace noise to a measured value, as well as the mechanism’s ability to reduce the average amount of Laplace noise added. The results for \(\Delta =10\) are shown in Fig. 5, along with the computed results of Eqs. 27, 28, and 29. The results for \(\Delta =100\) and \(\Delta =1000\) are nearly identical to those in Fig. 5 and are therefore not shown.

The computed results based on Eqs. 27, 28, and 29 align closely with the simulation results for all parameter settings. The proposed mechanism reduced the frequency of Laplace noise addition and the corresponding average Laplace noise. Large values of \(\sigma \) or \(\epsilon \) result in a significant reduction rate. A high value of \(\sigma \) indicates that substantial sensing noise has already been added to the true value, whereas a high value of \(\epsilon \) signifies a low privacy protection level, meaning that a large amount of noise is not necessary. Consequently, the proposed mechanism reduces additional Laplace noise, particularly when the values of \(\sigma \) and \(\epsilon \) are large. According to Eq. 8, the need to add noise cannot be entirely avoided. However, Fig. 5 shows that the noise skipping ratio approaches one.

We evaluated \(U_n\) using Eq. 32 with the same values for \(\Delta \) and \(\epsilon \) as above (Fig. 6). When \(\sigma \) is large, the sensing error alone results in a low \(U_n\) (i.e., a high total loss) even if no privacy protection mechanism is used, so the difference between the proposed mechanism and the baseline approach is small; the difference is also small when \(\sigma \) is small, because there is little sensing noise for the proposed mechanism to exploit. However, when \(\sigma \) is set to a medium value, the proposed mechanism improves \(U_n\) (i.e., reduces the total loss) by 25%–40% compared with the baseline approach. When \(\epsilon \) is set to one, the difference between the proposed mechanism and the baseline approach is small. However, when \(\epsilon \) equals one, the average absolute value of the Laplace noise to be added is about 50 when \(\Delta =100\); this amount of noise is quite large, so in typical cases the value of \(\epsilon \) should be larger.

Fig. 6
\(U_n\) results (the results are for \(\Delta = 10\); results for \(\Delta =100\) and \(\Delta =1000\) are almost the same)

We determined the actual ratio of probability density function values whose distance is \(\Delta \) by conducting simulations. The true values to be protected were set to \(-\Delta /2\) and \(\Delta /2\). Noise from the normal distribution was randomly added to the true values independently. The noise-added values were anonymized using the proposed mechanism and the baseline approach, respectively. Histograms with 200 bins were created for the range \(-3 \Delta \) to \(3 \Delta \). This simulation was repeated \(2^{31}\) times. In Fig. 7, we present an example of the average result with \(\epsilon =2\), \(\Delta =100\), and \(\sigma = 25\). The ratio of the probability density function values of the normal distribution and the Laplace distribution, along with \(\exp (\epsilon )\) and \(1/\exp (\epsilon )\) functions, are also shown as a reference. The results for both the proposed and the baseline approaches lie within the range from \(\exp (\epsilon )\) to \(1/\exp (\epsilon )\). Consequently, we conclude that both mechanisms (for Scenarios I and II) achieve TDP. The ratio of the probability density function values of the Laplace distribution is the same as \(\exp (\epsilon )\) and \(1/\exp (\epsilon )\) in the range where \(x<-\Delta /2\) and \(\Delta /2 < x\); thus, the Laplace mechanism is optimal if the measured values have no error. As for the proposed mechanism, the ratio of the probability density function values approaches \(\exp (\epsilon )\) and \(1/\exp (\epsilon )\) at approximately \(x=-\Delta /2\) and \(x=\Delta /2\). However, this ratio deviates slightly from \(\exp (\epsilon )\) and \(1/\exp (\epsilon )\) at \(x=-40\) and \(x=40\). In contrast, the baseline approach’s ratio of the probability density function values reaches \(\exp (\epsilon )\) and \(1/\exp (\epsilon )\) at about \(x=-30\) and \(x=30\). It is important to note that the probability density function values are large when x is near zero; therefore, high utility can be achieved if the ratio is close to \(\exp (\epsilon )\) and \(1/\exp (\epsilon )\) when x is near zero. Hence, the proposed mechanism attains high utility (i.e., low total loss) relative to the baseline approach.

Fig. 7
Example of simulation results: the ratio of probability distributions for numerical attributes (\(\epsilon =2, \Delta =100, \sigma =25\))

We conducted additional simulations with different parameter settings. As a result, we confirmed that the ratio of the probability density function values of the proposed mechanism lies within the range from \(\exp (\epsilon )\) to \(1/\exp (\epsilon )\), except for those results that show considerable variation due to an insufficient number of samples in each bin.

3.5.3 Categorical Value Results

The value of \(\epsilon \) was set in the range 1–10, the value of M in the range 5–100, and the value of \(\tau \) in the range 0.3–0.9. The true category ID was set to a random integer, and the measured category ID was changed to another randomly chosen ID with probability \(1-\tau \). Then, the category ID was randomized using the baseline mechanism and using the proposed mechanism. This simulation was repeated \(2^{31}\) times. The results for \(\epsilon \) equal to one are shown in Fig. 8, along with the computed results calculated using Eqs. 30 and 31. A close agreement can be observed between the simulated and computed results.

Fig. 8
\(U_c\) results (panels plot \(U_c\) versus M for \(\tau =0.6\) and \(U_c\) for \(M=50\), with \(\epsilon =1, 4, 10\))

The values of \(U_c\) obtained using the proposed method are larger than or equal to those obtained using the baseline approach for all parameter settings. When M is large or \(\epsilon \) is small, the values of \(U_c\) are small for both mechanisms, since it is difficult to maintain high accuracy in such cases. However, in other cases, the proposed mechanism reduces the total loss more than does the baseline approach, especially when \(\epsilon \) is small, i.e., when the privacy protection level is high. When \(\epsilon \) is large, the experimental results of the proposed method are similar to those of the baseline approach; the value of \(\epsilon \) is large enough that the noise added to achieve differential privacy is very small, which is why there was no difference in accuracy between the methods in such cases. Therefore, experiments with small values of \(\epsilon \) are the more important ones.

3.5.4 Real Dataset Results

Simulations were conducted using a real dataset called the Adult dataset [10], which is a widely used benchmark in research for privacy-preserving data mining. This dataset consists of six numerical attributes and nine categorical attributes, and it has 30,162 records when unknown values are excluded.

We assumed that each value in the Adult dataset was true. We also assumed that IoT devices estimated age, sex, race, and native country using estimation methods [21]. For numerical attributes, \(\sigma \) was set to 0.1 of the value of \(\Delta \), and \(\epsilon \) was set to 8. For categorical attributes, \(\tau \) was set to 0.6, and \(\epsilon \) was set to 2.

The simulation results are presented in Table 3. The names of the attributes, along with the values of \(\Delta \) and M, are also shown. The proposed mechanism was able to increase \(U_n\) to approximately 92% from approximately 85% for all numerical attributes and to increase \(U_c\) by a maximum of 20% for the categorical attributes relative to the baseline approach. These results demonstrate that the proposed mechanism enhances utility (i.e., reduces total loss) for real datasets.

Table 3 Adult dataset results [10]

Lastly, simulations were conducted using other real datasets with the same parameter settings as above.

A dataset of activities based on multisensor data fusion (AReM dataset) [45] was used for numerical attributes. This dataset consists of 42,239 instances of six numerical attributes.

Datasets containing daily living activities as recognized by binary sensors (ADL dataset) [43], the activities of healthy older people using non-battery wearable sensors (RFID dataset) [68], and the localization of people’s activity (Localization dataset) [26] were used for the categorical attributes. The numbers of instances in these datasets are 741, 75,128, and 164,860, respectively.

The simulation results are displayed in Table 4. These results show that the proposed mechanism outperforms the baseline approach on all datasets used in this study.

Table 4 Results for four real datasets

3.6 Related Research Work

A considerable amount of research has been conducted on anonymized data collection. Wang et al. [72] introduced a method for identifying the top-k most frequently used new terms by gathering term usage data from individuals under the constraint of differential privacy. Kim et al. [30] derived population statistics by collecting differentially private indoor positioning data. Encryption-based approaches for anonymized data collection have also been explored [36]. These methods primarily focus on obtaining aggregate values and are not intended to acquire individual values. Furthermore, they do not account for error in the collected values. In contrast, the proposed scenario seeks to obtain each person’s value as accurately as possible, as services like recommender systems require individual attribute values.

Abul et al. [2] and Sei et al. [53] put forth location anonymization methods that consider location error and achieve k-anonymity [41, 42, 58, 65], which is a fundamental privacy metric. However, these methods are not applicable to \(\epsilon \)-differential privacy.

Ge et al. [15] and Krishnan et al. [32] proposed techniques for privately cleaning “dirty data”. By employing differential privacy as a privacy metric, they focused on data cleaning to resolve inconsistencies in large databases containing the true data for multiple individuals. They assumed that each database value was accurate, and they utilized the Laplace mechanism without considering the potential error in the values.

Several studies have suggested the use of machine learning methods, such as deep neural networks (deep learning), to process IoT sensing values with differential privacy. Shi et al. [62] proposed a reinforcement technique for transportation network companies that use passenger data. Xu et al. [79] concentrated on mobile data analysis in edge computing, and Guan et al. [19] applied machine learning to the Internet of Medical Things. Although these studies employed differential privacy as a privacy metric, they did not consider the proposed true-value-based differential privacy (TDP). It is posited that the application of TDP could enhance the accuracy of these methods while preserving the desired levels of privacy protection.

4 Missing Values Under LDP

4.1 Introduction

To achieve smart cities, as previously mentioned, it is essential to collect and analyze vast amounts of personal data while ensuring the protection of privacy. Even with anonymization, acquiring a large amount of sensitive personal information is difficult. Moreover, this information may have missing values, as individuals are more likely to provide some of their confidential information than all of it (Fig. 9).

In this section, we propose a method for privacy-preserving data collection that accounts for a large number of missing values. The personal data to be collected are anonymized on each person’s device and/or on computers at authorized entities, and are then sent to a data collection server. Each person can select which data to share or not to share. The data collection server creates a generative model and contingency table suitable for multi-attribute analysis based on the expectation-maximization and Gaussian copula methods.

Our insight is that if the value distributions of single attributes and attribute pairs can be restored, the error in each attribute can be limited even when there are many missing values. A copula enables data generation when certain information (such as the correlation or mutual information) is available for each pair of attributes. We therefore combine the features of the copula with those of data recovery under differential privacy. To our knowledge, this idea is novel in privacy-preserving data collection.

Fig. 9
Flow diagram of the proposed scenario: tables containing age, body temperature, location, and antibody status (against COVID-19) are collected under LDP; the server then constructs a generative model and estimates a contingency table (a 3-D surface plot of the number of people over attribute values).

Example application of missing values

4.2 Proposed Method

We leverage differential privacy to anonymize patients’ personal data on the client side. The server collects the anonymized data and reconstructs the distribution of each attribute, as well as the distribution of every two-attribute combination. From the two-attribute distributions, the mutual information of all attribute pairs is computed. Subsequently, a Gaussian copula [16, 51] is employed to calculate the generative model of the personal data from the mutual information. Because the proposed method needs only pairwise information about the attributes, it is robust to missing values. To visualize the generative model, we construct a contingency table using the generative model and the distribution of each attribute. The notation employed in this study is listed in Table 5.

Table 5 Notation

In the proposed approach, the server constructs a copula model to analyze the collected differentially private data while mitigating the noise introduced by the differentially private technique. As detailed in Sect. 4.2.2, the construction of a copula model requires the value distribution of each attribute and the mutual information of all attributes. Therefore, the proposed method initially estimates the single-attribute distributions (Sect. 4.2.2) before estimating the attribute-pair distributions (Sect. 4.2.2). The generation of the copula model is described in Sect. 4.2.2. The copula model can generate an arbitrary number of data samples that do not have missing values. From these data samples, a contingency table is constructed (Sect. 4.2.2).

4.2.1 Anonymization on the Client Side

Let \(s_{ij}\) represent the value of attribute \(A_j\) of patient i. The number of attributes is g; that is, patient i has attribute values \(s_{i1}, \ldots , s_{ig}\). Some values of \(s_{ij}\) may be missing. Let \(f_j\) be the number of categories of \(A_j\).

We anonymize each non-missing value \(s_{ij}\). Let \(V_j\) represent the domain of \(A_j\) and let \(V_{jk}\) represent the kth value of \(V_j\). For example, assume that \(A_1\) represents the attribute of a disease {COVID-19, flu, cancer}. In this case, \(f_1=3\) and \(V_{11}, V_{12},\) and \(V_{13}\) are COVID-19, flu, and cancer, respectively.

Based on a previous method [52], we create a value set \(R_{ij}\) for each attribute \(A_j\) as follows:

$$\begin{aligned} R_{ij} = {\left\{ \begin{array}{ll} \{s_{ij}\} \cup Ran(V_j\backslash \{s_{ij}\}, h_j-1) &{} \text {with prob. } p_j \\ Ran(V_j\backslash \{s_{ij}\}, h_j) &{} \text {otherwise}, \end{array}\right. } \end{aligned}$$
(35)

where \(Ran(S, h)\) represents a function that randomly selects h elements without duplication from set S. For example, assume that \(S=\{A, B, C\}\) and \(h=2\). In this case, \(Ran(S, h)\) outputs \(\{A,B\}\), \(\{B,C\}\), or \(\{A,C\}\). To satisfy \(\epsilon \)-differential privacy, the parameters \(h_j\) and \(p_j\) are respectively determined as

$$\begin{aligned} \begin{aligned} h_j &= \max {\left( \left\lfloor \frac{f_j}{1+e^\epsilon } \right\rfloor ,1\right) } \,\,\,\,\text {and}\\ p_j &= \frac{e^\epsilon h_j}{f_j-h_j+e^\epsilon h_j} \end{aligned} \end{aligned}$$
(36)

following [52]. As there are g attributes in our scenario, each \(R_{ij}\) should satisfy \(\epsilon /g\)-differential privacy [27].

Algorithm 3 is the anonymization algorithm on the client side.

The privacy budget allocated to each attribute is \(\epsilon /g\). Even if all the attributes are the same, i.e., the correlations between the attributes are all 1, we satisfy \(\epsilon \)-differential privacy due to the composition property of differential privacy [27].

Algorithm 3 is a six-line algorithm for the anonymization of patient i’s data. The inputs are the privacy parameter \(\epsilon \), the original data \(s_{i1}, \ldots , s_{ig}\), and each domain \(V_j\); the output is the anonymized version of \(s_{i1}, \ldots , s_{ig}\). The main loop applies Eqs. (36) and (35) to each attribute.
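
To make the client-side step concrete, the following Python sketch implements the randomization of Eqs. (35) and (36) with the per-attribute budget split \(\epsilon /g\) described above; the function name, the treatment of missing values as simply unreported, and other identifiers are illustrative assumptions rather than part of the original algorithm.

```python
import math
import random

def anonymize_record(values, domains, epsilon):
    """Sketch of client-side anonymization (Algorithm 3, Eqs. (35)-(36)).

    values:  true attribute values s_i1..s_ig; None marks a missing value
    domains: category lists V_1..V_g
    epsilon: total privacy budget; each attribute receives epsilon / g
    """
    g = len(domains)
    eps_j = epsilon / g                                   # per-attribute budget
    randomized = []
    for s_ij, V_j in zip(values, domains):
        if s_ij is None:                                  # missing values are not reported
            randomized.append(None)
            continue
        f_j = len(V_j)
        h_j = max(int(f_j / (1 + math.exp(eps_j))), 1)    # Eq. (36)
        p_j = math.exp(eps_j) * h_j / (f_j - h_j + math.exp(eps_j) * h_j)
        others = [v for v in V_j if v != s_ij]            # V_j \ {s_ij}
        if random.random() < p_j:
            # keep the true value and add h_j - 1 decoys (Eq. (35), first case)
            R_ij = {s_ij} | set(random.sample(others, h_j - 1))
        else:
            # send h_j decoys only (Eq. (35), second case)
            R_ij = set(random.sample(others, h_j))
        randomized.append(R_ij)
    return randomized
```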

4.2.2 Estimation on the Server Side

The data collection server first estimates the value distribution of each attribute as described in Sect. 4.2.2. It then estimates the value distribution of each attribute pair as described in Sect. 4.2.2. Using these estimated value distributions, the server creates a generative model (a Gaussian copula; see Sect. 4.2.2). Finally, it generates n complete data records and creates a contingency table of target attributes, which is specified by a data analyzer (Sect. 4.2.2).

Separated estimation: estimation of a value distribution for each attribute

For attribute \(A_j\), each client sends its true value together with \((h_j-1)\) randomly selected other values with probability \(p_j\), and sends \(h_j\) randomly selected other values with probability \((1-p_j)\), as represented in Algorithm 3. As a result, the probability that the true value is sent is \(p_j\), and the probability that a specific other value is sent is

$$\begin{aligned} q_j=\frac{p_j (h_j-1)}{f_j-1}+\frac{(1-p_j ) h_j}{f_j-1}=\frac{h_j-p_j}{f_j-1}, \end{aligned}$$
(37)

for attribute \(A_j\). Note that, because each client sends a total of \(h_j\) values, \(p_j+(f_j-1)q_j=h_j\).

Let \(w_{jk}\) represent the number of occurrences of \(V_{jk}\) in \(\{\boldsymbol{R_1}, \ldots , \boldsymbol{R_n}\}\), and let \(u_{jk}\) represent the true number of occurrences of \(V_{jk}\). Thus, we have the following equation:

$$\begin{aligned} \begin{pmatrix} w_{j1}\\ w_{j2}\\ \vdots \\ w_{jf_j} \end{pmatrix} =M \begin{pmatrix} u_{j1}\\ u_{j2}\\ \vdots \\ u_{jf_j} \end{pmatrix}, \end{aligned}$$
(38)

where M is the matrix in which the diagonal elements are \(p_j\), and the other elements are \(q_j\). The symbol \(z_{jk}\) represents the estimated number of occurrences of \(V_{jk}\). We can easily estimate these values by calculating the following equation:

$$\begin{aligned} \begin{pmatrix} z_{j1}\\ z_{j2}\\ \vdots \\ z_{jf_j} \end{pmatrix} =M^{-1} \begin{pmatrix} w_{j1}\\ w_{j2}\\ \vdots \\ w_{jf_j} \end{pmatrix}, \end{aligned}$$
(39)

where \(M^{-1}\) represents the inverse matrix of M. However, the estimation accuracy of this approach is very low [25]. Moreover, calculating the inverse matrix requires significant computational time, particularly for a large matrix. To overcome these limitations, we use an expectation–maximization (EM)-based algorithm. If we knew the values of \(u_{jk}\), we could calculate the expected value of each \(w_{jk}\); in our problem setting, we know the actual values of \(w_{jk}\) but not \(u_{jk}\). Therefore, treating \(u_{jk}\) as unobserved latent variables, the EM-based algorithm provides maximum a posteriori estimation: it finds the latent variables that best explain the observed values. Moreover, the EM-based algorithm guarantees that the likelihood does not decrease with each iteration [33, 78].

The symbol \(\widetilde{n_j}\) represents the number of records that contain a value for attribute \(A_j\):

$$\begin{aligned} \widetilde{n_j} = \sum _{k=1}^{f_j} w_{jk}. \end{aligned}$$
(40)

Let \(z_{jk}\) represent the estimated number of occurrences of \(V_{jk}\) in \(A_j\). From the expectation–maximization-based algorithm [52], we obtain \(z_{jk}\) by repeating the following substitution:

$$\begin{aligned} z_{jk} \leftarrow z_{jk} (p_j \mathcal {D}_k + q_j (\mathcal {E}-\mathcal {D}_k)), \end{aligned}$$
(41)

where

$$\begin{aligned} q_j = \frac{h_j-p_j}{f_j-1}, \end{aligned}$$
(42)
$$\begin{aligned} \mathcal {D}_k = \frac{w_{jk}}{p_j z_{jk}+q_j(h_j\widetilde{n_j}-z_{jk})}, \end{aligned}$$
(43)

and

$$\begin{aligned} \mathcal {E} = \sum _{k=1}^{f_j} \mathcal {D}_k. \end{aligned}$$
(44)
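
As an illustration of the update rule, the following Python sketch iterates Eqs. (41)–(44) for a single attribute; the uniform initialization of \(z_{jk}\) and the fixed number of iterations are assumptions made here for brevity and are not specified in the description above.

```python
def estimate_single_attribute(w, p_j, h_j, n_tilde, iters=100):
    """Sketch of the EM-style update in Eqs. (41)-(44) for one attribute A_j.

    w:       observed counts w_j1..w_jf (occurrences of each category in the reports)
    p_j,h_j: randomization parameters from Eq. (36)
    n_tilde: number of records that contain a value for A_j (Eq. (40))
    Returns the estimated true counts z_j1..z_jf.
    """
    f_j = len(w)
    q_j = (h_j - p_j) / (f_j - 1)                                 # Eq. (42)
    z = [n_tilde / f_j] * f_j                                     # assumed initial guess
    for _ in range(iters):
        D = [w[k] / (p_j * z[k] + q_j * (h_j * n_tilde - z[k]))   # Eq. (43)
             for k in range(f_j)]
        E = sum(D)                                                # Eq. (44)
        z = [z[k] * (p_j * D[k] + q_j * (E - D[k]))               # Eq. (41)
             for k in range(f_j)]
    return z
```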

Separated estimation: estimation of a value distribution for every two-attribute combination

Let \(V_{jj'}\) be a combination of the elements of attributes \(A_{j}\) and \(A_{j'}\):

$$\begin{aligned} V_{jj'} = V_j \times V_{j'}. \end{aligned}$$
(45)

Let \(w_{jj'kk'}\) represent the number of simultaneous occurrences of \(V_{jk}\) and \(V_{j'k'}\) in each record in \(\{\boldsymbol{R_1}, \ldots , \boldsymbol{R_n}\}\). The symbol \(\widetilde{n_{jj'}}\) represents the number of records in which a value exists for both attributes \(A_j\) and \(A_{j'}\):

$$\begin{aligned} \widetilde{n_{jj'}} = \sum _{k=1}^{f_j} \sum _{k'=1}^{f_{j'}} w_{jj'kk'}. \end{aligned}$$
(46)

As an example, assume that Table 6 was created by the privacy-preserving data collection. The values of \(\widetilde{n_1}\), \(\widetilde{n_2}\), and \(\widetilde{n_3}\) are 4, 2, and 3, respectively, because attribute \(A_1\) has four values, attribute \(A_2\) has two values, and attribute \(A_3\) has three values. The value of \(\widetilde{n_{1,2}}\) is 2 because two records (the first and fourth records) contain values for both \(A_1\) and \(A_2\) (the values are [39, 40, 58, 35.2, 35.5] and [33, 34, 88, 37.5, 37.6]). Similarly, the values of \(\widetilde{n_{1,3}}\) and \(\widetilde{n_{2,3}}\) are 3 and 1, respectively.

Table 6 Example table created by privacy-preserving data collection

As in Sect. 4.2.2, we estimate the occurrence of each combination \(V_{jk}\) and \(V_{j'k'}\) of attributes \(A_j\) and \(A_{j'}\) for n patients. By calculating these values for all combinations \(A_j\) and \(A_{j'}\), we can estimate all value distributions of all attribute pairs.

After estimating the attribute-pair distribution, the mutual information of attributes j and \(j'\) is calculated as follows:

$$\begin{aligned} \sum _{k \in V_j}\sum _{k' \in V_{j'}} p(k,k') \log \frac{p(k,k')}{p(k)p(k')}, \end{aligned}$$
(47)

where \(p(k,k')\) represents the joint probability that \(V_{jk}\) and \(V_{j'k'}\) occur, and p(k) represents the probability that \(V_{jk}\) occurs.
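
A short sketch of Eq. (47) may be helpful; here `pair_counts` is assumed to hold the estimated co-occurrence counts of \(V_{jk}\) and \(V_{j'k'}\) obtained by the pairwise estimation above, and zero-probability cells are simply skipped because they contribute nothing to the sum.

```python
import math

def mutual_information(pair_counts):
    """Sketch of Eq. (47): mutual information from an estimated pair distribution.

    pair_counts[k][k'] is the estimated number of co-occurrences of V_jk and V_j'k'.
    """
    total = sum(sum(row) for row in pair_counts)
    p_row = [sum(row) / total for row in pair_counts]                 # p(k)
    p_col = [sum(row[kp] for row in pair_counts) / total
             for kp in range(len(pair_counts[0]))]                    # p(k')
    mi = 0.0
    for k, row in enumerate(pair_counts):
        for kp, c in enumerate(row):
            if c > 0:
                p_kk = c / total                                      # p(k, k')
                mi += p_kk * math.log(p_kk / (p_row[k] * p_col[kp]))
    return mi
```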

Generative model construction: constructing a generative model as the Gaussian copula

Let \(X_1, \ldots , X_g\) be random variables, and let \(F(x_1, \ldots , x_g)\) represent the joint probability distribution function of \(X_1, \ldots , X_g\). The marginal distribution functions \(F_1, \ldots , F_g\) and the joint probability distribution function have the following relationship.

Theorem 4 (Sklar’s Theorem [63])

A unique function C satisfies the following expression:

$$\begin{aligned} \begin{aligned} F(x_1, \ldots , x_g) &= Pr(X_1\le x_1, \ldots , X_g \le x_g)\\ & = C(F_1(x_1), \ldots , F_g(x_g)). \end{aligned} \end{aligned}$$
(48)

From Sklar’s theorem, we have:

$$\begin{aligned} C(u_1, \ldots , u_g) = F(F_1^{-1}(u_1), \ldots , F_g^{-1}(u_g)), \end{aligned}$$
(49)

for an arbitrary \(\boldsymbol{u} = (u_1, \ldots , u_g)\), where \(u_i \in [0,1]\). Based on Sklar’s theorem, we also have:

$$\begin{aligned} \begin{aligned} \Phi _g(x_1, \ldots , x_g; \Sigma ) &= Pr(X_1 \le x_1, \ldots , X_g\le x_g)\\ &=C(\Phi (x_1), \ldots , \Phi (x_g)), \end{aligned} \end{aligned}$$
(50)

where \(\Phi (\cdot )\) represents the cumulative distribution function of a standard Gaussian distribution, and \(\Phi _g(x_1, \ldots , x_g; \Sigma )\) represents the cumulative distribution function of a g-dimensional Gaussian distribution with random variables \(X_1, \ldots , X_g\) and a covariance matrix \(\Sigma \).

From (50), the cumulative distribution of the Gaussian copula can be expressed as

$$\begin{aligned} C(u_1, \ldots , u_g) = \Phi _g(\Phi ^{-1}(u_1), \ldots , \Phi ^{-1}(u_g); \Sigma ). \end{aligned}$$
(51)

The Gaussian copula C is a joint cumulative distribution function whose marginal distributions are uniform on the range [0, 1]. The probability density function of the Gaussian copula \(c(u_1, \ldots , u_g; \Sigma )\) satisfies the following relationship:

$$\begin{aligned} \begin{aligned} \phi (x_1, \ldots , x_g) = c(\Phi (x_1), \ldots , \Phi (x_g)) \prod _{j=1}^g \phi (x_j), \end{aligned} \end{aligned}$$
(52)

where \(\phi (\cdot )\) on the right-hand side represents the probability density function of a standard Gaussian distribution, and the left-hand side \(\phi (x_1, \ldots , x_g)\) is the density of the g-dimensional Gaussian distribution, i.e.,

$$\begin{aligned} \phi (x_1, \ldots , x_g) = \frac{1}{\sqrt{(2\pi )^g |\Sigma |}} \exp (-\frac{1}{2}x^T \Sigma ^{-1} x). \end{aligned}$$
(53)

Therefore, we have

$$\begin{aligned} c(u_1, \ldots , u_g) = \frac{1}{\sqrt{|\Sigma |}}\exp {(-\frac{1}{2} \boldsymbol{\omega }^T (\Sigma ^{-1}-\boldsymbol{I})\boldsymbol{\omega })}, \end{aligned}$$
(54)

where \(\boldsymbol{\omega } = \Phi ^{-1}(\boldsymbol{u})\).

\(\Sigma \) must be estimated from the collected data. Let \(\boldsymbol{u}^i\) and \(\boldsymbol{\omega }^i\) represent the ith \(\boldsymbol{u}\) and ith \(\boldsymbol{\omega }\), respectively. Then, from (54), the log-likelihood function of the Gaussian copula is given by

$$\begin{aligned} l(\Sigma ) = -\frac{n}{2} \ln |\Sigma | -\frac{1}{2} \sum _{i=1}^n \boldsymbol{\omega }^{i^T}(\Sigma ^{-1}-\boldsymbol{I})\boldsymbol{\omega }^i, \end{aligned}$$
(55)

where \(\boldsymbol{\omega }^i = \Phi ^{-1}(\boldsymbol{u}^i)\). Differentiating (55) with respect to \(\Sigma ^{-1}\), we obtain [69]

$$\begin{aligned} \frac{\partial l(\Sigma )}{\partial \Sigma ^{-1}} = \frac{n}{2} \Sigma -\frac{1}{2} \sum _{i=1}^n \boldsymbol{\omega }^i\boldsymbol{\omega }^{i^T}. \end{aligned}$$
(56)

Therefore, the maximum likelihood estimator \(\widehat{\Sigma }\) is

$$\begin{aligned} \widehat{\Sigma } = \frac{1}{n} \sum _{i=1}^n \boldsymbol{\omega }^i\boldsymbol{\omega }^{i^T}. \end{aligned}$$
(57)

To alleviate the high computational cost of (57), we estimate \(\Sigma \) using a suboptimal approach [51]. First, we calculate the mutual information of every pair of attributes using the reconstructed data in Sect. 4.2.2. We then determine each suboptimal element of \(\Sigma \) that minimizes the distance between the mutual information of the estimated joint distribution and that calculated from the reconstructed data (see Sect. 4.2.2).

Contingency table construction: generation of records based on the generative model

We generate n complete data records from the Gaussian copula C and the reconstructed data in Sect. 4.2.2. The n values of each attribute \(A_j\) are determined based on the estimated attribute distribution in Sect. 4.2.2. For each record, we generate random values \(\bar{x}_1, \ldots , \bar{x}_g\) from a g-dimensional Gaussian distribution with covariance matrix \(\widehat{\Sigma }\) and obtain \(u_j = \Phi (\bar{x}_j)\) for all \(j=1,\ldots ,g\). From the reconstructed data in Sect. 4.2.2, we finally obtain \(F_j^{-1}(u_j)\) as the value of each attribute, where \(F_j\) represents the marginal distribution of attribute \(A_j\).

Contingency table construction: counting each combination of target attributes

After the above process, we obtain n complete data records with g attributes. A contingency table over many attributes loses its primary value [18, 77]; therefore, data analyzers generally select only a few target attributes. The target contingency table is then constructed by simply counting the occurrences of each combination of attribute values in the n generated complete records.
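
The following Python sketch illustrates these last two steps under stated assumptions: `sigma_hat` is the estimated covariance matrix, `marginals[j]` holds the estimated marginal distribution of attribute \(A_j\) as (category, probability) pairs, and `target_attrs` lists the attribute indices selected by the data analyzer; all identifiers are illustrative.

```python
import math
from collections import Counter
import numpy as np

def generate_contingency_table(sigma_hat, marginals, target_attrs, n):
    """Sketch of record generation from the Gaussian copula and contingency-table counting."""
    g = len(marginals)
    phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))   # standard normal CDF
    # Draw n latent Gaussian vectors with covariance Sigma_hat.
    x = np.random.multivariate_normal(np.zeros(g), sigma_hat, size=n)
    records = []
    for row in x:
        record = []
        for j in range(g):
            u = phi(row[j])                       # u_j = Phi(x_j), uniform on [0, 1]
            # F_j^{-1}(u_j): walk the estimated marginal CDF of attribute A_j.
            cum = 0.0
            for category, prob in marginals[j]:
                cum += prob
                if u <= cum:
                    record.append(category)
                    break
            else:
                record.append(marginals[j][-1][0])
        records.append(tuple(record))
    # Contingency table over the analyst-selected target attributes.
    return Counter(tuple(r[j] for j in target_attrs) for r in records)
```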

4.3 Evaluation

4.3.1 Evaluation Setting

We compared the performances of the proposed method and four state-of-the-art methods: O-RAPPOR [25], S2Mb [52], MDN [17], and PDE/ETE (the baseline approach).

The experimental results for the simple combination of the differentially private technique on the client side and the copula technique on the server side are also shown. This method is referred to as DF+Copula.

A method was considered to generate a good contingency table if its estimated table was close to the table computed from the true data, which were unknown to the data collection server.

In this study, a contingency table is treated as a probability distribution over attribute values. To measure the difference between probability distributions, we applied the Jensen–Shannon (JS) divergence rather than the usual Kullback–Leibler (KL) divergence, because the KL divergence requires the second distribution to be non-zero wherever the first distribution is non-zero; if this condition is violated, the KL divergence diverges due to a division by zero. The JS divergence is based on the KL divergence but does not impose this non-zero constraint.
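
For reference, a minimal sketch of the JS divergence between two contingency tables (flattened into probability vectors of equal length) is shown below; the mixture distribution m is what makes zero cells safe to handle.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two probability vectors of equal length."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a_i = 0 contribute nothing; b_i > 0 whenever a_i > 0 because b is the mixture.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```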

In the Apple implementation, \(\epsilon \) equals 1 or 2 per datum [66]. In the evaluations by the Apple differential privacy team, \(\epsilon \) was set to 2, 4, and 8 [3]. Microsoft described a differentially private framework in which \(\epsilon \) was set between 0.1 and 10 [9]. In the paper that proposed RAPPOR [13], which was developed by Google, \(\epsilon = \log (3)\) was used as the main setting. Hsu et al. [22] showed that, in the literature, \(\epsilon \) ranges from 0.01 to 10. Based on the settings reported in the literature, we set the value of \(\epsilon \) between 0.01 and 10.

We varied the missing value rate m from 0.3 to 0.8, and we varied the number of attributes c in the analysis from 1 to 5. The reported results are the averages of 100 experiments for each parameter setting. For the default parameters, we set \(m=0.5\), \(c=3\), and \(\epsilon =5\).

Note that the missing value rate m is used only for the experiments, and the proposed algorithm does not require this information. The number of targeted attributes for analysis c can be freely determined by the data analyst according to the purpose of the analysis.

4.3.2 Experiments on Real Data

In the real-data experiments, we first investigated the Adult dataset [10], which is widely used in evaluations of privacy-preserving data mining techniques (for example, see [14, 24, 64]). The Adult dataset consists of 15 attributes (e.g., age, income) in 32,561 records. The number of categories in our experiments was set between 2 and 9 per attribute.

Figure 10a–c present the experimental results.

When the missing value rate was small or \(\epsilon \) was large, the JS divergence of the proposed method was similar to the JS divergences of S2Mb, PDE/ETE, and O-RAPPOR. Similarly, when \(\epsilon \) was small, the JS divergence of the proposed method was similar to the JS divergences of S2Mb, PDE/ETE, and DF+Copula.

However, at high rates of missing values, the proposed method outperformed the other methods while providing the same level of privacy protection.

To determine whether the proposed method is applicable to small datasets, we randomly sampled 10% of the 32,561 records in the Adult dataset and measured the JS divergence. Figure 10d–f present the results. Owing to the data sparsity, this estimation task was more difficult than in the other experiments, and the JS divergence of every method was higher for the 3,256 records than for the 32,561 records. Nevertheless, the proposed method was robust to the small dataset: whereas on the larger dataset its JS divergence exceeded that of the existing methods only when the missing value rate was negligible, on the smaller dataset the proposed method outperformed the other methods regardless of the missing value rate.

Fig. 10
Six line graphs of JS divergence versus m (a, d), c (b, e), and \(\epsilon \) (c, f) for MDN, PDE/ETE, O-RAPPOR, S2Mb, DF+Copula, and the proposed method, with the other parameters at their default values; panels a–c use all 32,561 records and panels d–f use the 10% sample.

Results for the Adult dataset

We then used the Communities and Crime Unnormalized dataset [1] (hereafter referred to as the Community dataset). This dataset contains 124 predictive attributes, such as the percentage of individuals aged 25 and over with a bachelor’s or higher degree, which could be considered private information in some communities.

After removing 22 attributes that had more than 80% missing values, we retained 102 attributes for analysis.

Figure 11 presents the experimental results for the Community dataset. The results are similar to those of the Adult dataset. For almost all parameter settings, the proposed method outperformed the other methods. As the number of participants n was smaller than in the previous experiments, increasing the missing value rate increased the JS divergence of the proposed method. However, the increase in JS divergence was not considerable.

Fig. 11
Three multiline graphs of JS divergence versus m (a), c (b), and \(\epsilon \) (c) for MDN, PDE/ETE, O-RAPPOR, S2Mb, DF+Copula, and the proposed method, with the other parameters at their default values.

Results for the Communities and Crime Unnormalized dataset

We next used a default dataset containing 21,985 records with the following attributes: sex, job, income, number of loans from other companies, number of delayed payments, and a default flag (0 or 1). Here, the word default means that a debtor failed to pay off a loan. The results for this dataset, which was generated from authentic default data, are plotted in Fig. 12. As shown in Fig. 12a, the proposed method accurately reconstructed the contingency tables even when the missing value ratio (m) increased to 0.8. In contrast, the accuracies of the existing methods decreased greatly as the missing value ratio increased. Increasing the number of attributes used for generating contingency tables (c) also increased the reconstruction error (Fig. 12b). However, the proposed method was more resistant to an increasing c than the other methods. Figure 12c shows the effect of \(\epsilon \) on the reconstruction error of the compared methods. When \(\epsilon \) was sufficiently large, the accuracies of all methods were very similar, but when \(\epsilon \) was small, the reconstruction error of the proposed method was clearly the lowest.

Fig. 12
Three multiline graphs of JS divergence versus m (a), c (b), and \(\epsilon \) (c) for MDN, PDE/ETE, O-RAPPOR, S2Mb, DF+Copula, and the proposed method, with the other parameters at their default values.

Results for the default dataset

Finally, we used a dataset related to the 2019 coronavirus disease (COVID-19) called Patient Medical Data for Novel Coronavirus COVID-19. Hereafter, we refer to this dataset as the COVID-19 dataset. This dataset contains 427,036 records with 23 attributes. More than 90% of the values are missing for 12 of the attributes, and approximately 27% are missing even for basic attributes such as age and sex. From the COVID-19 dataset, we extracted the Japanese medical data and analyzed the attributes that had few missing values (namely age, sex, administrative division, date of confirmation, and chronic disease status). The date of confirmation was categorized by month, and the number of categories in each attribute ranged from 2 to 29.

Figure 13 presents the results for the COVID-19 dataset. Under all parameter settings, the JS divergence was lower for the proposed method than for the other methods. As the rate of missing values in the original COVID-19 dataset was 68.7%, we concluded that the proposed method effectively processes real datasets with missing values.

Fig. 13
Three multiline graphs of JS divergence versus m (a), c (b), and \(\epsilon \) (c) for MDN, O-RAPPOR, PDE/ETE, S2Mb, DF+Copula, and the proposed method, with the other parameters at their default values.

Results for the Patient Medical Data for Novel Coronavirus COVID-19 dataset

5 Human-to-Human Interactions Under LDP

5.1 Introduction

Smart cities aim to create efficient, sustainable, and livable urban environments by leveraging technology and data. In this context, data related to human-human interactions is pivotal for several reasons. For example, understanding the nature and frequency of human interactions can offer insights into community dynamics. Such data can inform city planners and local authorities about where community hubs or gathering spots might be needed, or where interventions to boost community interaction might be beneficial. Moreover, data on how and where people meet and interact can provide valuable insights into the design and placement of public spaces, transportation nodes, and amenities. For example, if a certain public square sees frequent human interaction, it might be worth investing in better seating, shading, or even establishing transit connections to that area.

Although LDP is considered to be one of the strongest technologies for privacy protection [6, 60], organizations that deploy it, such as Apple, also apply explicit privacy policies for data collection. For example, Apple collects data on users’ emoji usage through LDP; however, it does not collect the users’ identities.

In LDP, each user is assigned a privacy budget, which is a non-negative real value. When the user data are sent to the data collector, a portion (or the entirety) of the privacy budget for the user is consumed. The total privacy budget and the consumed amount of the privacy budget can be controlled through an agreement between the data collector and user. For example, suppose that the privacy budget for user A is 10.0, and the value of the privacy budget consumed by transmitting the data of this user is 1.0; the data collector can retrieve user A’s data 10 times. To ensure continuous data collection, the total privacy budget for each user is regularly restored.

If the data collected by the data collector refer to a user’s information regardless of other users, there are no issues because the user has already agreed to the privacy policy. However, what happens if the data collected concerns a person-to-person interaction? Suppose that user A sends an email to user B, and the data collector gathers information about word usage through LDP under a privacy policy agreed upon with user A. The data collected are about the words used by user A, but for user B, the data are about the words they received. In other words, it is equivalent to collecting user B’s data. Therefore, the data collector must also consider user B’s privacy. However, whether user B has agreed to the privacy policy is not checked at present. Even if user B has agreed to the policy, no one has control over user B’s privacy budget.

Figure 14 shows the difference between the assumptions used in previous studies and those used in this study. In previous studies, when user \(u_1\) sends the LDP value \(y_1\) of their true value \(x_1\) to the data collection server, only \(u_1\)’s private information is provided to the server, because the users’ values are completely unrelated to each other. In contrast, suppose that each user’s value depends on the values of other users. In this case, when user \(u_1\) sends the LDP value \(y_1\), information about users \(u_1\), \(u_2\), and \(u_3\) is provided to the server. In other words, although \(u_2\) does not send any information to the data collection server, some of \(u_2\)’s information is disclosed through the behavior of \(u_1\).

Fig. 14
Two diagrams. Left (assumption in existing studies): users \(u_1\), \(u_2\), and \(u_3\) hold mutually unrelated values \(x_1\), \(x_2\), and \(x_3\) and send the LDP values \(y_1\), \(y_2\), and \(y_3\) to the server. Right (assumption in this study): the values \(x_1\), \(x_2\), and \(x_3\) are related to one another.

Assumptions of previous studies and this study

In this study, this problem was formalized as a person-to-person interaction in LDP. To focus the discussion in this section on the new concept of person-to-person interaction under \(\epsilon \)-LDP, we targeted the relatively simple task of obtaining average values from users. The recommendations in this section are expected to have a considerable impact on organizations that collect person-to-person interaction data using \(\epsilon \)-LDP.

5.2 Related Work and Real Applications

5.2.1 Related Work on LDP

Many methods have been proposed for estimating a histogram distribution of users’ values under \(\epsilon \)-LDP, such as the Randomized Aggregatable Privacy-Preserving Ordinal Response, Sarve, and so on [13, 71]. Although such methods achieve high accuracy, their techniques cannot be applied to a person-to-person interaction scenario. This is because they assume that each user’s value is not dependent on any other user’s value.

There have also been several methods proposed for estimating the average value of users. Xue et al. proposed \((\tau , \epsilon )\)-personalized LDP (PLDP) as a privacy metric, Duchi’s solution with PLDP (DCP), and piecewise mechanism with PLDP (PWP) [80]. The \((\tau , \epsilon )\)-PLDP is a privacy metric that weakens \(\epsilon \)-LDP, but DCP and PWP can be used for \(\epsilon \)-LDP. We can assume that the range of a value is \([-1,1]\) without loss of generality. In DCP, each user sends a randomized value v with a probability

$$\begin{aligned} Pr(\epsilon , x) = \frac{(e^\epsilon -1)\cdot x}{(e^\epsilon +1)\cdot 2} + \frac{1}{2}. \end{aligned}$$
(58)

In PWP, each user randomly selects a value from a range around the true value with probability p, where the value of p is determined from \(\epsilon \). A value from a wider range is randomly selected with probability \(1-p\), and the selected value is sent to the server. Because the ratio of \(p/(1-p)\) is \(e^\epsilon \), PWP ensures \(\epsilon \)-LDP.

Li et al. proposed the square wave mechanism (SW) [35]. This mechanism is similar to PWP, but the range of LDP values to be selected is different.

Many other LDP methods have been proposed. Navidan et al. proposed a framework that estimates the number of people in each area while protecting each user’s location privacy using LDP [40]. In this framework, users measure the Received Signal Strength Indicator (RSSI) and determine their locations based on the RSSI. The users then perturb their location information and send it to the data aggregator, who estimates how many users are in each location. The experimental results showed that the proposed framework could estimate location frequency while ensuring differential privacy.

Kim and Jang [29] proposed a data collection method for workload-aware differentially private positioning. They assumed that location is hierarchical and aimed to estimate the density at each location for each level of the hierarchy by utilizing LDP. Their method provides an optimal perturbation scheme to minimize the estimation error for a given workload.

Although many studies target one-shot data-sharing scenarios, several studies have considered cases of data streaming. Please note that our proposed method can be used for data streaming cases by dividing the privacy budget by the number of data acquisitions. By using methods for specified data stream cases, the accuracy of the data analysis can be enhanced. For example, Ren et al. [48] proposed an LDP mechanism for an infinite data stream that targets w-event privacy, which ensures LDP for arbitrary time windows consisting of w consecutive time steps. In the future, we will propose a specialized method for measuring time series data.

Ren et al. [49] proposed an anonymous data aggregation scheme that allows the server to estimate the number of users located within each value area without knowing the location of individual users. In particular, the authors focus on high-dimensional values. The domain sizes of the datasets used in the experiments in [49] were \(2^{16}\), \(2^{52}\), and \(2^{77}\). Experiments with such high-dimensional datasets should be conducted in the future to test our proposed method.

Although these studies are valuable, they do not take interactions between users into account.

In recent years, studies on federated learning with LDP have gained attention [7, 23, 82]. In a typical federated learning scenario using LDP, the server sends to the clients the machine-learning model parameters that are to be trained. Each client independently trains the machine-learning model using private local data samples. The updated gradient information is sent to the server under the protection of LDP. If each private local data sample is completely unrelated to the private local data samples of other users, \(\epsilon \)-LDP can be ensured in these studies. However, for the person-to-person interaction data envisioned in the current study, when one user sends information to a server through LDP, loss of privacy of other users must be considered as well.

The extant studies on LDP [6, 13, 73] assume that one user’s value is independent of that of any other user. In many cases, this assumption is correct. However, in some scenarios, this assumption does not hold, as discussed in Sect. 5.2.2.

Example 1

Alice transferred $50 to Bob on a single day. Alice has agreed to 10-LDP (i.e., her privacy budget is 10), which allows a data collector to gather the amount she transfers per day. Based on this policy, Alice sends the LDP value (e.g., $53) to the data collector, which consumes a privacy budget of 10. Because Alice’s identity is not sent to the data collector, the data collector only knows that someone transferred $53 on that day.

In the above example, the information that is sent concerns Alice’s money transfer. However, from Bob’s perspective, the same information concerns Bob’s receipt of money. In this case, 10 of Alice’s privacy budget and 10 of Bob’s privacy budget are consumed. Therefore, if Bob’s own transfer information is also collected, the total privacy budget consumed for Bob will be 20, which surpasses the upper limit of 10. Such problems occur in person-to-person interactions in LDP.

5.2.2 Application of LDP Under Person-to-Person Interactions

Recently, LDP has been widely applied to many real services. Apple collects pictogram usage information from users under LDP to analyze the use frequency of each pictogram [8]. However, Apple does not seem to care about the receiver’s privacy.

Several email datasets contain anonymized text information and pseudo personal, sender, and receiver IDs [34]. Such data can be collected under LDP from each user. Emails are generally considered personal data that must be handled with care, regardless of the data that are sent or received. Therefore, if the email information of a sender is collected under LDP, this collection should consume the privacy budget of not only the sender but also the receiver.

Human relationship information, such as that from online social networks, is another form of private information. There are several anonymized datasets on human relationships, such as the Epinions social network [50]. If the data collector gathers information about whom a user is connected to and trusts, the privacy budget of not only that user but also the other person is consumed.

5.3 Problem Definition

We have defined the problem of LDP for person-to-person interactions. This scenario was not assumed in previous studies, but it occurs in real-world situations. One of the most important contributions of this work is to clarify this problem. Numerous forms of person-to-person interactions are possible, but to simplify the discussion in this section, we limit our analysis to the following interactions.

Definition 4 (\(\epsilon \)-LDP in a person-to-person interaction scenario)

Let \(X_i\) represent the domain of user \(u_i\)’s data, and let \(X_{i,j}\) represent the domain of the interaction data between two users \(u_i\) and \(u_j\) \((i,j=1,\ldots ,n\,\, (i\ne j))\). The value of \(x_i \in X_i\) is obtained from \(x_{i,j} \in X_{i,j}\) for all j except for \(i=j\); i.e., \(x_i=f(x_{i,1},\ldots ,x_{i,i-1},x_{i,i+1},\ldots ,x_{i,n})\) for a function \(f:X_{i,j}^{n-1} \rightarrow X_i\).

User \(u_i\) sends information \(x_i\) under \(\epsilon \)-LDP using mechanism M, which is defined in Definition 1.

Theorem 5 (Consumed privacy budget of \(\epsilon \)-LDP for person-to-person interactions)

In a scenario of \(\epsilon \)-LDP for person-to-person interactions, the consumed privacy budget of user \(u_i\) is \(\epsilon \). The privacy budget of user \(u_j\) is also consumed, and this amount is represented by

$$\begin{aligned} \begin{aligned} &\min \epsilon _j, s.t. \,\,\,\,P(M(f(x_{i,1},\ldots ,x_{i,n}))=y) \le \\ & e^{\epsilon _j} P(M(f(\ldots ,x_{i,j-1},x_{i,j}',x_{i,j+1},\ldots ))=y), \end{aligned} \end{aligned}$$
(59)

for any \(x_{i,j}, x'_{i,j} \in X_{i,j}\).

Proof

For user \(u_i\), the consumed privacy budget is \(\epsilon \) because \(x_i\) is collected under \(\epsilon \)-LDP.

For user \(u_j\) \((j\ne i)\), the following expression should be satisfied for any \(x_{i,j}, x_{i,j}' \in X_{i,j}\) to ensure \(\epsilon _j\)-LDP because of Definition 1.

$$\begin{aligned} \begin{aligned} & P(M(f(x_{i,1},\ldots ,x_{i,n}))=y) \le \\ & e^{\epsilon _j} P(M(f(x_{i,1},\ldots ,x_{i,j-1},x_{i,j}',x_{i,j+1},\ldots ,x_{i,n}))=y). \end{aligned} \end{aligned}$$
(60)

The smaller the value of \(\epsilon _j\), the smaller the amount of privacy budget consumed and the more robustly the privacy is protected. Therefore, the consumed privacy budget is the minimum value that satisfies (60).\(\square \)

The problem definition in this section is as follows.

Problem 1 (Obtaining the average value under \(\epsilon \)-LDP in a person-to-person interaction scenario)

Assume that there are n users (\(u_1, \ldots , u_n\)) and that the privacy budget of each user \(u_i\) is \(\epsilon _i\). In a person-to-person interaction scenario, the goal is to obtain the average value of \(x_1, \ldots , x_n\) with high accuracy while ensuring \(\epsilon _i\)-LDP for each user \(u_i\).

Note that we do not propose a new privacy metric, but we strictly follow \(\epsilon \)-LDP. The difference between the objective in this section and that of previous studies is whether or not each user’s data contain information about other users, which should be protected. To simplify the discussion, the goal of this analysis is to obtain the average value of all users’ data. However, the concept of \(\epsilon \)-LDP in a person-to-person interaction scenario can be applied to any other analysis, such as histogram estimation or machine learning. Such analysis remains to be undertaken in future work.

5.4 Proposed Method

The main notation used in this study is listed in Table 7. We mainly use the Laplace mechanism; when this mechanism is used, the global sensitivity of each user’s data must be specified.

Table 7 Notation

Definition 5 (Global sensitivity for a person-to-person interaction)

For user \(u_i\), the global sensitivity is the same as that given in Definition 2. For user \(u_j\) \((j\ne i)\), the global sensitivity of f is defined as

$$\begin{aligned} \Delta f_{i,j}=\max _{x_{i,j},x_{i,j}'\in X_{i,j}} |f(\ldots ,x_{i,j},\ldots )-f(\ldots ,x_{i,j}',\ldots )|. \end{aligned}$$
(61)

Theorem 6 (Consumed privacy budget of the Laplace mechanism in a person-to-person interaction)

Suppose user \(u_i\) sends the value of \(x_i\) to the data collector under \(\epsilon _i\)-LDP using a Laplace mechanism. Let \(\Delta f_i\) represent the global sensitivity of \(x_i\) and let \(\Delta f_{i,j}\) represent the global sensitivity of \(x_{i,j}\). In this case, \(\epsilon _i\) of the privacy budget of user \(u_i\) is consumed, and \(\epsilon _i \Delta f_{i,j}/\Delta f_i\) of the privacy budget of each other user \(u_j\) involved in the interaction is consumed.

Proof

For \(x_i\), this mechanism ensures \(\epsilon _i\)-LDP according to Equation (1).

For \(x_{i,j}\), the global sensitivity is \(\Delta f_{i,j}\). The value sent to the server is represented by

$$\begin{aligned} f(x_i)+\mathcal {L}\left( \frac{\Delta f_i}{\epsilon _i}\right) =f(x_i)+\mathcal {L}\left( \frac{\Delta f_{i,j}}{\epsilon _i \Delta f_{i,j}/\Delta f_i}\right) . \end{aligned}$$
(62)

Therefore, this mechanism ensures \((\epsilon _i \Delta f_{i,j}/\Delta f_i)\)-LDP for \(x_{i,j}\).\(\square \)

Example 2

Consider that users \(u_1\), \(u_2\), and \(u_3\) are giving money to each other. The maximum amount of money given is limited to $100. Therefore, \(\Delta f_1=\Delta f_2=\Delta f_3=100\). User \(u_1\) gives $10 and $20 to \(u_2\) and \(u_3\), respectively. User \(u_2\) gives $30 to \(u_1\). User \(u_3\) gives $40 and $50 to \(u_1\) and \(u_2\), respectively. User \(u_1\) sends information about how many dollars \(u_1\) gave, on average, to the server. In this case, \(x_1=f(x_{1,2},x_{1,3})=15\), where function f is a function that calculates the average. In this example, \(\Delta f_{i,j}\)=50, because a change in \(x_{i,j}\) can affect the value of \(x_i\) by up to 50.

When user \(u_1\) sends a value to the server under 1-LDP, i.e., the result of \(15+\mathcal {L}(100/1)\) is sent to the server, this behavior consumes 1, 0.5, and 0.5 of the privacy budgets of users \(u_1\), \(u_2\), and \(u_3\), respectively.
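
The budget accounting of Theorem 6 can be reproduced with a few lines of Python; the function below is only an illustrative sketch, with the numbers taken from Example 2.

```python
def consumed_budgets(eps_i, delta_f_i, delta_f_ij, n_partners):
    """Theorem 6: when u_i reports under eps_i-LDP with global sensitivity delta_f_i,
    each interaction partner also loses eps_i * delta_f_ij / delta_f_i of their budget."""
    partner_loss = eps_i * delta_f_ij / delta_f_i
    return eps_i, [partner_loss] * n_partners

# Example 2: eps = 1, Delta f_1 = 100, Delta f_{1,j} = 50, two partners (u_2 and u_3).
own, partners = consumed_budgets(1.0, 100.0, 50.0, 2)
print(own, partners)   # 1.0 [0.5, 0.5]
```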

Thus far, we have assumed that only one user, \(u_i\), sends their LDP value to the data collection server. When several users send their LDP values, the composition of the interaction should be considered.

Theorem 7 (Interaction-composition property of LDP in person-to-person interactions)

Suppose that the private information of \(u_i (i=1,\ldots ,n)\) is collected under \(\epsilon _i\)-LDP by the data collection server. Let \(\epsilon _{j,i}\) represent the privacy budget under which user \(u_j\) sends the data that include the interaction value \(x_{j,i}\). In this case, the total privacy budget consumed for \(u_i\) is represented by

$$\begin{aligned} \widehat{\epsilon _i} = \epsilon _i+\sum _{j\ne i}\epsilon _{j,i} \Delta f_{j,i}/\Delta f_j. \end{aligned}$$
(63)

Proof

The information related to \(u_i\) is represented by

$$\begin{aligned} {\left\{ \begin{array}{ll} x_{i,1},\ldots ,x_{i,i-1},x_{i,i+1},\ldots , x_{i,n}\\ x_{1,i},\ldots ,x_{i-1,i},x_{i+1,i}, \ldots , x_{n,i}. \end{array}\right. } \end{aligned}$$
(64)

The value of \(x_i\) is calculated based on the top line of (64); that is, \(x_i=f(x_{i,1},\ldots ,x_{i,i-1},x_{i,i+1},\ldots ,x_{i,n})\). This value is sent to the server under the privacy budget \(\epsilon _i\).

Each value \(x_{j,i}\) in the lower part of (64) is sent by user \(u_j\) under the privacy budget \(\epsilon _{j,i}\) for \(x_{j,i}\). Because of the sequential composition property of differential privacy [12], the total privacy loss is calculated using (63).\(\square \)

Example 3

Consider the same case described in Example 2. The values of \(x_1\), \(x_2\), and \(x_3\) are 15, 30, and 45, respectively. Suppose that users \(u_1\), \(u_2\), and \(u_3\) send their values under 1-LDP, 2-LDP, and 3-LDP, respectively. In this case, after all three reports, the total privacy losses of \(u_1\), \(u_2\), and \(u_3\) are \(1+2/2+3/2 = 3.5\), \(2+1/2+3/2 = 4\), and \(3+1/2+2/2 = 4.5\), respectively.
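
Similarly, the interaction-composition property of Eq. (63) can be checked numerically; in this sketch, `eps[j]` is the budget under which \(u_j\) reports and `delta_f_pair[j][i]` is \(\Delta f_{j,i}\), with the values taken from Example 3.

```python
def total_privacy_loss(eps, delta_f, delta_f_pair):
    """Eq. (63): total privacy loss of each user after every user has reported."""
    n = len(eps)
    totals = []
    for i in range(n):
        loss = eps[i] + sum(eps[j] * delta_f_pair[j][i] / delta_f[j]
                            for j in range(n) if j != i)
        totals.append(loss)
    return totals

# Example 3: eps = (1, 2, 3), Delta f = 100 for everyone, Delta f_{j,i} = 50.
pair = [[50.0] * 3 for _ in range(3)]
print(total_privacy_loss([1.0, 2.0, 3.0], [100.0] * 3, pair))  # [3.5, 4.0, 4.5]
```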

Thus far, we have discussed generalized scenarios where the global sensitivity and privacy budget are different for each user. Usually, however, these values are common for all users. In this case, the following theorem holds.

Theorem 8

Consider that there are n users, each user \(u_i\) sends \(x_i\) under \(\epsilon \)-LDP, and each \(x_i\) depends on the interactions with all of the other \(n-1\) users, so that \(\Delta f_{i,j}/\Delta f_i = 1/(n-1)\). In this case, each transfer of data by \(u_i\) consumes \(\epsilon /(n-1)\) of the privacy budget of each of the other users. The total privacy loss of each user \(u_i\) is represented by \(\epsilon + \sum _{j\ne i}\epsilon /(n-1)=2\epsilon \).

In the following text, the expected amount of error of the estimated mean under \(\epsilon \)-LDP in a person-to-person interaction is discussed. We assume that the privacy budget of each user is \(\epsilon \), and that the global sensitivity for each is \(\Delta f\). Let \(\mathcal {L}(x;s)\) represent the probability density function (PDF) of the Laplace distribution with mean 0 and scale parameter s. The probability distribution of the sum of n Laplace random variables is represented by the following equation.

$$\begin{aligned} \begin{aligned} &\mathcal {L}_n(x; s) = \int _{x_1=-\infty }^\infty \cdots \int _{x_{n-1}=-\infty }^\infty \\ &\mathcal {L}(x_1; s)\cdots \mathcal {L}(x_{n-1}; s)\mathcal {L}(x - \sum _{i=1}^{n-1}x_i; s) dx_1\cdots dx_{n-1}\\ &=\frac{e^{-\frac{|x|}{s}} \displaystyle \sum _{i=0}^{n-1} a_{n,i} s^i |x|^{n-i-1}}{2\,s^n \displaystyle \prod _{i=1}^{n-1}2 i}, \end{aligned} \end{aligned}$$
(65)

where

$$\begin{aligned} a_{n,i} = {\left\{ \begin{array}{ll} 0 &{} (n=i \text {\,\,or\,\,} i=-1) \\ 1 &{} (n=1 \text {\,\,and\,\,} i=0) \\ a_{n-1, i-1} (n+i-2) + a_{n-1, i} &{} (\text {otherwise.}) \end{array}\right. } \end{aligned}$$

The resulting value represents the PDF of the summed noise. The expected absolute value of (65) is calculated as

$$\begin{aligned} E[|x| \mathcal {L}_n(x; s)] = 2 \int _{x=0}^\infty x \mathcal {L}_n(x; s)dx = s \prod _{i=1}^{n-1} \frac{2i+1}{2i}. \end{aligned}$$
(66)

The value of (66) represents the expected magnitude of the total noise compared with the true value. This quantity is then converted into the error of the statistic of interest. For example, if the server wants to calculate the final average value, the expected magnitude of error is the value of (66) divided by n. When the target mean absolute error (MAE) of the estimated average value is \(\theta \), the value of s should be

$$\begin{aligned} s = n \cdot \frac{\theta \sqrt{\pi } \Gamma (n)}{2 \Gamma (1/2+n)}. \end{aligned}$$
(67)

The expected squared error is calculated using

$$\begin{aligned} E[x^2\mathcal {L}_n(x; s)] = 2ns^2. \end{aligned}$$
(68)

If the server wants to calculate the final average value, the value of (68) is divided by \(n^2\). When the target mean squared error (MSE) of the expected average value is \(\theta '\), the value of s should be

$$\begin{aligned} s = n \sqrt{\frac{\theta '}{2n}} = \sqrt{\frac{n \theta '}{2}}. \end{aligned}$$
(69)

Algorithm 4 describes our proposed method.

Algorithm 4 is a 14-line algorithm for the collection and analysis of LDP data in a person-to-person interaction. The inputs are \(\Delta f\) and either a target MAE \(\theta \) or a target privacy budget \(\epsilon \); the output is the estimated average value. The algorithm covers the processing performed by the data collection server and by each user.

If a target MSE is desired, Line 3 in Algorithm 4 is replaced by \(s \leftarrow \sqrt{n \theta '/2}\).
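
Since the full listing of Algorithm 4 appears only in the original figure, the following Python sketch gives one plausible reading of the overall flow: the server derives the Laplace scale s either from a target MAE \(\theta \) via Eq. (67) or from a total per-user budget \(\epsilon \) via Theorem 8, each user perturbs their value locally, and the server averages the noisy reports. The identifiers and the exact interface are assumptions, not the original listing.

```python
import math
import random

def collect_average_p2p(true_values, delta_f, theta=None, epsilon=None):
    """Hedged sketch of the collection flow around Algorithm 4."""
    n = len(true_values)
    if theta is not None:
        # Eq. (67): scale that yields the target MAE of the estimated average
        # (lgamma avoids overflow of Gamma(n) for large n).
        s = n * theta * math.sqrt(math.pi) * math.exp(math.lgamma(n) - math.lgamma(0.5 + n)) / 2
    else:
        # Theorem 8: a report with scale s consumes delta_f/s of the user's own budget
        # and the same amount again through the other users' reports, so s = 2*delta_f/epsilon.
        s = 2 * delta_f / epsilon
    # Each user adds Laplace(0, s) noise locally; the difference of two
    # exponential variables with rate 1/s follows this distribution.
    noisy = [x + random.expovariate(1 / s) - random.expovariate(1 / s) for x in true_values]
    estimated_average = sum(noisy) / n
    consumed_per_user = 2 * delta_f / s      # total loss per user (Theorem 8)
    return estimated_average, consumed_per_user
```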

5.5 Evaluation

5.5.1 Datasets

Initially, we created synthetic datasets that followed normal, uniform, and delta distributions with values ranging from 0 to 100.

In addition, we assessed four real datasets. The first dataset was an email dataset [5, 34]. We examined the sender, recipient, and content of each email in the dataset, and we identified 19,753 unique email addresses. Furthermore, we tallied the number of swear words used by each user. We sourced the list of swear words from https://www.noswearing.com/, which has been utilized in numerous studies (e.g., [4, 31, 46]). Data on the number of swear words sent by each user were gathered under \(\epsilon \)-LDP.

The second dataset was a who-trusts-whom network dataset [50]. Under \(\epsilon \)-LDP, we collected information on the number of users trusted by each user. The dataset contained 36,692 users, with trust values ranging from 1 to 3,044.

The third dataset consisted of observational contact data from 86 rural Malawian residents [44]. Participants wore sensors in pouches on the front of their clothing to detect close proximity. A “touch event” between two individuals was identified when their devices exchanged about one radio packet across 20 time intervals. After contact was established, it was deemed continuous if no more than one radio packet was exchanged every second during the following 20-second interval. Each device had an ID number that linked to the contact information of the person carrying the device.

The fourth dataset documented face-to-face interactions among 405 participants at the SFHH conference in Nice, France, held in 2009 [20]. Each participant had a device that sent wireless packets at regular intervals, using temporary addresses assigned to the device. The devices could detect face-to-face encounters at an approximate distance of 1 m.

Table 8 provides an overview of the characteristics of these four datasets.

Table 8 Real datasets

5.5.2 Evaluation Results

We evaluated the effectiveness of our proposed method using synthetic and real datasets. We compared the proposed method with the DCP, PWP, and SW methods proposed in previous studies [35, 80] (see Sect. 5.2). Because these methods do not assume a person-to-person interaction scenario, it is necessary to derive a method for setting the value of the privacy budget.

For DCP, the maximum value of ratio \(Pr(\epsilon ,x)/Pr(\epsilon ,x')\) based on Equation (58) is \(e^\epsilon \) when \(x,x'=1,-1\). In our scenarios, the range of x depending on \(x_{i,j}\) is not 2 but \(2/(n-1)\). In this case, the maximum ratio is represented by

$$\begin{aligned} \gamma (\epsilon , n) = \frac{Pr(\epsilon , -1+2/(n-1))}{Pr(\epsilon , -1)} = \frac{e^\epsilon + n -2}{n-1}. \end{aligned}$$
(70)

Therefore, for each user other than \(u_i\), a privacy budget of \(\log \gamma (\epsilon , n)\) is consumed. If the total privacy loss is to be \(\epsilon \), the privacy budget for \(x_i\) should be set to the value \(\epsilon '\) obtained by solving the following equation:

$$\begin{aligned} \epsilon ' + (n-1) \log \gamma (\epsilon ', n) = \epsilon . \end{aligned}$$
(71)

It is difficult to solve (71) algebraically, but it can easily be solved numerically.
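
Because the left-hand side of (71) is increasing in \(\epsilon '\), a simple bisection suffices; the following sketch is one way to do this (the tolerance is an arbitrary choice).

```python
import math

def solve_budget_dcp(eps_total, n, tol=1e-9):
    """Numerically solve Eq. (71) for epsilon' by bisection."""
    def lhs(eps_p):
        gamma = (math.exp(eps_p) + n - 2) / (n - 1)          # Eq. (70)
        return eps_p + (n - 1) * math.log(gamma)             # left-hand side of Eq. (71)

    lo, hi = 0.0, eps_total                                  # the root lies in [0, eps_total]
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if lhs(mid) < eps_total:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```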

For PWP [80] and SW [35], the consumed privacy budget of \(u_j\) is also \(\epsilon \) when user \(u_i\) sends the \(\epsilon \)-LDP value of \(x_i\) to the server. Therefore, when n users send their LDP values to the server, the value of the privacy budget should be \(\epsilon /n\) to ensure \(\epsilon \)-LDP.

We experimentally evaluated the MSE and MAE. However, due to the space constraint, only the MSE results are shown in this section. The trends in the MSE results and MAE results were very similar. We repeated each experiment 1,000 times and obtained the average value. The range of \(\epsilon \) was set to [1, 20] based on [7, 74]. In several existing studies, \(\epsilon \) was set to smaller values. In practice, the range [1, 20] is sufficient for \(\epsilon \). In the setting we used for the synthetic datasets, each true value existed in the range [0, 100]. When \(\epsilon \) was 1, the average amount of the Laplace noise per user was 200. The noise was large enough to ensure that the true value was not recognizable at all. When \(\epsilon \) was 20, the average amount of Laplace noise per user was 10. Although the privacy protection level was relatively low, this value would be sufficient in some cases. The range of n (number of users) was set to [100, 10000]. The default values of \(\epsilon \) and n were set to 10 and 1000, respectively.

Fig. 15
Six line graphs of MSE versus \(\epsilon \) (a–c) and versus the number of users n (d–f) for the normal, uniform, and delta distributions; the compared methods are PWP, DCP, SW, and the proposed method.

Mean squared error (MSE) results for synthetic datasets

The MSE results for the synthetic datasets are shown in Fig. 15. The results obtained with a varying value of \(\epsilon \) are shown in Fig. 15a–c. The results labeled Proposal (math) represent the mathematical analysis in (66) and (68). The results of PWP and SW are worse than those of the other methods. This is because when a user \(u_i\) sends an \(x_i\) value under \(\epsilon '\)-LDP, this behavior consumes \(\epsilon '\) of \(u_i\)’s privacy budget and another \(\epsilon '\) of every other user’s privacy budget. DCP performed well when \(\epsilon \) was small. However, for larger \(\epsilon \), the proposed method proved to be more effective than DCP; it has previously been reported that DCP does not perform well when \(\epsilon \) is large [7]. Figure 15d–f show the results for different numbers of users. As the number of users increases, the total amount of accumulated noise increases. However, if the noise added to each value is not too large, the noise terms cancel each other out, and the effect of each noise addition is mitigated. Owing to this tradeoff, the MSE increases or decreases with the number of users depending on the method. For the proposed method and DCP, the MSE decreased as the number of users increased, because each noise addition was relatively small. In contrast, because PWP adds larger noise to each value, its MSE for the estimated average increased with the number of users. The results were very similar across all datasets: as can be seen from Equations (66) and (68), the MSE and MAE do not depend on the content of the dataset but on the number of users and the value of \(\epsilon \).

The MSE results for the real datasets are shown in Fig. 16. The performances of DCP and the proposed method were better than those of the other methods for all datasets. The differences between DCP and the proposed method are difficult to discern in Fig. 16, but they are significant in terms of the MSE values: when \(\epsilon \) was 10, the proposed method reduced the MSE by 47%, 40%, 66%, and 62% on the four datasets, respectively, compared with DCP.

Even if the amount of noise added to each value is large, the accuracy of the estimation can be increased by collecting a large amount of user data. Therefore, regarding the two large datasets (the e-mail and who-trusts-whom network datasets), the difference between the proposed method and other methods was relatively small. However, regarding the two small datasets (the Village and SFHH datasets), it was difficult for all the methods to estimate the average value with high accuracy. The proposed method is particularly effective in this difficult task with a small number of users.

Fig. 16
Four multiline graphs of MSE versus \(\epsilon \) for the e-mail (a), who-trusts-whom network (b), Village (c), and SFHH (d) datasets; the compared methods are PWP, SW, DCP, and the proposed method (experimental and mathematical lines).

MSE results for real datasets. Although the difference between the proposed method and DCP appears small, the proposed method reduces the MSE by 47%, 40%, 66%, and 62%, respectively, when \(\epsilon =10\)

If the server collects data streams from each person, the per-report privacy budget will be small. Therefore, the performances of the proposed method and DCP are similar in such a case. Improving the performance of the proposed method for small values of \(\epsilon \) is left for future work.

6 Conclusion

To create a human-centric digital twin for smart cities, it is crucial to collect extensive information about individuals’ attributes and behaviors while ensuring privacy protection. However, existing privacy-preserving data mining solutions have not adequately addressed measurement noise, missing values, or human interactions, which has led to the loss of data analysis accuracy and privacy leakage. This chapter focused on addressing these challenges, which are particularly pronounced in smart city environments, and utilized local differential privacy (LDP) as the principal metric for evaluating privacy. LDP is a widely adopted privacy-preserving technique that introduces randomness at the data source to provide robust privacy guarantees. By refining the data collection and analysis methods, this chapter was intended to enhance the development of digital twins and the realization of truly smart cities. The proposed system has been demonstrated to achieve higher accuracy and enhanced privacy protection than existing methods through experiments using both synthetic and real-world data, as well as through theoretical analysis. It is believed that this system could serve as a foundation for the realization of more advanced smart cities. Moving forward, plans are in place to conduct pilot studies for the practical implementation of smart city development.