Privacy-preserving data (stream) mining techniques and their impact on data mining accuracy: a systematic literature review

This study investigates existing input privacy-preserving data mining (PPDM) methods and privacy-preserving data stream mining (PPDSM) methods, including their strengths and weaknesses. A further analysis was carried out to determine to what extent existing PPDM/PPDSM methods address the trade-off between data mining accuracy and data privacy, which is a significant concern in the area. The systematic literature review was conducted using data extracted from 104 primary studies drawn from five reputed databases. The scope of the study was defined using three research questions and adequate inclusion and exclusion criteria. Based on the results of our study, we divided existing PPDM methods into four categories: perturbation, non-perturbation, secure multi-party computation, and combinations of PPDM methods. These methods have different strengths and weaknesses with respect to accuracy, privacy, time consumption, and other factors. Data stream mining must also face additional challenges such as high volume, high speed, and computational complexity. Fewer techniques have been proposed for PPDSM than for PPDM; we categorized PPDSM techniques into three categories (perturbation, non-perturbation, and other). Most PPDM methods can be applied to classification, followed by clustering and association rule mining. We observed that numerous studies have identified and discussed the accuracy-privacy trade-off. However, there is a lack of studies providing solutions to the issue, especially in PPDSM.


Introduction
Data mining and machine learning involve extracting knowledge from data, which significantly impacts organizations' growth. Organizations use their past and current data to make decisions to improve their performance or services, which is where data mining comes into play (Kiran and Vasumathi 2018). This process consists of mining useful information from raw data and making predictions that support decision-making (Dutta and Guppta 2016; Dhanalakshmi and Siva Sankari 2014). There are two main approaches to data mining and machine learning: supervised learning and unsupervised learning. These include techniques such as classification, clustering, and association rules (Kiran and Vasumathi 2018; Narwaria and Arya 2016). These methods identify data patterns and produce useful information and predictions that benefit organizations.
The success of the data mining process is measured using the accuracy of data mining results (Paul et al. 2021; Putri and Hira 2017). Accuracy reflects how well the patterns in the data have been learned by the data mining process. For instance, in a classification task, accuracy can be computed as the percentage of correctly classified unknown data records over the total number of records (Nayahi and Kavitha 2017). Higher accuracy leads to improved decision-making. Some studies refer to accuracy as "utility" (Feyisetan et al. 2020; Nayahi and Kavitha 2017; Denham et al. 2020; Tsai et al. 2016). Henceforth, we use the term accuracy for consistency.
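As a minimal illustration (the function and sample labels are ours, not drawn from any surveyed study), classification accuracy over a set of unknown records can be computed as:

```python
def classification_accuracy(true_labels, predicted_labels):
    """Fraction of records whose predicted class matches the true class."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# 4 of 5 records classified correctly -> accuracy 0.8
acc = classification_accuracy(["flu", "cold", "flu", "flu", "cold"],
                              ["flu", "cold", "flu", "cold", "cold"])
```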
One of the significant challenges stakeholders have to face in data mining is to protect the individual's privacy in data while using those for data mining (Patel and Kotecha 2017). Datasets may contain some data that data owners do not want to reveal to the outside world (Bhandari and Pahwa 2019). This data is called sensitive data (Qi and Zong 2012). For example, patients' medical history details from a hospital database or customers' bank balance details from a banking database can be considered sensitive data. Sensitive data needs to be protected so that the privacy of the individuals can be preserved in the data mining process.
Defining privacy is not as straightforward as defining accuracy. Privacy depends on the techniques and environment used to measure it. A more generic definition of privacy is proposed by Aggarwal and Yu (2008b): "the degree of uncertainty according to which original private data can be inferred." Most privacy metrics currently in use assume that the attacker has some background knowledge of the original data. The most common way of measuring privacy is to perform attacks on perturbed data and attempt to recover the original records. Breach probability is another commonly used measure of privacy that compares the error/difference between the original and recovered records with a threshold value. If the error is less than the threshold, the record is identified as a breach of privacy (Giannella et al. 2013; Denham et al. 2020).
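The breach-probability idea can be sketched as follows; this is an illustrative simplification with our own threshold convention, not a metric taken verbatim from the cited works:

```python
def breach_rate(original, recovered, epsilon=0.1):
    """Fraction of records an attacker reconstructs to within a relative
    error below epsilon; each such record counts as a privacy breach."""
    breaches = sum(
        abs(r - o) / max(abs(o), 1e-12) < epsilon
        for o, r in zip(original, recovered)
    )
    return breaches / len(original)

original = [10.0, 20.0, 30.0, 40.0]
recovered = [10.5, 25.0, 30.1, 55.0]     # attacker's reconstruction attempt
rate = breach_rate(original, recovered)  # 2 of 4 records within 10% -> 0.5
```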
Privacy-Preserving Data Mining (PPDM) (Malik et al. 2012; Md Siraj et al. 2019; Carvalho and Moniz 2021) has been introduced as a solution to privacy concerns in data mining and has become a prominent area of data mining in the past few decades. PPDM methods should protect the privacy of the data while allowing the data mining process to carry out its duty as usual (Bhandari and Pahwa 2019; Carvalho and Moniz 2021). This means PPDM methods should not cause a considerable impact on the output of the data mining (Malik et al. 2012; Carvalho and Moniz 2021). Two broader categories of PPDM methods, called input PPDM and output PPDM, can be seen in the literature (Kotecha and Garg 2017; Peng et al. 2010). Input PPDM modifies original data before data mining to preserve privacy, while output PPDM modifies the data mining output to preserve privacy. Our work focuses only on input PPDM methods, as output PPDM methods mainly involve modifying data mining techniques (classifiers or clustering algorithms) and need to be discussed separately. Different input privacy-preserving methods such as perturbation, anonymization, and encryption have been proposed and practiced in the data mining community. These methods impact both privacy and data mining tasks, positively and negatively. The ultimate expectation of PPDM methods is twofold: to protect data privacy so that unauthorized parties cannot identify individuals from the data, and to maintain the statistical properties of the data so that data mining performance does not degrade (Denham et al. 2020; Lin et al. 2016; Chen and Liu 2011; Kabir et al. 2007a). The transformed and protected dataset can then be used for data mining without accessing the original dataset.
Data stream mining involves methods and algorithms to extract knowledge from volatile streaming data (Krempl et al. 2014). Combining privacy-preserving techniques with data stream mining is called Privacy-Preserving Data Stream Mining (PPDSM) (Lin et al. 2016; Denham et al. 2020). Mining useful information and making predictions from data streams raise additional concerns due to the behaviour of data streams. Unlike static datasets, streaming data is continuous, transient, and unbounded, and therefore needs faster processing (Cao et al. 2011; Chamikara et al. 2018; Martínez Rodríguez et al. 2017). PPDSM methods must cater to the specific behaviour of data streams (Kotecha and Garg 2017; Chamikara et al. 2019), and therefore privacy preservation in data streams needs to be addressed differently (Cuzzocrea 2017; Tayal and Srivastava 2019).
Most of the proposed PPDM/PPDSM methods have succeeded in preserving the privacy of data but negatively affect the data mining results (Chamikara et al. 2019; Kaur 2017). PPDM methods transform original data values into another form that makes them unrecognizable to outsiders. This process can destroy the statistical properties of the data that are useful in mining. Therefore, there is a trade-off between data privacy and data mining accuracy (Chen and Liu 2011): increasing data privacy can decrease the data mining accuracy and vice versa (Paul et al. 2021). Current research has identified this inherent trade-off (referred to hereafter as the accuracy-privacy trade-off) and has proposed different PPDM methods to address the issue (Kaur 2017; Wang and Zhang 2007; Soria-Comas et al. 2016; Babu and Jena 2011). Nevertheless, no method has been found to fully optimize the accuracy-privacy trade-off, and the issue is still open to discussion.
Several existing works study PPDM and PPDSM, but we could not find secondary studies that discuss the accuracy-privacy trade-off in PPDM/PPDSM in detail. This study investigates existing PPDM methods, their strengths and weaknesses, and the data mining tasks they apply to. Subsequently, we consider the unique challenges faced by PPDSM and compare current techniques. This leads to a discussion of the accuracy-privacy trade-off and an assessment of how well current PPDM/PPDSM methods address it.
This systematic literature review makes several contributions to the area of PPDM/PPDSM. The remainder of this paper is organized as follows. Section 2 describes the method we followed in carrying out this systematic literature review. We discuss the results and findings of the study in Sect. 3. Finally, we discuss and conclude the knowledge gained from this study in Sects. 4 and 5.

SLR protocol
This study has been carried out as a Systematic Literature Review (SLR), and its primary goal is to evaluate methods and techniques proposed in the areas of PPDM and PPDSM. PPDM is a broad area that can be evaluated along many branches. We focus on evaluating PPDM's applicability to data streams and its effect on the accuracy-privacy trade-off. The rest of this section explains the steps followed in conducting the SLR (Kitchenham et al. 2009). Several secondary studies have surveyed PPDM (Patel and Kotecha 2017; Abdul et al. 2015; Vishwakarma et al. 2016; Malik et al. 2012). Studies such as Kiran and Vasumathi (2018), Nasiri and Keyvanpour (2020), and Dutta and Guppta (2016) present frameworks and categorizations of existing PPDM methods to provide an overall picture of the area.

Problem identification
Though there are numerous studies on PPDM, only a few focus on the application of PPDM to data stream mining. Research works such as Tayal and Srivastava (2019), Gomes et al. (2019), Cuzzocrea (2017), and Krempl et al. (2014) discuss the challenges, opportunities, and possible future directions in privacy-preserving data stream mining. However, to the best of our knowledge, we could locate only one study (Sakpere and Kayem 2014) that discusses existing PPDM methods specifically for data streams. A few studies (Tran and Hu 2019; Sangeetha and Sadasivam 2019) discuss PPDM methods that can be used for big data in general and that have the potential to be used on data streams, as streams are a category of big data. Therefore, a proper evaluation of PPDM methods for data streams is necessary.
The well-known accuracy-privacy trade-off is a concern that still needs more attention, and a considerable number of studies (Qi and Zong 2012; Narwaria and Arya 2016; Patel and Kotecha 2017; Jain et al. 2016) have identified this issue. However, very few (Malik et al. 2012; Vishwakarma et al. 2016; Shanthi and Karthikeyan 2012) have discussed it in detail with respect to different PPDM methods. We consider the accuracy-privacy trade-off an aspect that needs to be discussed for both static datasets and data streams.
By analyzing the existing secondary studies, we could confirm a lack of studies that discuss the accuracy-privacy trade-off in PPDM and the existing PPDM techniques specifically for data stream mining. The motivation for our research work arises from this gap, and we try to address these issues in this comprehensive study.

Research questions
After identifying gaps in existing secondary studies related to PPDM, we started our SLR by formulating three Research Questions (RQ) to address those gaps.
• RQ1: What are the existing privacy-preserving data mining methods?
  - RQ1.1: What are the strengths and weaknesses of the investigated methods?
  - RQ1.2: What data mining tasks can these methods be used for?
• RQ2: What is the nature of privacy preservation in data stream mining?
  - RQ2.1: What challenges can be identified when applying PPDM methods for data stream mining?
  - RQ2.2: What are the PPDM methods that have been proposed for data stream mining?
• RQ3: To what extent do the privacy-preserving data mining approaches identified in answering RQ1 and RQ2 address the trade-off between data privacy and classification accuracy, and what methods have been proposed to optimize the accuracy-privacy trade-off?
Concerning RQ1, we summarize the most common PPDM methods used in the data mining community. To provide a broader insight into existing PPDM methods, we discuss the merits and demerits of each method, along with the data mining tasks they can be used for under the sub-questions of RQ1. Under RQ2, we discuss the applicability of PPDM methods identified in RQ1 in data streams and the different PPDSM methods proposed specifically for data stream mining. Here we also try to identify the challenges in applying PPDM methods in data stream mining as data streams behave differently than static datasets.
By answering RQ3, we aim to determine whether the accuracy-privacy trade-off has received the attention it deserves, as it is a severe concern in the area. We investigate whether the authors have identified or discussed the above issue and the possible steps they have proposed or implemented to reduce the trade-off.

Search process
A systematic manual search was conducted to find potential research work. First, we identified search keywords to initiate our search. The search was conducted using three sets of keywords to reduce the complexity of the searching process. Set 1 focuses on selecting articles on privacy-preserving data mining, which may or may not include a discussion of the accuracy-privacy trade-off. We considered articles with the term "utility" together with PPDM in Set 2, as some researchers prefer "utility" over "accuracy," and the terms are used interchangeably in the literature. Set 3 selects data stream mining articles concerning PPDM or privacy.
Five major databases were selected for the search: Scopus, IEEE, Science Direct, Springer, and ACM. The initial search was carried out using the search strings mentioned above, and the resulting studies were then filtered using the inclusion and exclusion criteria mentioned in the next section.

Inclusion Criteria (IC) and Exclusion Criteria (EC)
Running the three sets of keywords through the five databases resulted in 3923 studies. The following IC and EC were used to manually select the studies relevant to the formulated research questions. Using this process, we made sure that all the relevant studies were included and irrelevant studies were excluded to increase the effectiveness of the SLR. Figure 1 illustrates the process of selecting the primary studies for the SLR. The above-mentioned IC and EC were applied in different steps to filter out the articles that were out of scope considering the defined research questions. This process was carried out manually. For example, after selecting the potential articles from the initial search as the first step, IC1, IC3, EC5, EC6, EC7, and EC8 were applied as the second step, reducing the remaining number of articles to 1930. The process was repeated until only the most relevant articles remained, which turned out to be 114. We only considered studies published in the last 20 years, as most studies related to PPDM were published after 2001.

Data extraction and analysis
Data was extracted from each selected article by thoroughly reading the abstract and conclusion and skimming through the rest of the text. The following data was collected.
• Title of the study
• Year of publication
• PPDM technique/method proposed
• Strengths and weaknesses of the proposed method/technique
• Accuracy and privacy evaluation metrics
• Applicable data mining tasks
• Applicability to data streams
• Challenges identified on applying PPDM to data streams
• Discussion on accuracy-privacy trade-off

Collected data were stored and analyzed using MS Excel to answer the formulated research questions. Figure 2 shows the distribution of the selected studies according to the year of publication. We can observe that many related studies have been published from

Results
This section summarizes the results and findings of our SLR for each research question.

Addressing RQ1-Generic PPDM methods
All the different PPDM techniques and methods found in the final set of articles were studied to answer RQ1 and its sub-questions. There are many categorizations of PPDM proposed in the literature. The authors of Arumugam and Sulekha (2016) and Rajalakshmi and Mala (2013) categorized PPDM techniques into two main categories: Secure Multi-Party Computation and perturbation. In Kaur (2017), PPDM has been divided into five categories: anonymization, perturbation, randomization, cryptography, and condensation.
According to the analysis of extracted data, we agree with the categorization proposed in Tran and Hu (2019) as we believe it is more generic and justifiable. Therefore, we divide existing input PPDM techniques into four main categories: Secure Multi-Party Computation, perturbation methods, non-perturbation methods, and combinations of the above techniques by extending the categorization provided by Tran and Hu (2019). This section discusses all the techniques included in these four categories in detail.

Secure Multiparty Computation (SMC)
Secure Multi-Party Computation (SMC) methods are used for collaborative data mining and rely on cryptographic tools to protect data (Rajalakshmi and Mala 2013). SMC allows different parties to jointly compute a certain functionality without revealing personal data (Tran and Hu 2019). Therefore, cryptographic methods can be used for distributed privacy and information sharing. SMC became popular as it provides a well-defined privacy model and methods for proving and quantifying privacy (Sachan et al. 2013). However, there is a concern that cryptographic techniques do not protect output privacy; instead, they stop the leakage of sensitive data during the computation process (Sachan et al. 2013). The data mining community prefers perturbation techniques over SMC techniques because of their lower computational complexity (Chamikara et al. 2021, 2019). Cryptographic methods use encryption schemes that pose challenges in scalability and implementation efficiency (Tran and Hu 2019). However, some improved encryption methods such as Park et al. (2022) and Dhinakaran and Prathap (2022b) have recently been implemented with lower computational complexity and execution time.

Perturbation methods
This section discusses data perturbation methods that distort data values in specific ways to hide sensitive information while maintaining the data properties important for data mining (Chen and Liu 2011). Data perturbation is the most commonly used privacy-preserving technique in data mining because of its simplicity and computational efficiency. In Rajalakshmi and Mala (2013), perturbation has been identified as altering data using statistical methodologies. However, data perturbation methods have to pay special attention to the accuracy of data mining, as distorting data can significantly affect the data mining process. Perturbation can be divided into the value alteration approach and the probability distribution approach (Chidambaram and Srinivasagan 2014). This section discusses the techniques that can be considered data perturbation methods.
Using noise to distort data is one of the earliest data perturbation methods (Denham et al. 2020). Additive and multiplicative noise are the two main uses of noise in the PPDM context (Chidambaram and Srinivasagan 2014). Random values with zero mean and a specified variance are generated from a given distribution, such as a Gaussian or Uniform distribution. The generated noise values are added to each record in the additive setting, while each record is multiplied by the noise values in the multiplicative setting (Denham et al. 2020; Chidambaram and Srinivasagan 2014; Kim and Winkler 2003). The original data values are distorted, while the underlying data distribution can be reconstructed (Kim et al. 2012). If the variance of the added noise is high, a high level of privacy can be expected, but this also causes high information loss. Later, a combined version of additive and multiplicative noise was proposed in Chidambaram and Srinivasagan (2014). This combined approach guarantees more privacy than the individual approaches.
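The two noise settings can be sketched with NumPy as follows; the data and variance values are arbitrary illustrative choices of ours, and multiplicative noise is drawn around 1 (a common convention) so that magnitudes are roughly preserved:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.array([52.0, 61.0, 47.0, 70.0])   # e.g. sensitive numeric records

# Additive noise: zero-mean Gaussian noise added to every record.
additive = data + rng.normal(loc=0.0, scale=2.0, size=data.shape)

# Multiplicative noise: every record multiplied by noise centred on 1.
multiplicative = data * rng.normal(loc=1.0, scale=0.05, size=data.shape)
```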
Chen and Liu (2005) first proposed a geometric transformation method named random rotation for PPDM-based classification. The original dataset with m attributes is multiplied by an (m × m) random orthogonal matrix (Denham et al. 2020), perturbing all the attributes together (Chen and Liu 2005). A rotation-based approach that only transforms sensitive attributes is proposed in Ketel and Homaifar (2005). Perturbation using rotation transformation is vulnerable to rotation-center attacks (Ketel and Homaifar 2005; Chen and Liu 2005), as data closer to the origin is less perturbed than other data records (Denham et al. 2020). Recently, improved versions of random rotation, such as 3-D rotation transformation (Upadhyay et al. 2018) and 4-D rotation transformation (Javid and Gupta 2020), have been proposed, and these methods assure high data mining accuracy.
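Random rotation perturbation can be sketched as below; the matrix sizes are illustrative, and the orthogonal matrix is obtained here via QR decomposition, one common construction. Because rotation preserves pairwise Euclidean distances exactly, distance-based miners see the same geometry:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))          # 100 records, m = 4 attributes

# Random orthogonal matrix from the QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))

X_perturbed = X @ Q                    # rotate all attributes together

# Rotation preserves pairwise distances exactly.
d_orig = np.linalg.norm(X[0] - X[1])
d_pert = np.linalg.norm(X_perturbed[0] - X_perturbed[1])
```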
Other geometric data perturbation methods that combine random rotation, translation, and noise addition have been proposed to minimize the vulnerabilities accompanying rotation transformation (Chen and Liu 2011; Chen et al. 2007). These methods are made robust to rotation-center attacks by adding a translation, and to distance-inference attacks by adding noise. However, they can still be vulnerable to background knowledge-related attacks (Chen et al. 2007).
Differential privacy (Dwork 2008) is a technique with a strong privacy guarantee that works by adding Laplace noise to the results of statistical queries over a database. It ensures that an outsider cannot determine whether any single record has been added or altered. According to Dwork (2008), the query result is insensitive to the change of a single record, which makes it difficult for an attacker to gain knowledge about the data. Research works such as Mivule et al. (2012) and Tang et al. (2019) use differential privacy as the PPDM technique.
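As a sketch of the Laplace mechanism behind differential privacy (the query, dataset, and epsilon here are our own toy choices): a count query has sensitivity 1, because adding or removing one record changes the count by at most 1, so Laplace noise of scale 1/epsilon hides any individual's contribution:

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw one sample from a zero-mean Laplace distribution
    (inverse-CDF sampling)."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(data, predicate, epsilon, rng):
    """epsilon-DP count query via the Laplace mechanism."""
    true_count = sum(1 for x in data if predicate(x))
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(1)
salaries = [40, 55, 70, 90, 120]
# True count is 3; the released value carries Laplace noise of scale 2.
noisy = dp_count(salaries, lambda s: s > 60, epsilon=0.5, rng=rng)
```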
Tables 1 and 2 summarize different PPDM techniques in noise injection, rotation, differential privacy and other geometric transformations, along with the strengths and weaknesses.
Random Projection (RP) based multiplicative data perturbation was proposed in Liu et al. (2006). RP projects a given dataset from a higher-dimensional space to a lower-dimensional subspace. This method is based on the Johnson-Lindenstrauss Lemma, and pair-wise distances between any two data points can be maintained within a small range (Liu et al. 2006; Denham et al. 2020). It can therefore be considered an approximate distance-preserving method. The authors of Liu et al. (2006) have stated that RP can be more powerful when used with geometric transformation techniques such as scaling, rotation, and translation. Recently, a random projection-based noise addition method was proposed in Denham et al. (2020). This method experimentally demonstrated high accuracy and privacy levels by combining RP, translation, and noise addition.
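A Gaussian random projection in the Johnson-Lindenstrauss style can be sketched as follows; the dimensions and scaling are our own illustrative choices, and unlike rotation, distances survive only approximately:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 50))      # 200 records in a 50-dimensional space

k = 20                              # target lower-dimensional subspace
# Gaussian projection matrix scaled so squared distances are preserved
# in expectation.
R = rng.normal(loc=0.0, scale=1.0 / np.sqrt(k), size=(50, k))
X_projected = X @ R                 # perturbed, lower-dimensional data

# Pairwise distances are preserved only within a small relative error.
d_orig = np.linalg.norm(X[0] - X[1])
d_proj = np.linalg.norm(X_projected[0] - X_projected[1])
```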
Condensation can also be considered a perturbation-based PPDM method. It condenses data records into groups of a pre-defined size k while maintaining the statistical properties within each group (Aggarwal and Yu 2004). It is not possible to distinguish one record from another within a group. Pseudo-data is then generated in place of the original data using the statistical information within the group. Condensation maintains inter-attribute correlations, which guarantees a high accuracy level (Aggarwal and Yu 2004, 2008a).
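A rough sketch of the condensation idea, under simplifying assumptions of ours (fixed-size sequential groups, Gaussian pseudo-data drawn from each group's mean and covariance), might look like:

```python
import numpy as np

def condense(X, k, rng):
    """Split records into groups of size k and release pseudo-records
    drawn from each group's own statistics, so no original record is
    published and records within a group are indistinguishable."""
    pseudo = []
    for start in range(0, len(X) - len(X) % k, k):
        group = X[start:start + k]
        mean = group.mean(axis=0)
        cov = np.cov(group, rowvar=False)
        pseudo.append(rng.multivariate_normal(mean, cov, size=k))
    return np.vstack(pseudo)

rng = np.random.default_rng(3)
X = rng.normal(loc=[10.0, 5.0], size=(30, 2))   # toy two-attribute data
X_pseudo = condense(X, k=10, rng=rng)           # 30 pseudo-records
```

Because each group's mean and covariance are retained, aggregate statistics (and hence inter-attribute correlations) stay close to those of the original data.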
A few works, such as Meghanathan et al. (2014), Jahan et al. (2016), and Cano et al. (2010), have considered using fuzzy logic-based techniques for data perturbation. A fuzzy logic-based perturbation method with low processing time has been proposed in Meghanathan et al. (2014). Though the method's accuracy is similar to that of the original dataset, its privacy still needs to be evaluated. A multiplicative perturbation method using fuzzy logic has been implemented in Jahan et al. (2016); this method achieved good accuracy and privacy levels for classification and clustering. Another work that uses fuzzy models for synthetic data generation as a perturbation method can be found in Cano et al. (2010). Table 3 gives an overview of different techniques in random projection, condensation, fuzzy logic, and other distortion methods in PPDM.
A considerable number of PPDM methods combine different transformations and data distortion methods to achieve better performance, considering both privacy and accuracy. Some research works (Peng et al. 2010; Nethravathi et al. 2016; Li and Wang 2011; Putri and Hira 2017; Wang and Zhang 2007; Xu et al. 2006; Hasan et al. 2019; Li and Xue 2018) use transformation techniques such as Singular Value Decomposition (SVD), Non-negative Matrix Factorization (NMF), and Discrete Wavelet Transformation (DWT) to perturb data. In Gokulnath et al. (2015) and Kabir et al. (2007a), the authors have used Principal Component Analysis (PCA), and PCA together with noise addition has been used in Mukherjee et al. (2008) for data perturbation. These methods are summarised in Table 4.
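As an illustrative sketch of transformation-based perturbation (a generic truncated-SVD reconstruction of our own, not any specific cited method): keeping only the top-r singular components retains the dominant structure used in mining while discarding record-level detail.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 6))     # toy dataset: 100 records, 6 attributes

# Truncated SVD: rebuild the data from only the top-r singular components.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
r = 3
X_perturbed = (U[:, :r] * s[:r]) @ Vt[:r, :]    # rank-r approximation
```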

Non-perturbation methods
Non-perturbation methods sanitize the identifiable information to preserve privacy (Tran and Hu 2019), and different anonymization techniques fall into this category. Non-perturbation methods modify or remove only a portion of the data (Vijayarani and Tamilarasi 2013), whereas perturbation methods distort each data value. This process uses techniques that make a single record indistinguishable from a specified number of other records, so that individual records cannot be identified and privacy is preserved. We discuss a set of non-perturbation-based PPDM methods in this section. Anonymization is the most used non-perturbation technique; it involves identifying the different parts of a data record, such as identifiers, quasi-identifiers, and sensitive and non-sensitive attributes. It then removes identifiers and modifies quasi-identifiers using techniques such as generalization and suppression, making a record indistinguishable from a set of other records (Tran and Hu 2019). Different anonymization methods can be seen in the literature, such as k-anonymity (Sweeney 2002), l-diversity (Machanavajjhala et al. 2007), and t-closeness (Li and Venkatasubramanian 2007).
The basic method of anonymization, k-anonymity, ensures that a single data record cannot be distinguished from at least k-1 other records (Sweeney 2002; Tsai et al. 2016). Identifying the different parts of a data record is essential here; generalization and suppression techniques are then applied to achieve a k-anonymized dataset. This method reduces the risk of a re-identification attack caused by the direct linkage of shared attributes (Tsai et al. 2016). The main weakness of the method is its assumption that no two tuples contain data of the same person, which may not always be true (Sweeney 2002).
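A toy sketch of generalization and suppression (the record layout and 10-year banding are our own illustrative assumptions; real k-anonymity algorithms choose generalization levels so that every equivalence class holds at least k records):

```python
def anonymize(records):
    """Suppress the direct identifier ('name') and generalize the
    quasi-identifier 'age' into 10-year bands."""
    out = []
    for rec in records:
        lo = (rec["age"] // 10) * 10
        out.append({"age": f"{lo}-{lo + 9}", "disease": rec["disease"]})
    return out

records = [
    {"name": "A", "age": 23, "disease": "flu"},
    {"name": "B", "age": 27, "disease": "cold"},
    {"name": "C", "age": 21, "disease": "flu"},
]
anon = anonymize(records)   # all three records share the band "20-29",
                            # forming one equivalence class of size 3
```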
Another weakness of k-anonymity is its vulnerability to background knowledge-based attacks such as complementary release attacks and temporal inference attacks. As a solution, an improved anonymization model called l-diversity was introduced (Machanavajjhala et al. 2007). A table is called l-diverse if there are l well-represented values for the sensitive attribute in each equivalence class (Wang et al. 2009). The method provides privacy even when the data owner does not know what kind of knowledge the attacker has. However, it is difficult to implement for multiple sensitive attributes (Machanavajjhala et al. 2007) and is vulnerable to attacks such as similarity attacks (Wang et al. 2009).
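The simplest ("distinct") variant of the l-diversity check can be sketched as follows; the record layout is our own toy assumption, and more refined variants (entropy l-diversity, recursive (c,l)-diversity) use stronger notions of "well-represented":

```python
from collections import Counter

def is_l_diverse(equivalence_class, sensitive_key, l):
    """An equivalence class is (distinct) l-diverse if its sensitive
    attribute takes at least l different values."""
    counts = Counter(r[sensitive_key] for r in equivalence_class)
    return len(counts) >= l

group = [{"zip": "123*", "disease": d} for d in ("flu", "flu", "cancer")]
ok = is_l_diverse(group, "disease", l=2)    # two distinct values -> True
```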
Another anonymization method named t-closeness was proposed in Li and Venkatasubramanian (2007). The requirement for t-closeness is that the distribution of a sensitive attribute in an equivalence class must be close to the distribution of the same attribute in the overall table. If the distance between the two distributions is less than the threshold t, t-closeness has been achieved (Li and Venkatasubramanian 2007; Soria-Comas et al. 2016). This overcomes the skewness and similarity attacks but cannot deal with identity disclosure attacks or multiple sensitive attributes (Li and Venkatasubramanian 2007). Several more variations of anonymization, such as p-sensitive anonymization (Sowmyarani et al. 2013), have been proposed in addition to these main methods as solutions to the privacy issues of the existing methods.
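The t-closeness check can be sketched as below. Note an assumption of ours: the original proposal measures distance with the Earth Mover's Distance, whereas this illustration substitutes the simpler total variation distance:

```python
from collections import Counter

def distribution(rows, key):
    """Empirical distribution of a sensitive attribute."""
    counts = Counter(r[key] for r in rows)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def satisfies_t_closeness(equivalence_class, table, key, t):
    """Compare the class distribution with the whole-table distribution
    (total variation distance in place of the Earth Mover's Distance)."""
    p = distribution(equivalence_class, key)
    q = distribution(table, key)
    d = 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in set(p) | set(q))
    return d <= t

table = [{"disease": d} for d in ["flu"] * 5 + ["cancer"] * 5]
group = table[:3] + table[5:7]   # 3 flu and 2 cancer records
ok = satisfies_t_closeness(group, table, "disease", t=0.2)  # distance ~0.1
```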
The main issue with all these anonymization methods is that there is no specific computational approach to determine what data should be anonymized; this entirely depends on expert knowledge (Sowmyarani et al. 2013). Different anonymization techniques, along with their strengths and weaknesses, can be found in Table 5.

Methods combining cryptographic, perturbation and non-perturbation techniques
Privacy-Preserving Data Mining methods that use different combinations of the above-discussed techniques are also employed to preserve privacy. The reason for proposing these combined methods is to exploit the benefits of each constituent method while reducing or eliminating its weaknesses. In Kaur (2017), the authors have proposed a hybrid PPDM method combining perturbation and anonymization. This method uses additive noise and suppression techniques to achieve minimum loss by avoiding the generalization involved in anonymization. An improved PPDM method was proposed in Poovammal and Ponnavaikko (2009), and some combined methods are suitable for association rule mining. There are also improved methods (Aggarwal and Yu 2004, 2008a; Li and Xue 2018; Meghanathan et al. 2014) that can be used for more than one data mining task. Table 6 summarizes the PPDM methods that combine perturbation, non-perturbation, and cryptographic techniques.
The distribution of PPDM methods among data mining tasks can be seen in Fig. 3. Most PPDM methods can be applied to classification, followed by association rule mining. While comparably fewer methods have been proposed specifically for clustering algorithms, a considerably high number of PPDM methods can be used for more than one data mining task (classification and clustering, classification and association rule mining). Moreover, some PPDM methods can be used with any data mining algorithm.
For PPDM methods, we have reviewed their strengths and weaknesses. Most methods are vulnerable to attacks related to background knowledge, and the researchers have identified this problem. Though we cannot provide a simplified categorization of strengths and weaknesses, we have pointed out the strengths and weaknesses of the reviewed methods in Tables 1, 2, 3, 4, 5 and 6. These tables provide a comprehensive answer to the sub-questions in RQ1 by summarizing all the different PPDM techniques we found and the strengths, weaknesses, and challenges.

Addressing RQ2-Privacy-Preserving Data Stream Mining (PPDSM)
This section discusses the PPDM methods that can be applied to data streams and the challenges that must be overcome to apply PPDM methods to data streams successfully.

Challenges in data stream mining
Most generic PPDM methods discussed in Sect. 3.1 cannot be directly applied to data streams due to the challenging behaviour of streams. There are three principal challenges in mining data streams, namely volume, velocity, and volatility (Krempl et al. 2014; Tran and Hu 2019). Data streams pose numerous further challenges, such as data preprocessing, analyzing complex data, dealing with delayed data, and handling concept drift (Krempl et al. 2014; Gomes et al. 2019); privacy is only one concern that data stream mining has to address. Data streams are continuous, transient, and usually unbounded (Wang et al. 2018; Martínez Rodríguez et al. 2017) in nature. Mining data streams is a continuous process, and it cannot be redone as for static datasets because it is not possible to access the full set of data at once (Khavkin and Last 2019). Data may arrive at high speed, and therefore fast execution is needed (Chamikara et al. 2018; Lin et al. 2016). Privacy preservation needs to be performed quickly, and incoming data should be released with minimum delay. Due to the unbounded nature of data streams, PPDM methods should be able to cope with a massive volume of data at a fast execution time (Denham et al. 2020). Computer memory is too small relative to the vast data volume, so all data cannot be stored. Another challenge of data stream mining is concept drift (Cuzzocrea 2017; Gomes et al. 2019; Tayal and Srivastava 2019), which affects the PPDM process. The underlying data distribution can change with time, and data mining models should be able to adapt to concept drift to achieve a good accuracy level (Zhang and Li 2019; Khavkin and Last 2019). Privacy-preservation methods should likewise be able to cope with the effects of concept drift.
Considering all these facts, data stream mining and privacy preservation are two conflicting tasks (Kotecha and Garg 2017). Data stream mining must execute quickly and cope with memory restrictions, while generic privacy preservation methods require multiple scans over the data, which is time- and memory-consuming.
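The single-pass, low-latency constraint described above can be made concrete with a minimal sketch. This is illustrative only, not a method from the reviewed studies: each record is perturbed and released immediately using constant memory, whereas a multi-scan method would need the whole (unbounded) stream before releasing anything.

```python
import random

def perturb_stream(stream, noise_scale=0.1, seed=42):
    """Perturb each record exactly once, in arrival order, with O(1) memory.
    A generic batch PPDM method may instead need the full dataset up front
    (e.g. to compute global statistics), which is infeasible for an
    unbounded stream."""
    rng = random.Random(seed)
    for record in stream:  # single pass: each record is seen only once
        yield [x + rng.gauss(0, noise_scale) for x in record]

# Records are released one at a time, with minimal delay.
original = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
released = list(perturb_stream(iter(original)))
```

The generator form matters: it never buffers more than the current record, which is exactly the memory behavior stream-based privacy preservation requires.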

PPDM methods for data stream mining
Apart from a few exceptions, generic PPDM studies do not discuss the possibility of applying their methods to data streams or the difficulties of adapting them. The authors of Kadampur and Somayajulu (2008) mention that their field rotation and binning method cannot be applied to data streams: all data must be presented to the binning process at once, which is impossible given the incremental behavior of streams. The combined noise perturbation method proposed in Chidambaram and Srinivasagan (2014) is also difficult to adapt to data streams; according to the authors, the concept of multi-level trust used in this combined perturbation method is challenging to implement for streams. The condensation-based PPDM method proposed in Aggarwal and Yu (2008a) is the only method we could find among the selected articles that discusses the possibility of applying the method to data streams. It is suitable for both static data and dynamic data streams; however, for infinite data streams, a mechanism is needed to store a fixed number of condensed groups (Aggarwal and Yu 2008a).
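The idea of a fixed number of condensed groups can be sketched roughly as follows. This is a simplified illustration of the condensation principle, not the published algorithm: the real method in Aggarwal and Yu (2008a) assigns each record to its nearest group and maintains richer statistics, whereas here records are assigned round-robin for brevity.

```python
import math
import random

class CondensedGroup:
    """Keeps only aggregate statistics (count, per-dimension sum and sum of
    squares) for the records assigned to it, so raw records never need to
    be stored."""
    def __init__(self, dim):
        self.n = 0
        self.s = [0.0] * dim
        self.sq = [0.0] * dim

    def add(self, record):
        self.n += 1
        for i, x in enumerate(record):
            self.s[i] += x
            self.sq[i] += x * x

    def synthesize(self, rng):
        """Draw one pseudo-record from the group's mean and variance."""
        out = []
        for i in range(len(self.s)):
            mean = self.s[i] / self.n
            var = max(self.sq[i] / self.n - mean * mean, 0.0)
            out.append(rng.gauss(mean, math.sqrt(var)))
        return out

# A fixed pool of groups bounds memory even for an unbounded stream.
# (Round-robin assignment is a simplification of nearest-group assignment.)
rng = random.Random(0)
groups = [CondensedGroup(dim=2) for _ in range(4)]
stream = [[1.0, 2.0], [1.2, 2.1], [5.0, 6.0], [5.1, 5.9]] * 3
for t, record in enumerate(stream):
    groups[t % 4].add(record)
pseudo = [g.synthesize(rng) for g in groups]
```

Because each group stores a constant amount of state regardless of how many records pass through it, memory stays bounded even when the stream does not end.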
The literature contains both PPDSM methods implemented specifically for data streams and modified versions of generic PPDM methods adapted for streams. These PPDSM methods have been designed to overcome the common challenges of data streams discussed above. We divided them into perturbation, non-perturbation (anonymization), and other methods, based on the main techniques they use. Most methods are anonymization-based non-perturbation methods, followed by perturbation methods. A small proportion of PPDSM methods use other distortion techniques such as differential privacy, fuzzy logic, and PCA. Although we roughly assign the methods to these categories, we observed that there is no clear boundary between them: many of the methods combine techniques from different categories.
Anonymization-based non-perturbation methods are among the most used PPDSM techniques for data stream mining. An anonymization method called FAST was presented in Mohammadian et al. (2014) for fast privacy preservation in data streams with low information loss. This method uses a multithreading technique with k-anonymization and can be used for clustering. Another PPDM method for clustering using k-anonymization for data streams was introduced in Mohamed et al. (2017). This method is scalable and can be used with low communication cost and low information loss for distributed data streams. A continuous anonymization method called "CASTLE" was implemented using k-anonymity and l-diversity in Cao et al. (2011). CASTLE can manage outliers and release data with minimum delay but can be vulnerable to inference-related attacks. Microaggregation-based differentially private anonymization has been proposed for classification in Khavkin and Last (2019). This method deals with concept drift by applying the Kolmogorov-Smirnov statistical test and minimizes information loss and possible disclosure risks. The privacy preservation method discussed in Rajalakshmi and Mala (2013) uses a frequency discretization technique similar to anonymization. Moreover, sliding window-based anonymization methods were discussed in Wang et al. (2018) and Navarro-Arribas and Torra (2014). The fast anonymization method proposed in Wang et al. (2018) can be used for association rule mining, and another sliding window-based method facilitates high-speed data processing with small memory requirements. Anonymization based on rank swapping in a sliding window, discussed in Navarro-Arribas and Torra (2014), can reduce information loss by swapping selected tuples from the sliding window but can be impractical for infinite data streams.
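The sliding-window anonymization idea common to these methods can be sketched as follows. This is a hypothetical minimal illustration, assuming a microaggregation-style generalization (replacing each attribute with the window average); it is not any of the cited algorithms, which add considerably more machinery (l-diversity, rank swapping, outlier handling).

```python
from statistics import mean

def anonymize_window(stream, k=3):
    """Buffer at most k records; once the window fills, release all k of
    them with every attribute replaced by the window average, a
    microaggregation-style generalization. Release delay is bounded by k."""
    window = []
    for record in stream:
        window.append(record)
        if len(window) == k:
            centroid = [mean(col) for col in zip(*window)]
            for _ in window:
                yield centroid[:]  # k identical records form a k-anonymous batch
            window.clear()
    # In practice, leftover records (fewer than k) need special handling.

stream = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0],
          [4.0, 40.0], [5.0, 50.0], [6.0, 60.0]]
released = list(anonymize_window(iter(stream), k=3))
```

The window size k directly controls the trade-offs discussed in this section: a larger k gives stronger anonymity but more information loss and a longer release delay.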
Perturbation-based privacy preservation methods proposed for data stream mining include Chamikara et al. (2018), Martínez Rodríguez et al. (2017), Virupaksha and Dondeti (2021), Denham et al. (2020), and Rajalakshmi and Mala (2013). In Chamikara et al. (2018), "P2RoCAl", a combination of condensation, rotation, and random swapping, was proposed for data stream classification. P2RoCAl offers better accuracy than similar methods and is robust to data reconstruction attacks, but the condensed group size can affect performance. It can also be used in static environments. Statistical Disclosure Control (SDC) with different filters, such as noise addition, micro-aggregation, rank swapping, and differential privacy, has been used in Martínez Rodríguez et al. (2017); however, the noise addition filter carries a risk of disclosure. An anonymization method based on noise addition has been proposed in Virupaksha and Dondeti (2021) for clustering. This method chooses random noise within the subspace limits of the dense and non-dense subspaces to reduce information loss and enhance cluster identification. The random projection-based cumulative noise addition implemented in Denham et al. (2020) combines three perturbation techniques, random projection, translation, and noise addition, to achieve good accuracy and privacy. In addition to traditional independent noise addition, the authors of Denham et al. (2020) introduced a novel noise addition method that appears promising in performance. A random projection-based encryption method discussed in Rajalakshmi and Mala (2013) provides low computational cost with a good privacy level.
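The reason rotation-style perturbation (as used in P2RoCAl) can retain classification accuracy is that rotations preserve pairwise Euclidean distances. A minimal 2-D sketch of this property, illustrative rather than the published algorithm:

```python
import math

def rotate2d(record, theta):
    """Rotate a 2-D record by angle theta. Rotation preserves Euclidean
    distances, which is why distance-based learners keep their accuracy
    on rotated data."""
    x, y = record
    return [x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta)]

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

a, b = [1.0, 2.0], [4.0, 6.0]
theta = math.pi / 3
ra, rb = rotate2d(a, theta), rotate2d(b, theta)
# Individual values change (privacy), but the pairwise distance does not (utility).
```

This distance preservation is also why purely geometric transformations need combining with other techniques (swapping, noise), as the cited methods do: a known rotation can be inverted, so the transformation alone gives weak privacy.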
Other privacy-preserving methods proposed for data stream mining use different techniques such as fuzzy logic and PCA (Rajesh et al. 2012), differential privacy (Chamikara et al. 2019; Katsomallos et al. 2022; Gondara et al. 2022), sliding windows (Lin et al. 2016), and hashing (Nyati et al. 2018). These methods try to overcome the challenges of data stream mining by using different techniques. We observed that numerous methods proposed for PPDSM fall under the category of output PPDSM (Kotecha and Garg 2017; Zhang and Li 2019), which is outside our scope. Figure 4 provides an idea of how data stream-based PPDM methods are spread over different data mining tasks. As with the generic PPDM methods, most PPDSM methods can be applied to classification; the second-highest applicability was achieved by clustering. A few PPDSM methods have been proposed specifically for association rule mining. Some PPDSM methods can be applied to more than one data mining algorithm, which indicates their generalizability.
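Differential privacy fits streams well because noise can be added per record in a single pass. A minimal sketch of the standard Laplace mechanism applied record-by-record; this is the textbook mechanism, not the specific algorithms of the cited stream methods, and the parameter values are illustrative.

```python
import math
import random

def laplace(rng, scale):
    """Sample Laplace(0, scale) noise via the inverse-CDF transform."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_stream(values, epsilon, sensitivity=1.0, seed=1):
    """Release each stream value with Laplace noise of scale
    sensitivity/epsilon. Privacy is applied per record, so no second
    pass over the stream is ever required."""
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    for v in values:
        yield v + laplace(rng, scale)

released = list(dp_stream([10.0, 20.0, 30.0], epsilon=1.0))
```

A caveat the stream-DP literature cited above addresses: naively spending epsilon on every record exhausts the privacy budget over an unbounded stream, which is why dedicated budget-allocation schemes exist.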

Addressing RQ3-accuracy-privacy trade-off
The accuracy-privacy trade-off is the most common issue in PPDM/PPDSM and should be addressed appropriately to obtain maximum performance. Otherwise, the objective of PPDM methods, which is to effectively protect private data while maintaining the knowledge in the original data (Lin et al. 2016; Denham et al. 2020), can be violated. We observed that many researchers have identified and discussed this trade-off, while some studies try to provide possible solutions. In this section, by answering RQ3, we discuss to what extent the accuracy-privacy trade-off has been addressed.

Accuracy-privacy trade-off in generic PPDM methods
Metrics of accuracy and privacy are helpful for understanding the trade-off between the two properties. Tables 7 and 8 compile the different evaluation metrics and measures used to calculate privacy and accuracy for generic PPDM methods. The most commonly used measure of accuracy is the error/accuracy of the data mining task, while a few privacy preservation methods, such as anonymization and rule hiding, use other techniques to measure accuracy. Differential privacy and measuring privacy after various attacks are the most used privacy measures. Moreover, some methods use metrics such as VD, RP, RK, CP, and CK to measure privacy. Overall, the ways of measuring privacy and accuracy are mostly specific to the data mining and privacy preservation techniques used.
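To make the metrics concrete, here is a minimal sketch of the VD (value difference) metric mentioned above together with a simple accuracy-loss measure. The VD formulation as a relative Frobenius-norm difference is one common definition; treat the exact normalization as an assumption of this sketch.

```python
import math

def value_difference(original, perturbed):
    """VD metric: relative Frobenius-norm difference between the original
    and perturbed data. Higher VD means more distortion, which usually
    means more privacy and less accuracy."""
    num = math.sqrt(sum((x - y) ** 2
                        for row, prow in zip(original, perturbed)
                        for x, y in zip(row, prow)))
    den = math.sqrt(sum(x ** 2 for row in original for x in row))
    return num / den

def accuracy_loss(acc_original, acc_perturbed):
    """Drop in data mining accuracy caused by privacy preservation."""
    return acc_original - acc_perturbed

X = [[1.0, 2.0], [3.0, 4.0]]
Xp = [[1.1, 2.1], [2.9, 4.2]]
vd = value_difference(X, Xp)
```

Plotting VD against accuracy loss for different perturbation strengths is one simple way to visualize the trade-off curve this section discusses.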
Regarding the accuracy-privacy trade-off discussion, we first look at the generic PPDM methods discussed in Sect. 3.1.
Research work such as (Putri and Hira 2017; Vijayarani and Tamilarasi 2011; Lohiya and Ragha 2012; Alotaibi et al. 2012; Upadhyay et al. 2018; Chamikara et al. 2020; Kiran and Vasumathi 2020; Tsiafoulis et al. 2012; Arumugam and Sulekha 2016) has identified the existing accuracy-privacy trade-off, while (Zaman et al. 2016; Peng et al. 2010; Nethravathi et al. 2016; Chidambaram and Srinivasagan 2014; Javid and Gupta 2020; Liu et al. 2019; Sowmyarani et al. 2013; Xiaoping et al. 2020; Sun et al. 2014; Aggarwal and Yu 2008a; Upadhayay et al. 2009) discusses the matter in detail. The accuracy-privacy trade-off w.r.t. rotation perturbation has been discussed with extensive experimental results in Chen and Liu (2005). The authors of Giannella et al. (2013) discuss the nature of this trade-off in distance-preserving PPDM methods with examples. The accuracy-privacy behavior of p-sensitive t-closeness was discussed in Sowmyarani et al. (2013): when p decreases, utility also decreases, but utility remains good at high t values. The accuracy-privacy trade-off of additive multiplicative perturbation has been discussed in Teng and Du (2009); there, the error and privacy increase as the trust level increases, which denotes a trade-off between accuracy and privacy.

Tables 7 and 8 (excerpt), accuracy and privacy evaluation metrics in generic PPDM (reference: accuracy metric | privacy metric):
Aggarwal and Yu (2008a): classification accuracy | different condensed group sizes
Ketel and Homaifar (2005): correlation metric | differential entropy
Giannella et al. (2013): - | breach probability using attacks
Sweeney (2002): - | based on attacks
Chen and Liu (2005): classification accuracy | multi-column privacy metric
Li and Venkatasubramanian (2007): average group size, discernibility metric | -
Carvalho and Moniz (2021): precision, recall, F-score | re-identification risk
Yang and Liao (2022): loss rule rate | hiding failure rate

Numerous studies have proposed different solutions to optimize or reduce the accuracy-privacy trade-off.
Combining suppression and perturbation to minimize the loss caused by generalization in anonymization has been proposed in Kaur (2017). Performing perturbation only on sensitive attributes to achieve a high accuracy level while maintaining good privacy, using NMF and SVD, has been discussed in Wang and Zhang (2007). The t-closeness anonymization proposed in Li and Venkatasubramanian (2007) discusses how the t parameter can be tuned to achieve a good trade-off, while Soria-Comas et al. (2016) try to achieve a better trade-off by applying t-closeness through micro-aggregation. The anonymization-based clustering methods proposed in (Nayahi and Kavitha 2017; Babu and Jena 2011) show that the number of clusters formed determines the trade-off: as the number of clusters increases, accuracy increases and privacy decreases. The authors of Mukherjee et al. (2008) try to achieve a better accuracy-privacy trade-off by combining PCA and additive noise, but privacy increases while accuracy decreases when more noise is added. The differential privacy-based approach proposed in Mivule et al. (2012) uses an ensemble classifier to calculate the error; this is repeated until a pre-defined threshold is achieved, and the Laplace noise added for differential privacy is re-adjusted if the threshold cannot be met.
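The iterate-until-threshold idea behind such feedback approaches can be sketched as follows. This illustrates only the loop structure, not the exact mechanism of Mivule et al. (2012): Gaussian noise stands in for Laplace noise, and a toy utility function stands in for the ensemble classifier error; all names and parameters are assumptions of this sketch.

```python
import random

def calibrate_noise(data, utility, threshold, scale=1.0, seed=7, max_iter=20):
    """Feedback loop: perturb the data, evaluate a utility score, and halve
    the noise scale until the score reaches a pre-defined threshold.
    `utility` is any callable scoring the perturbed data (a stand-in for
    classifier accuracy)."""
    rng = random.Random(seed)
    for _ in range(max_iter):
        noisy = [x + rng.gauss(0, scale) for x in data]
        if utility(noisy) >= threshold:
            return noisy, scale
        scale *= 0.5  # too much distortion: re-adjust the noise downwards
    return noisy, scale

# Toy utility: closeness of the perturbed mean to the true mean.
data = [1.0, 2.0, 3.0, 4.0]
true_mean = sum(data) / len(data)
util = lambda d: 1.0 - abs(sum(d) / len(d) - true_mean)
noisy, final_scale = calibrate_noise(data, util, threshold=0.95)
```

The loop makes the trade-off explicit as a tunable knob: the accuracy threshold is fixed, and privacy (noise scale) becomes whatever remains achievable under it.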
A rotational transformation was combined with a translation in Singh and Batten (2013) to obtain a good privacy level with low accuracy loss. The authors of Teng and Du (2009) try to balance the trade-off between accuracy and privacy using a multi-group approach. The perturbation method "NRoReM" (Paul et al. 2021) was implemented to optimize the accuracy-privacy trade-off by combining normalization, geometric rotation, linear regression, and scalar multiplication. In addition to the above-discussed methods, research work such as Feyisetan et al. (2020), Hasan et al. (2019), Kabir et al. (2007a), Li and Xue (2018), Kim et al. (2012), and Chamikara et al. (2021) also implemented different techniques to optimize the accuracy-privacy trade-off.
Some research works have proposed interesting PPDM techniques but have not paid much attention to the accuracy-privacy trade-off (Tsai et al. 2016; Tang et al. 2019; Ashok and Mukkamala 2011; Hong et al. 2010; Meghanathan et al. 2014; Li and Wang 2011; Kaur and Bansal 2016; Gokulnath et al. 2015; Oishi 2017; Lin et al. 2015; Miyaji and Rahman 2011; Ketel and Homaifar 2005; Hong et al. 2011). Table 9 presents the evaluation metrics used to measure accuracy and privacy in data streaming environments. Information loss and classification accuracy are the most frequently used accuracy evaluation metrics, while differential privacy and calculating the breach probability by performing attacks are the most commonly used privacy measures in data stream mining environments.

Accuracy-privacy trade-off in data stream mining
Some PPDM methods proposed for data streams have also attempted to optimize the accuracy-privacy trade-off. Given the challenges identified in Sect. 3.2.1, it is clear that handling the trade-off in a streaming environment is rather complex. Nevertheless, some methods have discussed this issue (Gitanjali et al. 2010; Lin et al. 2016; Cao et al. 2011), while others have tried to address it using different techniques (Zhang and Li 2019; Khavkin and Last 2019; Chamikara et al. 2019, 2020; Denham et al. 2020).
Sequential Backward Selection (SBS) of the greedy algorithm with k-fold cross-validation to select the optimal model in NB classification has been used in Zhang and Li (2019) to achieve a balanced accuracy-privacy trade-off. Micro-aggregation based differentially private stream anonymization has been proposed in Khavkin and Last (2019), and the trade-off has been evaluated using the disclosure risk and the Area Under the Curve (AUC) of the classifier. The differential privacy-based PPDM method "SEAL" (Chamikara et al. 2019) has been proposed as a solution to the accuracy-privacy trade-off in data stream mining. To optimize the trade-off, it provides the flexibility to select privacy parameters according to the domain and dataset, and it maintains the shape of the original data distribution after noise addition to preserve accuracy. P2RoCAl (Chamikara et al. 2020) pursues the same goal by combining condensation and rotation. The random projection-based cumulative noise addition of Denham et al. (2020) adds noise with a small variance cumulatively to minimize the effect on accuracy while maintaining good privacy; this method has been experimentally shown to achieve a better trade-off. The authors of Hewage et al. (2022) have proposed a novel random projection-based noise addition method using Denham et al. (2020) as the base technique; it uses a logistic function to control the noise level while still adding it cumulatively to achieve a high accuracy level. Meanwhile, some interesting PPDM methods in data stream mining do not include any discussion of the accuracy-privacy trade-off (Mohammadian et al. 2014; Rajesh et al. 2012; Mohamed et al. 2017; Martínez Rodríguez et al. 2017; Nyati et al. 2018; Rajalakshmi and Mala 2013; Navarro-Arribas and Torra 2014; Virupaksha and Dondeti 2021; Wang et al. 2018).

Table 9 (excerpt), accuracy and privacy evaluation metrics for data streams (reference: accuracy metric | privacy metric):
Gondara et al. (2022): classification accuracy, MSE | differential privacy
Katsomallos et al. (2022): mean absolute error | temporal privacy loss

Figure 5 gives an overall idea of the extent to which the PPDM research community has paid attention to the accuracy-privacy trade-off. It can be seen that the trade-off receives less attention in data stream-based PPDM methods than in generic PPDM methods. However, in both areas, some research work tries to solve this issue (30.26% in generic PPDM and 23.81% in stream-based PPDM).
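The cumulative noise addition discussed above can be sketched in a few lines. This shows only the cumulative-noise component; the full method in Denham et al. (2020) also applies random projection and translation, and the step size here is an illustrative assumption.

```python
import random

def cumulative_noise_stream(stream, step_sd=0.01, seed=3):
    """Instead of drawing independent noise per record, a small-variance
    increment is added to a running noise term, so consecutive releases
    drift smoothly and the per-record distortion stays small."""
    rng = random.Random(seed)
    noise = None
    for record in stream:
        if noise is None:
            noise = [0.0] * len(record)
        # Each step perturbs the running noise term by a tiny increment.
        noise = [n + rng.gauss(0, step_sd) for n in noise]
        yield [x + n for x, n in zip(record, noise)]

released = list(cumulative_noise_stream(iter([[1.0, 1.0]] * 5)))
```

The design intuition matches the trade-off discussion: the noise accumulated over the stream provides privacy against reconstruction, while each individual release stays close to its original record, limiting the accuracy loss.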

Discussion
In this section, we discuss how we addressed the gaps in the existing secondary studies by answering the formulated research questions. In RQ1 and its subsections, we identified the generic PPDM methods, their strengths and weaknesses, and the applicable data mining tasks. There are two broad categories of PPDM methods, called input and output PPDM. We considered only input PPDM methods, as output PPDM methods focus on changing the data mining output, which is a different scenario. After reviewing the selected primary studies, it was found that a plethora of techniques have been proposed for privacy preservation in data mining. We divided them into four main categories: secure multi-party computation, perturbation methods, non-perturbation methods, and combinations of the above. These methods cover most supervised and unsupervised learning techniques (classification, clustering, and association rule mining) in data mining. Most PPDM methods can be used for classification, and a considerable number can be applied to more than one data mining algorithm (refer to Fig. 3). We also observed that several studies lack standard accuracy and privacy evaluations after applying a data mining algorithm; this is an area where PPDM studies can be improved. Applying new techniques to a data mining algorithm using real or synthetic datasets provides validity and clarity and helps identify the aspects needing improvement.
These generic PPDM methods have different strengths and weaknesses in terms of privacy, accuracy, time consumption, and more. The main reason for this is the nature of the techniques used to preserve privacy: different techniques affect different characteristics differently. Because of this variety of strengths and weaknesses, several factors, such as the size of the dataset, the domain, and the sensitive data contained, should be considered when selecting a PPDM method. There are no pre-defined criteria for deciding on the appropriate PPDM method for a specific dataset; the properties of the dataset and the characteristics of the privacy preservation technique should guide this decision. All the facts found in the review are summarized in Tables 1 to 6.
A categorization model of all the input PPDM methods was created by analyzing the extracted data and considering the existing categorizations. This model helps to grasp the overall picture of existing PPDM methods. Figure 6 illustrates the categorization model created, summarizing all the generic PPDM techniques.
For RQ2, we investigated the applicability of PPDM techniques to data stream mining. It was observed that most generic PPDM methods cannot be used directly for data stream mining because of the challenging nature of data streams, including their incremental nature, high speed, vast or infinite volume, and possible concept drift. Therefore, generic PPDM methods need improvements and amendments to be used successfully in PPDSM. We also observed that most generic PPDM studies do not discuss the applicability of their methods to data streams; such a discussion would be helpful for future development. Numerous PPDSM methods proposed for data stream mining improve existing generic PPDM methods or combine different techniques. Most of these methods use anonymization techniques to preserve privacy in data streams. Perturbation methods such as noise addition are also applicable because noise can be added independently to a single record at a time. While these methods can overcome most of the challenges in data stream mining, they still have concerns, such as concept drift handling and time and computational complexity, that need to be improved. We also looked into the applicability of PPDSM methods to different data mining techniques. It was found that the majority of the proposed PPDSM methods can be used for classification, followed by clustering. Interestingly, several PPDSM methods are available for more than one data mining task, which shows the generalizability of PPDSM methods in data stream mining (refer to Fig. 4).
The well-known trade-off between data mining accuracy and data privacy was investigated in answering RQ3 to determine how it has been addressed. A majority of research work has identified and discussed the accuracy-privacy trade-off, but little effort has been made to propose techniques to address the issue. The conflicting nature of accuracy and privacy is the main reason for this. Although some methods have been proposed to optimize the accuracy-privacy trade-off, it is impossible to simultaneously achieve ideal values for both measures. Some methods have achieved this to some extent, but there is still considerable room for improvement. The techniques proposed to optimize the trade-off in data stream mining are fewer in number than those for generic PPDM (refer to Fig. 5). The proposed methods to optimize the trade-off include different techniques, such as preserving more statistical information, making changes only to sensitive attributes, parameter optimization, and considering users' privacy requirements. We believe optimizing the accuracy-privacy trade-off has not received the attention it deserves, especially in PPDSM.

Conclusions and future directions
A significantly higher number of works address privacy in generic data mining compared to techniques that apply to stream-based data mining. A positive remark is that the proposed PPDM methods can be used with different data mining algorithms in both supervised and unsupervised learning. However, all these methods have strengths and weaknesses stemming from the techniques used to preserve privacy. Although the PPDM research community has identified the trade-off between data mining accuracy and data privacy, there is a lack of research implementing techniques, with extensive experimentation, to optimize this trade-off. In particular, PPDSM greatly needs techniques to optimize the accuracy-privacy trade-off in data stream mining. Our findings from this study can be summarized as follows: 1. A plethora of studies propose different privacy-preserving techniques for PPDM.
• The existing generic PPDM methods can be divided into four categories, namely SMC, perturbation, non-perturbation, and combinations of the above. • These PPDM methods can be used for different data mining algorithms, including classification, clustering, and association rule mining. Numerous methods work well on more than one data mining algorithm. • The existing PPDM methods have different strengths and weaknesses in several areas, including accuracy, privacy, and time complexity. These are caused by the techniques used to preserve privacy.
2. Various studies have implemented privacy preservation for data stream mining. • Data streams behave differently from static datasets due to characteristics such as high volume, high speed, and concept drift. Therefore, privacy preservation in data stream mining is rather challenging. • Most generic PPDM methods cannot be used for PPDSM as-is and need improvements to adapt to the behavior of data streams.
3. The trade-off between data mining accuracy and data privacy is one of the main issues in PPDM that needs more attention.
• Evaluating accuracy is straightforward. However, privacy evaluation is a complicated task; generally, it depends on the data mining technique and the privacy preservation technique. • The most used accuracy evaluation metric is data mining accuracy, while privacy is measured by performing attacks or using other metrics such as differential privacy. • Many studies have identified and discussed the accuracy-privacy trade-off in PPDM. • Numerous studies have proposed and improved advanced PPDM techniques to optimize this trade-off in generic PPDM. • Only a few studies focus on optimizing the accuracy-privacy trade-off in PPDSM.
We considered only input PPDM methods in this study. As a future direction, we therefore suggest an investigation of output PPDM methods and how they can optimize the accuracy-privacy trade-off, as output PPDM is often used in data stream mining.
Funding Open Access funding enabled and organized by CAUL and its Member Institutions.

Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.