Differentially private tree-based redescription mining

Differential privacy provides a strong form of privacy while preserving most of the original characteristics of the dataset. Utilizing these benefits requires one to design specific differentially private data analysis algorithms. In this work, we present three tree-based algorithms for mining redescriptions while preserving differential privacy. Redescription mining is an exploratory data analysis method for finding connections between two views over the same entities, such as the phenotypes and genotypes of medical patients. It has applications in many fields, including some, like health care informatics, where privacy-preserving access to data is desired. Our algorithms are the first tree-based differentially private redescription mining algorithms, and we show via experiments that, despite the inherent noise in differential privacy, they can return trustworthy results even on smaller datasets, where noise typically has a stronger effect.


Introduction
Modern data analysis methods have had a huge impact on analysing health and bioinformatics data, with personalised medicine (Fröhlich et al, 2018), computer-aided diagnoses (Van Ginneken et al, 2001), and new drugs (Napolitano et al, 2013) being just some examples. Exploratory data analysis still has much more to contribute to these fields, but access to the data is often restricted due to the sensitivity of health information. The standard procedure for requesting access to such data typically involves submitting a detailed research plan, which is often required to take the form of a confirmatory study, meaning that the researchers must present a hypothesis to be tested. This standard procedure is ill-suited for exploratory analysis, which relies on trying out a number of different approaches in order to see whether the data reveals some novel knowledge. As a consequence, requests for permission to access data for exploratory research often get rejected.
The particular method of exploratory data analysis that we focus on in this paper is redescription mining. It aims at finding pairs of queries over the data attributes such that the queries describe approximately the same sets of entities. It can be used, among other things, to find connections between patient diagnoses and lab test results; for instance, finding that patients with the following laboratory results
1. normal total iron binding capacity, with total CO2 and hemoglobin not tested; or
2. abnormal total CO2 and normal eosinophil levels, with triglycerides not tested; or
3. normal CO2 and blood potassium levels, with polys (neutrophils) levels not tested
suffer either from chronic pulmonary heart diseases, hypovolemia (decreased blood volume) or hypertension (and vice versa). Information like this, obtained from real-world data with a method proposed in this paper, can be used to guide doctors to best practices (which lab tests should be ordered), to identify particularly strong or weak combinations of tests, or to improve diagnoses.
The problem with the above data analysis is that it is often impossible to do, as the data are confidential. And even if it were done by a trusted party on the hospital data, sharing the results would still be a problem, as it could lead to leaking private patient information.
The solution to these problems lies in the well-known method of differential privacy (Dwork et al, 2006). By giving researchers differentially private access to the dataset, the data owner can rest assured that the researchers will not be able to infer information about any individual in the dataset; this provides much stronger privacy guarantees than standard anonymisation methods. Consequently, the data owner can give researchers differentially private access to the data under a more lenient application process, thus allowing them to experiment with new methods more easily. Also, the results can be shared without fear of leaking sensitive information.
In addition to providing strong theoretical privacy guarantees, differential privacy has the advantage of allowing the use of unchanged, original data via different mechanisms that guarantee the privacy of the subjects in all information obtained, and potentially publicly released, using these mechanisms. A very important part of this information can be lost when using other techniques for preserving privacy. For example, perturbation-based techniques affect the range and values of the attributes (Agrawal and Aggarwal, 2001), projection-based perturbation techniques additionally lose information about the meaning and number of the original attributes (Kenthapadi et al, 2013; Liu et al, 2006), and fuzzy logic-based perturbation techniques potentially obfuscate the numerical boundaries and ranges of the original attributes (Jahan et al, 2014).
On the other hand, developing a differentially private algorithm for a pattern-based data mining task, such as redescription mining, is perhaps even harder than for many other applications of differential privacy, such as classification. A small perturbation can break a pattern, while it would typically not have any major impact on classification.
In this paper, we propose three tree-based differentially private redescription mining algorithms and study their properties. Two algorithms closely follow the alternation scheme of the well-known CARTwheels algorithm (Ramakrishnan et al, 2004) using differentially private decision trees: the AltExpM algorithm is based on the exponential mechanism (Friedman and Schuster, 2010) and the AltMCMC algorithm embeds differential privacy into the decision tree construction (Bai et al, 2017). The third algorithm, denoted MCMCTreePair, is a differentially private redescription mining algorithm that uses a completely novel technique for embedding differential privacy into the construction of pairs of decision trees.
Concretely, we present the following contributions:
• A novel methodology for the construction of differentially private redescriptions (extractReds) using a pair of differentially private decision trees. This technique allows using much lower noise levels for differential privacy and, consequently, produces more accurate results.
• A novel algorithm (sampleTreePair) for the creation of a differentially private pair of decision trees using MCMC sampling. This extends the ideas of Bai et al (2017) to simultaneously create pairs of differentially private trees with matching leaves, so that they are suitable for redescription construction.
• An adaptation of the alternating procedure for redescription construction (Ramakrishnan et al, 2004) to the setting where differential privacy needs to be preserved (createRedsAlt).
• An adaptation of an existing differentially private decision tree algorithm based on MCMC sampling (Bai et al, 2017) for creating differentially private redescriptions (createDTMCMC). Our adaptation uses a fixed tree depth and avoids unnecessary computations.
• A similar adaptation of an existing differentially private decision tree algorithm based on the exponential mechanism (Friedman and Schuster, 2010) for creating differentially private redescriptions (createDTExpMech).
We performed a thorough experimental evaluation of the three developed approaches for differentially private redescription mining. The obtained results show that the MCMCTreePair algorithm outperforms the AltMCMC and AltExpM algorithms on smaller datasets (although not always significantly), but AltMCMC and AltExpM significantly outperform the MCMCTreePair algorithm on medium-sized datasets. As one allows more and more redescriptions to be generated during the experiments (which necessitates a stricter privacy setting), MCMCTreePair outperforms AltMCMC and AltExpM on an increasing number of medium-sized datasets as well. Additionally, it is possible that AltMCMC and AltExpM do not produce any redescriptions in some runs. This occurs with relatively small probability (≤ 0.1 on our test datasets), but it depends entirely on the quality of the input data, which can only be influenced by the data providers. For this reason, we propose a way to significantly reduce the probability of such failure (to 0 in our experiments). This procedure lowers the overall performance of the AltMCMC and AltExpM algorithms, reducing their advantage over the MCMCTreePair algorithm.
To conclude, all three approaches have their advantages and there are scenarios in which they excel. The MCMCTreePair algorithm excels on small, potentially unprepared data, or where a strict privacy setting is required. The robust versions of the AltMCMC and AltExpM algorithms excel on medium-sized or large data, potentially not preprocessed, where a medium or small number of repeated differentially private experiments needs to be performed. Finally, the non-robust AltMCMC and AltExpM algorithms excel on well-preprocessed medium-sized or large datasets where very few, or just one, differentially private experiments need to be performed.
The rest of the paper is organized as follows. As the paper combines two topics, we start with brief primers and reviews of related work for each of them: Section 2 is about redescription mining and Section 3 about differential privacy. The tree-based differentially private algorithms are presented in Section 4, the experimental evaluation in Section 5, and conclusions in Section 6.

Primer on Redescription Mining
Redescription mining aims at finding two different ways to describe almost the same observations. In this paper, the dataset consists of a pair of tables, denoted D_L and D_R, i.e., D = (D_L, D_R). Both tables are over the same entities (rows) but have disjoint collections of attributes (columns). The set of attributes of a table D_X is denoted attrs(D_X), while view(a) indicates which table a given attribute a belongs to, so that for any a ∈ attrs(D_X), view(a) = X.
The pair of data tables, together with user-defined constraints such as, in particular, minimum support and minimum accuracy, constitute the input of redescription mining. In this context, a redescription consists of a pair of logical queries, one over either data table, respectively denoted q_L and q_R. More specifically, a query consists of literals over Boolean, categorical or numerical attributes, combined with logical conjunction and disjunction operators, possibly involving negations. For instance, the query

    (¬ hypertension ∧ state = California) ∨ (birth year < 1950)

describes entities, in this case individuals, who either do not suffer from hypertension and reside in California, or are born before 1950. The collection of entities described by a query is called the support of the query, denoted supp(q). In other words, supp(q) is the set of entities that satisfy the query q.
The main quality of a redescription is that the queries have similar supports. The similarity of supports is typically measured using the Jaccard similarity index. Hence, the accuracy of a redescription (q_L, q_R) is defined as

    J(q_L, q_R) = |supp(q_L) ∩ supp(q_R)| / |supp(q_L) ∪ supp(q_R)|.

In addition to having a high accuracy, to be of interest a redescription should cover neither too few nor too many entities. This is typically ensured by setting minimum and maximum support thresholds. Finally, the size of the intersection of the queries' supports should not be a direct consequence of their respective supports. For instance, if both queries cover almost the entire dataset, it is not at all surprising, and hence not interesting, that they have a large overlap. This is controlled by computing p-values representing the probability that the intersection of the supports of two randomly chosen queries is as large as or larger than observed in the studied redescription. Queries are randomly chosen with a priori probabilities equal to the fractions of entities described by the redescription's queries, under the assumption that all entities can be sampled with equal probability. Redescriptions with p-values over a chosen significance threshold (typically 0.01) are then deemed uninteresting. For details about the computation of p-values and a more extensive account of redescription mining, we direct the reader to the work of Galbrun and Miettinen (2018b) and references therein.
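To make the quality criteria concrete, here is a minimal sketch of the Jaccard accuracy and the support-threshold check. The helper names and thresholds are our own illustration, not the paper's implementation.

```python
# Sketch: redescription accuracy as the Jaccard index of the two supports,
# plus an illustrative minimum/maximum support check. Supports are sets of
# entity identifiers.

def jaccard(supp_l, supp_r):
    """Jaccard similarity of two query supports."""
    union = supp_l | supp_r
    if not union:
        return 0.0
    return len(supp_l & supp_r) / len(union)

def within_support_bounds(supp_l, supp_r, min_supp, max_supp):
    """Check that the shared entities stay within the support thresholds."""
    return min_supp <= len(supp_l & supp_r) <= max_supp
```

For example, supports {1, 2, 3, 4} and {2, 3, 4, 5} share three of five distinct entities, so the redescription's accuracy is 0.6.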

Related Work
Since redescription mining was introduced by Ramakrishnan et al (2004), various algorithms have been proposed for the task. Some algorithms are based, for instance, on iteratively growing the queries in a greedy fashion (Gallo et al, 2008; Galbrun and Miettinen, 2012a), or on mining and combining frequent itemsets (Gallo et al, 2008).
Several algorithms, including the pioneering one (Ramakrishnan et al, 2004), are based on decision trees, be it classification trees (Zinchenko et al, 2015), predictive clustering trees (PCT) (Mihelčić et al, 2017) or random forests of PCTs (Mihelčić et al, 2018). These algorithms build the queries by inducing a decision tree over one side of the data, using the current query on the other side or an initialization vector as the target, and alternating between the sides. Different algorithms are obtained depending on precisely how the trees are induced: growing entire trees to full depth at once (Ramakrishnan et al, 2004; Mihelčić et al, 2017), progressively increasing the depth of the trees as in the SplitT algorithm (Zinchenko et al, 2015), or growing subtrees to expand existing branches as in the LayeredT algorithm (Zinchenko et al, 2015). For reasons that will become clear, methods that rely on tree induction are the most relevant to us, and the approach we propose also belongs to this family.
Initially limited to Boolean data, the task was later extended to numerical attributes by Galbrun and Miettinen (2012a) and Zinchenko et al (2015) (following the greedy and tree-based approaches, respectively).

Primer on Differential Privacy
Differential privacy ensures that it cannot be inferred, up to a fixed probability bound, whether or not a particular individual is included in a dataset, irrespective of what other information the adversary might have. This is done by withholding direct access to the data and only returning randomized answers.
Formally, if D and D′ are two datasets that differ in exactly one row, A is a randomized algorithm operating on these tables, S is any set of potential results A might return, and ε > 0 is the privacy parameter (or budget), then A is ε-differentially private (later also simply ε-private) if

    Pr[A(D) ∈ S] ≤ e^ε · Pr[A(D′) ∈ S],

where the probability is over the randomness of A (see Dwork and Roth, 2014). This property ensures that a change in one row of a dataset has a limited impact on the output of A. Thus, very little information about any individual can be obtained if a sufficiently strict privacy level is enforced. Two important features of differential privacy are composability and resistance to post-processing. Composability means that if the user runs multiple algorithms on the data (or the same algorithm multiple times), then, assuming the sources of randomness are independent, the results are still differentially private: if the user runs k algorithms that are ε-private, the results are kε-private (McSherry, 2009) and, conversely, if one wants to preserve ε-privacy over k runs, each run must be ε/k-private. Resistance to post-processing means that the results of an ε-differentially private algorithm stay (at least) ε-differentially private after any post-processing that does not have access to the original data.
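Sequential composition reduces to simple budget bookkeeping. The following sketch illustrates the kε / ε-over-k arithmetic; the `BudgetAccountant` class is our own illustration and not part of any differential privacy library.

```python
# Sketch: sequential-composition accounting. Running k eps-private analyses
# spends k * eps in total, so to stay within a total budget over k runs,
# each run may use total / k.

class BudgetAccountant:
    def __init__(self, total_eps):
        self.total = total_eps
        self.spent = 0.0

    def per_run(self, k):
        """Budget available to each of k remaining runs."""
        return (self.total - self.spent) / k

    def spend(self, eps):
        """Record a run's expenditure; refuse to exceed the total budget."""
        if self.spent + eps > self.total + 1e-12:
            raise ValueError("privacy budget exceeded")
        self.spent += eps
```

With a total budget of ε = 1 and k = 4 planned runs, each run gets ε/k = 0.25, and a fifth run would be rejected.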
Composability has been a topic of intense study in differential privacy. The aforementioned kε-privacy bound cannot be improved in "pure" differential privacy, but if one weakens the privacy definition, one can obtain different trade-offs, using, e.g., the composition theorem of Dwork et al (2010) or concentrated differential privacy (Dwork and Rothblum, 2016). These methods are straightforward to apply to ε-differentially private mechanisms, but they do require knowledge of the acceptable privacy trade-off in the application domain. Therefore, further analysis using these weaker forms of privacy is out of scope for this manuscript.
When considering differentially private algorithms, an important concept is the sensitivity of the underlying functions. Assume that the differentially private algorithm A computes some deterministic function f of the data (e.g. counts a particular type of observations). The sensitivity of f is defined as ∆f = max_{D,D′} ||f(D) − f(D′)||_1, where the maximum is taken over all pairs of datasets D and D′ that differ in exactly one row.
When A is an algorithm that returns a number (e.g. a counting query), a typical way to ensure differential privacy is to use the Laplace mechanism (Dwork et al, 2006): noise drawn from the Laplace distribution Laplace(0, ∆f/ε) is added to the correct answer.
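A minimal sketch of the Laplace mechanism for a counting query follows; the helper names are ours. A counting query has sensitivity 1, since adding or removing one row changes the count by at most 1.

```python
import math
import random

# Sketch: the Laplace mechanism for a counting query. Noise is drawn from
# Laplace(0, sensitivity / epsilon) via inverse-CDF sampling and added to
# the true count.

def laplace_noise(scale, rng=random):
    """Sample from Laplace(0, scale) by inverting the CDF."""
    u = rng.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def private_count(true_count, epsilon, sensitivity=1.0, rng=random):
    """Return a differentially private version of a count."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)
```

Smaller ε means a larger noise scale, i.e. stricter privacy at the cost of accuracy.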
For more complex queries, the exponential mechanism (McSherry and Talwar, 2007) can be used. Here, instead of adding noise to the answer, the algorithm samples the final answer from the set of all answers. The probability of sampling a particular answer r is proportional to

    exp( ε · score_D(r) / (2 ∆score) ) · µ(r),

where score_D(r) is the quality of answer r on data D and µ is a measure assigning a base probability to r (µ is often uniform). That is, the exponential mechanism requires a way to assign a quality to each answer. The level of privacy, on the other hand, depends on the sensitivity of the quality measure, that is, on how much the quality can change with a change to a single record in the dataset.
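The sampling step above can be sketched as follows, assuming a uniform base measure µ; the function name is our own.

```python
import math
import random

# Sketch: the exponential mechanism with a uniform base measure. A candidate
# r is sampled with probability proportional to
# exp(eps * score(r) / (2 * sensitivity)).

def exponential_mechanism(candidates, score, eps, sensitivity, rng=random):
    scaled = [eps * score(r) / (2.0 * sensitivity) for r in candidates]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # shift for numerical stability
    x = rng.random() * sum(weights)
    for r, w in zip(candidates, weights):
        x -= w
        if x <= 0:
            return r
    return candidates[-1]
```

With a large ε (weak privacy), the mechanism almost always returns the highest-scoring candidate; as ε shrinks, the choice approaches uniform sampling.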

Related Work
After the initial formulation by Dwork et al (2006), building on their earlier work on the SuLQ framework (Blum et al, 2005), and the introduction of the exponential mechanism by McSherry and Talwar (2007), differential privacy has seen significant interest in academia, and recent years have brought uptake by industry as well (e.g. Ding et al, 2017). McSherry's (2009) PINQ is a framework for implementing differentially private data analysis using smaller building blocks. For example, Friedman and Schuster (2010) presented a differentially private ID3 decision tree algorithm using PINQ operations. Since then, differential privacy has been integrated into many fields of data mining and machine learning. Many classification algorithms, such as decision trees (Friedman and Schuster, 2010; Jagannathan et al, 2012; Bai et al, 2017; Gong et al, 2020), random forests (Rana et al, 2015), naive Bayes (Vaidya et al, 2013; Gong et al, 2020), SVM (Rubinstein et al, 2012; Li et al, 2014; Gong et al, 2020), k nearest neighbours (Gursoy et al, 2017), artificial bee colonies, and differential evolution (Zorarpacı and Özel, 2020), have been adapted to preserve differential privacy. Differential privacy has also been used in a semi-supervised setting (Jagannathan et al, 2013), where private data is augmented by publicly available data to create classifiers of improved accuracy. It has been integrated into (distributed) online learning (Gong et al, 2020) and regression algorithms (linear and logistic regression) (Gong et al, 2020).
Various types of neural and deep neural networks have been privatized using differential privacy (autoencoders, convolutional deep belief networks, long short-term memory networks, and generative neural networks) (Fan, 2020; Gong et al, 2020). Clustering (Nissim et al, 2007; Gong et al, 2020) and dimensionality reduction (Blum et al, 2005; Gong et al, 2020) are the most prominent unsupervised techniques privatized using differential privacy. In addition to PCA, various matrix factorization approaches (Berlioz et al, 2015; Balu and Furon, 2016) and more complex tensor decompositions (Wang and Anandkumar, 2016; Imtiaz and Sarwate, 2018) have been modified to preserve privacy using methods and techniques developed in the field of differential privacy. The creation of differentially private matrix factorization approaches is largely motivated by their application to the construction of differentially private recommender systems (McSherry and Mironov, 2009), where it can be very important to preserve the full anonymity of users. Differentially private association rule mining (Tsou et al, 2020), frequent itemset mining (Zeng et al, 2012), and frequent subgraph mining (Xu et al, 2016) are further examples of the integration of differential privacy with data mining techniques.
After the submission of this work, Karjalainen et al (2023) published a greedy differentially private redescription mining method. Their method follows the ReReMi approach of Galbrun and Miettinen (2012a); hence, the queries in their redescriptions are not based on decision trees and have a very different structure from the ones presented here. Evaluating the relative benefits of these two approaches is left for future work.

Algorithms
Fully differentially private redescription mining algorithms require parts of the computation to be performed on the secured server (where all data is stored). We use the notation object[cl] to denote an object that is stored in the memory of the client computer and object[ser] to denote an object that is stored in the memory of the secured server (thus not directly available to the client). Similarly, we use function[cl] for functions that are computed entirely on the client computer, function[ser] for functions computed entirely on the server, and function[inv] for functions that are started on the client but initiate (invoke) some computations on the server, potentially using data from both the server and the client (client data is sent to the server) and potentially returning some results that are differentially private and stored in the memory of the client computer.

Motivation
From the various approaches used to build redescriptions, in this work we focus on decision-tree based methods. Decision trees are easy to interpret, yet can express complex relations. Also, building differentially private decision trees is a well-studied problem (Fletcher and Islam, 2019), which allowed us to build on existing approaches. However, mining redescriptions (especially differentially private ones) imposes specific requirements on the decision tree induction algorithms, preventing any existing approach from being used directly in an efficient manner.
Of the different existing approaches, we decided to base one approach on the work of Friedman and Schuster (2010), which presents the creation of differentially private decision trees using a standard exponential mechanism to select the appropriate split at each step of the decision tree induction, and to base the other approach on the work of Bai et al (2017), which implements the exponential mechanism on the level of decision trees of a predefined depth using MCMC sampling. Using MCMC sampling, we do not need to instantiate all trees in the sampling space, but can still benefit from the efficient use of the privacy budget provided by the exponential mechanism.
Next, we will explain how the alternating redescription mining approach of Ramakrishnan et al (2004) must be adapted to work in a differentially private manner. This approach can use arbitrary differentially private decision tree creation procedures in a direct way to compute differentially private redescriptions. We use modified versions of two differentially private decision tree algorithms (Friedman and Schuster, 2010; Bai et al, 2017) as the building blocks of our algorithms, called AltExpM and AltMCMC, respectively. In Section 4.4, we will present our third algorithm, MCMCTreePair, which extends the ideas of Bai et al (2017) to allow sampling tree-pairs instead of individual trees. In Section 4.5, we explain the process of creating redescriptions using pairs of trees and computing their statistics (support and p-value) in a differentially private manner. This part is shared by all tree-based redescription mining algorithms. Section 4.6 explains our redescription pruning strategy, an important step for obtaining accurate differentially private redescriptions in the presence of potentially high levels of noise.
The asymptotic time complexities of the algorithms are analyzed in Appendix D.

Main Building Blocks
All redescription mining algorithms that will be presented in the following sections consist of three main parts: 1. Creation of an initial target variable.
2. Creation of differentially private trees.
3. Creation of redescriptions from pairs of trees.
Being supervised algorithms, decision trees need a target variable. As redescription mining is unsupervised, we need to create an initial target variable. It is created from an attribute of one of the data tables. For Boolean and categorical attributes, each distinct value forms a target class. Numerical attributes, on the other hand, are discretized using a standard equal-width binning approach, where the number of bins is derived using the Freedman-Diaconis rule (Freedman and Diaconis, 1981). Each bin forms one target class.
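This discretisation step can be sketched as follows (the helper names are ours; the paper does not publish this exact code). The Freedman-Diaconis rule sets the bin width to 2·IQR/n^(1/3), from which the number of equal-width bins is derived.

```python
import math

# Sketch: discretise a numerical attribute into equal-width bins, with the
# number of bins derived from the Freedman-Diaconis rule.

def freedman_diaconis_bins(values):
    """Number of equal-width bins by the Freedman-Diaconis rule."""
    n = len(values)
    xs = sorted(values)
    q1 = xs[int(0.25 * (n - 1))]  # simple quartile approximation
    q3 = xs[int(0.75 * (n - 1))]
    iqr = q3 - q1
    if iqr == 0:
        return 1
    width = 2.0 * iqr / n ** (1.0 / 3.0)
    return max(1, math.ceil((xs[-1] - xs[0]) / width))

def equal_width_labels(values, k):
    """Assign each value a target class label 0..k-1 by equal-width binning."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    step = (hi - lo) / k
    return [min(k - 1, int((v - lo) / step)) for v in values]
```

Each resulting label then plays the role of one target class for the tree learner.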
After the target variable is created, the process of differentially private tree construction begins. Different algorithms use different mechanisms for building trees; however, the main goal of this step is to create a sequence of tree-pairs with matching leaves (see Figure 1).
Pairs of differentially private trees are utilized to obtain differentially private redescriptions, with a specifically developed procedure (see Section 4.5 and Figure 1) that preserves differential privacy. The branch leading to a leaf in a decision tree can straightforwardly be turned into a logical conjunctive query, by combining the tests encountered along the branch (see Figure 1). A simple redescription is obtained by pairing the queries that correspond to a pair of leaves from opposite trees. Such simple redescriptions can be extended with logical conjunctive queries, obtained in either tree from the pair, using disjunctions and negations, to produce more complex, hopefully more accurate, redescriptions.

Figure 1: (Left) A pair of trees forming a redescription. Solid and dashed lines respectively represent "yes" and "no" branches, while boxes depict the leaves. Numbers in the leaves indicate the entities. They are rendered for illustrative purposes only and very faint, to emphasize the fact that they are not visible to the algorithm. The size of the intersection of entities for every leaf pair is indicated by the thickness and shade of the line between the leaves. The algorithm obtains these values using the Laplace mechanism. Redescriptions are obtained from matching paths to pairs of leaves that have a high (noisy) Jaccard index. (Right) Example redescription (q_L, q_R) with |supp(q_L) ∩ supp(q_R)| = 3 and J(q_L, q_R) = 3/4.
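The branch-to-query construction and the pairing of leaves described above can be sketched as follows. The helper names and the literal encoding are ours; the actual algorithm uses noisy intersection sizes obtained via the Laplace mechanism, which are omitted here.

```python
# Sketch: turn a tree branch into a conjunctive query and pair one leaf of
# each tree into a simple redescription with its (exact) Jaccard index.

def branch_to_query(tests):
    """tests: (attribute, followed_yes_branch) pairs along the branch."""
    lits = [attr if yes else "NOT " + attr for attr, yes in tests]
    return " AND ".join(lits)

def pair_leaves(branch_l, supp_l, branch_r, supp_r):
    """Build a simple redescription from one leaf of each tree."""
    union = supp_l | supp_r
    jacc = len(supp_l & supp_r) / len(union) if union else 0.0
    return branch_to_query(branch_l), branch_to_query(branch_r), jacc
```

Leaf pairs whose (noisy) Jaccard index is high would then be kept as candidate redescriptions.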
In the following sections we provide a detailed description of all three main parts of differentially private redescription creation.

Alternation-Based Algorithms
In this section, we describe two differentially private redescription mining algorithms: the first, AltExpM, uses individually created differentially private decision trees based on the exponential mechanism (Friedman and Schuster, 2010), and the second, AltMCMC, embeds the sampling into the construction of the decision trees (Bai et al, 2017). We do not consider differentially private decision trees using noisy counts, since prior work shows that these are outperformed by trees constructed using the exponential mechanism (Friedman and Schuster, 2010).
The algorithms presented in this section follow the well-known alternation scheme used in the CARTwheels algorithm (Ramakrishnan et al, 2004). In this scheme, the initial tree is created on one table (usually D_L) using a set of initial target labels (derived from the data). Consecutive steps use information about entity membership in the leaves of the tree from the previous iteration to create target labels which allow the construction of a matching tree on the other table (D_R). Redescriptions are formed based on the paths from the root of one tree to the root of the other tree. Since decision tree construction cannot match the created target labels exactly, each alternation creates slightly different trees, allowing the creation of new redescriptions.
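The skeleton of this alternation can be sketched as follows; `fit_tree` stands in for any (differentially private) tree learner, and all names here are our own illustration rather than the paper's code.

```python
# Sketch of the CARTwheels-style alternation: leaf memberships of the tree
# fitted on one view become the target labels for fitting a tree on the
# other view, and so on.

def leaf_labels(tree, rows):
    """Label each row with the id of the leaf it falls into."""
    return [tree.leaf_id(row) for row in rows]

def alternate(view_l, view_r, init_labels, fit_tree, n_iter):
    trees, labels = [], init_labels
    views = (view_l, view_r)
    for i in range(n_iter):
        view = views[i % 2]               # switch sides every iteration
        tree = fit_tree(view, labels)     # fit a tree against current targets
        labels = leaf_labels(tree, view)  # its leaves label the next tree
        trees.append(tree)
    return trees
```

Consecutive trees in the returned sequence form the tree-pairs from which redescriptions are read off.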
We first present the general framework before explaining how individual trees are built.
Algorithm 1 iterates for InTr iterations (line 3); at each iteration, it chooses a random attribute and creates an initial target variable using the values of the selected attribute (line 4). The target variable creation step is performed entirely on the secured server (the one hosting the data), thus it does not require protection (the end-user does not have access to this data). This operation is conceptually similar to the Partition operation, commonly used in differential privacy, which groups the data on the server based on some criterion (usually the value of a target label). Since this grouping is performed entirely on the secured server, it does not reveal any information to the end-user. Then, the secured server returns the corresponding noisy counts of the number of entities contained in these groups to the end-user (thus privacy is preserved).
Once an initial target variable is constructed, it is used either to construct a differentially private decision tree using the exponential mechanism (see Section 4.3.2) or to perform decision tree construction by sampling the space of decision trees (see Section 4.3.2). The resulting tree is differentially private and returned to the client. Which protection mechanism is chosen depends on the choice of the differentially private decision tree algorithm used during the process of differentially private redescription creation (lines 6 and 8).
At each consecutive step of the loop starting in line 10, the successive targets are created directly from the leaves of the tree constructed in the previous iteration (lines 11 to 17). Each leaf represents one class and entities are assigned a class label based on their membership in a leaf; the labels in turn are used to create a new tree using attributes from the other view. The main idea is to obtain pairs of trees having very similar or identical sets of entities in their leaves.
Target creation is again performed entirely on the secured server and the corresponding data is accessed by the differentially private algorithms using the exponential mechanism or by sampling. Bai et al (2017) have proven that MCMC sampling, upon convergence, produces trees as would be selected by applying the exponential mechanism to the space of trees of a predefined depth. The obtained pairs of trees (without information about their node sizes) are used to create redescriptions in a differentially private manner, using noisy counts, obtained by making two passes over the dataset, to compute noisy redescription support, accuracy and significance (line 19; see also Section 4.5).
To summarize, after creating a target variable (class attribute, see Section 4.2) in the first of the RMIter iterations, redescriptions are obtained by iteratively creating pairs of differentially private decision trees with matching leaves and then combining their leaves into redescriptions in the differentially private manner described in Section 4.5.

Types of Differentially Private Decision Trees
We use two different types of differentially private decision trees: differentially private decision trees using the exponential mechanism (Friedman and Schuster, 2010), presented in Algorithm 2, and differentially private decision trees obtained using MCMC sampling (Bai et al, 2017), presented in Algorithm 3.
Algorithm 2 first computes the available budget for every level of the tree (line 2). It uses the exponential mechanism (line 7) to choose the split point out of all possible split points (attribute-value pairs) (see Friedman and Schuster, 2010). Next, it partitions the dataset based on the values of the split point attribute using the Partition operation (line 8) and then recursively calls itself to create the next level of the tree (line 9). Finally, it connects the subtrees into the final tree and returns it.
At each level, the calls to the exponential mechanism are ε/d differentially private. Applying the exponential mechanism on siblings does not necessitate increasing the privacy level, thanks to the property of parallel composition (McSherry, 2009). The Partition function does not spend any budget, as no result is returned to the user. The function calls itself recursively at most d times, thus, by the property of sequential composition (McSherry, 2009), Algorithm 2 is d · ε/d = ε differentially private. The parameter q in the call of the exponential mechanism in line 7 represents the Gini index, which has sensitivity 2 (Friedman and Schuster, 2010); the larger sensitivity is taken care of in the exponential mechanism calculations. The exponential mechanism is executed on a secured server by calculating all splitting choices and returning a random solution sampled from a distribution depending on the desired privacy level and the corresponding score function.
Algorithm 2 is based on the algorithm developed by Friedman and Schuster (2010) with the following modifications: a) computations used to determine the noisy counts of entities at each node of a tree are removed and b) trees are built until a predefined depth is reached. The main reason for these modifications is to save budget, and they are justified by the fact that redescription mining algorithms usually use shallow trees (depth ≤ 8). For various reasons (budget savings, reduction of false positive redescriptions), this depth is even smaller in differentially private redescription mining algorithms. This allows using ε/d of the budget for the exponential mechanism. Noisy count computations are not required because target variables in the alternation process are created on the secured server (see Algorithm 1) and pairs of differentially private trees without information on target value distribution in their nodes are sufficient to compute differentially private redescriptions (see Section 4.5).

Algorithm 2 createDTExpMech: Create differentially private decision trees using the exponential mechanism
Input: Dataset D[ser], maximum tree depth d[cl], privacy budget ε[cl], level l[cl]
Output: Tree T[cl]
…
8: D_i[ser] ← Partition[ser](D[ser], val(A)[ser]) for all i ∈ val(A)[ser]
9: T_i ← createDTExpMech(D_i[ser], …, (l + 1)[cl]) for all i ∈ val(A)[cl]
10: root(T)[cl] = A[cl]; subtrees(T)[cl] = {T_i : i ∈ val(A)}
…
The second approach uses a deep embedding of differential privacy into the decision tree construction (Algorithm 3). The embedding is achieved by sampling the tree space using Markov Chain Monte Carlo (MCMC) (Hastings, 1970). This process has been shown to generate, upon convergence, trees as would be obtained if selected by the exponential mechanism (Bai et al, 2017). The advantage of this approach compared to Algorithm 2 is that it does not have to divide the available budget between the levels of a tree. Instead, it can use the whole budget for sampling. The construction of a differentially private object of this complexity cannot be achieved easily by applying the exponential mechanism once to the vast space of all trees of depth d.
Algorithm 3 is a version of the approach presented by Bai et al (2017), modified by removing the noisy count computation. This allows using the whole available budget ε for tree sampling. The reason why noisy counts are not required is the same as for createDTExpMech: the initial targets are created on the secure server (see Algorithm 1) and noisy counts are not required as input to the construction of differentially private redescriptions (see Section 4.5).
The work of Bai et al (2017) demonstrates that trees sampled using MCMC outperform trees based on the selection of split points using the exponential mechanism, which strongly motivates the use of the former method in an algorithm for differentially private redescription creation.
The score function for the tree construction, (3), is as defined by Bai et al (2017); there, C(n) denotes the set of children of the tree node n, while τ, τ_n, and τ_{n,c} denote, respectively, the total number of entities in the dataset, the number of entities in n, and the number of entities of class c in n. N_l denotes the set of leaf nodes, whereas N_i denotes the set of internal (non-leaf) nodes.
The quality of a tree, g(T), is computed as the quality of its root node. Algorithm 3 is completely performed on the secured server. It first generates a completely random tree of depth d (line 1); then it iterates for up to MCIter iterations: it selects a random node of the tree T and computes the score of a newly constructed tree where all other nodes are equal to the nodes of T but the selected node is replaced with a node corresponding to a randomly chosen split point (lines 5 to 7). The quality of the new tree is compared to that of the old one in a similar fashion as in the exponential mechanism, and if the change is accepted, the old tree is replaced (line 10). If changes in the tree score during the iterations drop below a predefined threshold (determined by the function stabilizedVar), the iterations are terminated and the resulting tree is returned to the client.

Algorithm 3 createDTMCMC: Create differentially private decision trees using decision tree sampling with MCMC
Input: Dataset D[ser], maximum tree depth d[cl], privacy budget ε[cl], maximum number of MCMC iterations MCIter[cl], MCMC iteration termination variance threshold σ[cl]
Output: Tree T[cl]
1: T[ser] ← generateRandomTree[ser](d[cl])
2: i[ser] ← 0; S[ser] ← ()
3: while i[ser] < MCIter[cl] and not stabilizedVar[ser](S[ser], σ[cl]) do
4:   S[ser] ← S[ser].append[ser](g(T)[ser])
5:   n[ser] ← randomNode[ser](N_i(T)[ser])
6:   n′[ser] ← randomSplit[ser](randomAttr[ser](D[ser]))
7:   T′[ser] ← replaceNode[ser](T[ser], n[ser], n′[ser])
8:   compute acceptance probability α[ser] from g(T′)[ser], g(T)[ser], and ε[cl]
9:   if random[ser]() < α[ser] then
10:    T[ser] ← T′[ser]
11:  end if
12:  i[ser] ← (i + 1)[ser]
13: end while
14: return T[cl]
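The sampling loop of Algorithm 3 follows the familiar Metropolis pattern. The sketch below is a generic, hedged rendition: the names `propose` and `score`, the stopping window, and the exact acceptance rule are assumptions based on the stated connection to the exponential mechanism, not the authors' code:

```python
import math
import random

def variance(xs):
    """Population variance of a list of floats."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def mcmc_sample(initial, propose, score, epsilon, sensitivity,
                max_iter=1000, var_threshold=1e-6, window=50):
    """Metropolis-style sampling: upon convergence, objects are drawn with
    probability proportional to exp(epsilon * score / (2 * sensitivity)),
    i.e. as if selected by the exponential mechanism."""
    current, scores = initial, []
    for _ in range(max_iter):
        scores.append(score(current))
        # stand-in for stabilizedVar: stop once the score stops changing
        if len(scores) >= window and variance(scores[-window:]) < var_threshold:
            break
        candidate = propose(current)
        alpha = min(1.0, math.exp(epsilon * (score(candidate) - score(current))
                                  / (2.0 * sensitivity)))
        if random.random() < alpha:
            current = candidate
    return current
```

With `propose` replacing a random tree node by a random split (lines 5 to 7 of Algorithm 3), this loop mirrors the accept/reject step in lines 8 to 11.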

The MCMCTreePair Algorithm
In this section, we present an algorithm whose main ingredient is MCMC sampling of tree-pairs instead of individual trees.
This algorithm uses the privacy budget more efficiently than the alternation-based algorithms. Assuming one has a budget ε_t to construct one pair of trees in a differentially private manner, applying the algorithm described here (which samples pairs of trees) allows using this whole budget ε_t to sample tree-pairs. Constructing a tree-pair by first creating single trees necessarily requires using ε_t/2 of the budget to create each tree (regardless of the technique used to obtain the individual trees). However, sampling pairs of trees is more complicated than sampling decision trees or split points, thus there is a trade-off between the budget spent and the complexity of the sampling.
The algorithm creates the initial target variable (line 3 in Algorithm 4) as described in Sections 4.2 and 4.3.1. Next, a random pair of decision trees of a predefined depth is created (line 4 in Algorithm 4, lines 1 and 2 in Algorithm 5). This pair is iteratively improved by selecting a random node in either tree and replacing it with a node created by making a split on a randomly selected split point (attribute-category or attribute-numerical value pair) on the data contained in this node (while-loop starting in line 4 in Algorithm 5). Newly gained branches, if any, are completed with random subtrees up to the predefined depth. As a result, we obtain a pair of decision trees, over the variables from the two data tables respectively, which are matched at the leaves through the entities they contain (line 5 of Algorithm 4).
Thus, the two important steps are: a) building pairs of matching trees (line 4 of Algorithm 4) and b) evaluating and combining queries corresponding to the branches of the trees to form redescriptions and compute their support statistics (line 5 of Algorithm 4). The main challenge we need to tackle is to perform these steps in a differentially private manner while still being able to produce high-quality redescriptions.
Specifically, in the first step we use an MCMC method similar to the one used in AltMCMC. However, we have to extend the method to simultaneously sample pairs of matching trees. We show that this process, upon convergence, produces a tree-pair as would be produced if the exponential mechanism were applied to the very large space of all possible pairs of trees of a fixed depth d. Thus, upon convergence, we may consider the produced pairs of trees to be differentially private.
In the second step we use noisy counts, i.e. apply the Laplace mechanism, in a targeted manner to compute the support statistics of the obtained redescriptions. This guarantees that the returned redescriptions are differentially private. The methodology allows balancing budget expenditures between the two steps.

Sampling a Pair of Trees
The MCMC-based procedure for sampling a pair of trees is described in Algorithm 5. It is an extension of the MCMC sampling procedure for trees developed by Bai et al (2017) and used in Algorithm 3. Beyond the extensions to the MCMC sampling iterations, the tree-pair sampling procedure required a novel tree-pair evaluation function. To this end, we created a normalized tree evaluation score (5), a modified version of the score from (3), and incorporated this modified score into the novel tree-pair evaluation function (6).
The procedure in Algorithm 5 starts by randomly generating trees T_L and T_R of depth d (lines 1 and 2). Function makeTarget is used to create a target variable from a given attribute or tree as explained above, and function randomSplitTree is used to build a tree of the chosen depth by selecting splits at random. Then, in each iteration, a node n is selected at random from the current tree-pair (line 6). A new node n′ is created by selecting at random an attribute from the same view as the attribute in n, and a random split value in its range (line 7).
The score for the tree-pair obtained by replacing node n with node n′ (line 8) is computed and compared to the score of the original tree-pair to decide whether to accept the replacement (line 9). Bad choices for replacement (e.g. internal nodes which affect several leaves negatively) will significantly reduce the score of the newly constructed tree, thus the change will be accepted with smaller probability.
If the replacement is accepted and the modified node is in T_L, the statistics of T_R must be updated to account for the modified target (line 13). Such node replacement attempts are performed for a predefined number of iterations (MCIter) or until the variance of the scores in the k previous iterations drops below a predefined threshold, as checked by function stabilizedVar (line 4). The last obtained tree-pair is returned (line 18).
Clearly, the crux of this procedure is the score function used to evaluate the quality of pairs of trees, score(T_L, T_R), and the replacement probability α that is derived from it. The replacement probability should be such that the procedure is able to traverse the whole space of tree-pairs and, upon convergence, samples a tree-pair (T_L, T_R) with the probability assigned to it by the exponential mechanism,

Pr[(T_L, T_R)] = exp(ε · score(T_L, T_R)/(2∆score)) / Σ_{(T′_L, T′_R) ∈ AllTreePairs_d} exp(ε · score(T′_L, T′_R)/(2∆score)),

where AllTreePairs_d is the set of all pairs of trees of depth at most d and ∆score is the sensitivity of the score function. Assuming the tree-pairs have uniform base probability, the above probability is clearly of the form of (2). We show that the sampling follows the above probability in Appendix A.
Given a decision tree T and target variable c, we measure the quality of the subtree rooted in node n as the quality of the associated split with respect to the target, using the recursive formula (5), where, as in (3), C(n) denotes the set of children of n, while τ, τ_n, and τ_{n,c} denote respectively the total number of entities in the dataset, the number of entities in n, and the number of entities of class c in n. Unlike Bai et al (2017) and (3), here we use a normalized quality measure that does not depend on the data size. A value of 1 indicates a pure split, i.e. a perfect match between the (sub)tree and the target. Computing g(root(T)) gives the quality of tree T. The score of a pair of trees, (6), is measured by combining their respective scores. The quality score of tree T_R is weighted by the quality score of tree T_L, since a fairly accurate tree T_L must be built first to ensure that the resulting tree-pair adequately models the initial target. Then, as tree T_L becomes more accurate, the overall score depends increasingly on the accuracy of tree T_R, forcing both trees to match and to model the initial target.
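Since the displayed formulas (5) and (6) are not reproduced here, the sketch below shows only one plausible shape of a normalized, recursively computed purity score in [0, 1], where a pure split scores 1; the nested-dict tree encoding and the exact weighting are assumptions, not the paper's formula:

```python
def tree_quality(node, target):
    """Hypothetical normalized purity in [0, 1]: a leaf scores the fraction
    of its entities in the majority target class; an internal node averages
    its children's scores weighted by their share of the node's entities.
    An illustration only, not the paper's exact score (5)."""
    if not node["children"]:  # leaf
        if not node["entities"]:
            return 1.0  # treat an empty leaf as pure
        counts = {}
        for e in node["entities"]:
            counts[target[e]] = counts.get(target[e], 0) + 1
        return max(counts.values()) / len(node["entities"])
    tau_n = len(node["entities"])
    return sum(len(c["entities"]) / tau_n * tree_quality(c, target)
               for c in node["children"])
```

Any score of this shape stays in [0, 1] regardless of the data size, which is the property the normalization is after.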
The proof of the above lemma is presented in Appendix B. Note that the score used by Bai et al (2017) has sensitivity 2 instead. Thus, our score reduces the level of noise needed in the differentially private mechanism.
To conclude, Algorithm 5 uses the whole initial budget ε for tree-pair sampling. The produced tree-pair, upon convergence of the MCMC sampling procedure, is as if produced by the exponential mechanism applied to the space of all tree-pairs of depth d and can thus be considered ε-differentially private. Sampling tree-pairs is much more complex than sampling individual trees. For this procedure to work, the underlying quality score must be normalized to the [0, 1] interval. Although this allows reducing the sensitivity, it also changes the acceptance probabilities of the exponential mechanism (random choices can be accepted with substantially larger probability compared to the non-normalized version).

Generating Redescriptions
Having obtained a pair of matching trees, the next step consists in generating redescriptions from it by evaluating and combining queries corresponding to the branches of the trees. In particular, doing this in a differentially private way requires obtaining noisy counts of the entities shared by the leaves from either tree, from which we can then compute all necessary support statistics. A naïve way to compute the support cardinalities would be to use the Laplace mechanism to obtain the size of the support |supp(q)| for both conjunctive queries of every simple redescription, and then use the mechanism again for every attempted combination of these queries via disjunctions or negations. This, however, would use too much of the privacy budget, and we show how we can compute the required counts at a significantly lower privacy budget.

Algorithm 6 extractReds: Extract redescriptions from decision trees
Input: Dataset with two views (D_L, D_R), privacy budget ε[cl], additional redescription constraints Γ[cl]
Output: Set of redescriptions P[cl]
…
4: CntR[cl] ← computeNRLeafCards[cl](CntInt[cl])
5: dSize ← computeDSize(CntL[cl])
…
7: SReds[cl] ← createSimpleReds[cl](SQ_L, SQ_R, Γ[cl], CntInt[cl], dSize)
8: for Rs ∈ SReds do
9:   for i from 1 to maxClauses do
…
11:     if J(qsL ∨ p, qsR) > J(Rs) and sat((qsL ∨ p, qsR), dSize, Γ[cl]) then
12:       Rs ← (qsL ∨ p, qsR)
…
19:   if sat(Rs, dSize, Γ[cl]) then
20:     P ← P ∪ {Rs}
21:   end if
22: end for
23: return P[cl]
Recall that the leaves of a given tree contain disjoint sets of entities. All necessary information to evaluate redescriptions generated from the pair of trees T_L and T_R, with sets of leaf nodes N_l(T_L) and N_l(T_R) respectively, can be computed using two passes over the data: first computing |supp(n_L) ∩ supp(n_R)| for all (n_L, n_R) ∈ N_l(T_L) × N_l(T_R) in the first pass over the dataset (line 2 in Algorithm 6), and then computing |supp(n_L)| for all n_L ∈ N_l(T_L) in the second pass (line 3). As the entity sets supp(n_L) ∩ supp(n_R) are disjoint over the leaves, learning the (noisy) cardinality of one of them reveals nothing about the rest of the data, and hence we can use the same ε for computing each of these cardinalities. The same holds true for computing |supp(n_L)|.
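The per-cell noisy counting can be sketched as follows. Because the leaf (and leaf-intersection) cells are disjoint, parallel composition lets the same ε cover every cell; the Laplace sampler is the standard inverse-CDF construction, not taken from the paper's code:

```python
import math
import random

def laplace_noise(scale):
    """Sample Laplace(0, scale) by inverse CDF: X = -b * sgn(u) * ln(1 - 2|u|)."""
    u = random.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(max(1.0 - 2.0 * abs(u), 1e-300))

def noisy_cell_counts(cell_counts, epsilon):
    """Add Laplace(1/epsilon) noise to each count of a disjoint partition.
    Counting queries have sensitivity 1; by parallel composition the whole
    table of disjoint cells is epsilon-differentially private."""
    return {cell: n + laplace_noise(1.0 / epsilon)
            for cell, n in cell_counts.items()}
```

One such call with budget ε/2 would cover the whole first pass (all leaf-pair intersections), and another with ε/2 the second pass (all left-tree leaves).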
Using the computed noisy cardinalities, we can also compute the noisy cardinalities of the leaves of T_R and the noisy size of the dataset; these procedures are used in lines 4 and 5 of Algorithm 6. Furthermore, for any I_L ⊆ N_l(T_L) and I_R ⊆ N_l(T_R), the noisy support counts of the corresponding combinations of branches can be derived from the leaf-level counts; the formulae to compute these noisy counts are used inside the inner for-loop of Algorithm 6.
Based on these differentially private values, we can compute the support sizes of the simple queries corresponding to single branches from the trees, as well as of the extended queries that combine different branches, or their negations, from either tree. We can also compute the accuracy and the p-value of every query pair, as this only requires knowledge of the support sizes of the queries, their intersection, and the size of the data.
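Under the assumption that, without missing values, the leaf cells partition the data, the supports and the Jaccard index of branch combinations can be derived from the noisy intersection table alone. This sketch, with hypothetical names and table layout, illustrates the arithmetic:

```python
def combined_jaccard(cnt_int, sel_left, sel_right):
    """Jaccard index of the query pair formed by the disjunction of left
    branches sel_left and right branches sel_right, computed from the
    (noisy) table cnt_int[l][r] = |supp(n_l) ∩ supp(n_r)|.  The cells are
    disjoint, so supports of branch disjunctions are plain sums over cells."""
    left_leaves = list(cnt_int)
    right_leaves = list(next(iter(cnt_int.values())))
    inter = sum(cnt_int[l][r] for l in sel_left for r in sel_right)
    supp_left = sum(cnt_int[l][r] for l in sel_left for r in right_leaves)
    supp_right = sum(cnt_int[l][r] for l in left_leaves for r in sel_right)
    union = supp_left + supp_right - inter
    return inter / union if union > 0 else 0.0
```

No further budget is spent here: every attempted disjunction or negation reuses the same noisy table.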
Simple redescriptions, generated by function createSimpleReds, consist of query-pairs (q_L, q_R) obtained from (n_L, n_R) ∈ N_l(T_L) × N_l(T_R) (line 6 in Algorithm 6), where q_X denotes the simple query associated with n_X, or its negation. These simple redescriptions are iteratively extended by at most three other simple queries (or their negations) obtained from each view, using the disjunction operator (the use of disjunctions and their properties are described by Mihelčić et al (2018)). An extension is accepted if it increases the Jaccard index compared to the simple redescription (checked by function evaluate) and if it satisfies the conditions from Γ[cl] (checked by function sat). Extended redescriptions are often more accurate than simple redescriptions, and they also have larger support set sizes than the corresponding simple redescriptions. All redescriptions satisfying the constraints from Γ[cl] are added to the redescription set (line 20) that is returned to the user.
When the data contains missing values, the supports of the leaves need not cover the whole dataset, because entities for which the value of the split attribute(s) is missing cannot be sorted into the leaves. In such cases, we propose to estimate the support of negated queries with the heuristic count |supp(¬q_m)| ≈ Σ_{l ∈ I_L \ {m}} |supp(n_l)|. Detailed justification for this choice is provided in Appendix C. Procedures computeNLeafInterCards and computeNLLeafCards from Algorithm 6 are ε/2 differentially private, since both procedures apply the Laplace mechanism with budget ε/2; thus, Algorithm 6 is ε/2 + ε/2 = ε differentially private.
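The missing-value heuristic amounts to summing the (noisy) supports of the remaining leaves; a one-function sketch, with a hypothetical dictionary of leaf cardinalities:

```python
def negated_support_estimate(noisy_leaf_cards, m):
    """|supp(¬q_m)| ≈ Σ_{l ≠ m} |supp(n_l)|: entities with missing split
    values fall into no leaf, so the negation is estimated from the other
    leaves rather than as (data size) − |supp(q_m)|."""
    return sum(card for leaf, card in noisy_leaf_cards.items() if leaf != m)
```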

Redescription Pruning Strategy
The redescription mining process on data where differential privacy must be preserved is significantly harder than the regular task. Redescription mining algorithms have traditionally had a number of parameters that require tuning, usually through experimentation. However, in a setting where differential privacy needs to be preserved, such experimentation is not an option. Thus, the constraints applied to filter redescriptions during the mining process need to be loose, so that usable redescriptions can be obtained regardless of the data characteristics. This necessitates using a relatively low noisy Jaccard threshold, as setting this parameter to very high values might remove many useful redescriptions and does not guarantee removing inaccurate artefact redescriptions. The support of redescriptions is an important quality measure. In the differentially private setting, we do not know the exact size of the data, and hence cannot set the minimum support threshold exactly. In our experiments, we set a coarse minimum support threshold (see Appendix E.3). The noisy support calculation has the biggest relative effect on redescriptions with small support, and it also has a noticeable impact on the accuracy of such redescriptions. As the real support of redescriptions increases, the noisy support size and accuracy of redescriptions come relatively closer to the real values.
To keep the mining process general but also allow obtaining usable sets of redescriptions, we apply a redescription pruning strategy after the redescription mining step. The aim of this post-processing step is to use the information about noisy support and noisy accuracy to determine a noisy support threshold that will be used to filter out redescriptions that are probably artefacts of the introduced noise. The easiest way to determine such a threshold is to study a scatter plot of noisy support vs. noisy redescription accuracy. In a differentially private setting, there will always be a very dense block in the upper left corner of such a scatter plot (low noisy support, very high accuracy). Redescriptions making up such blocks should be pruned, as most of them are inaccurate.
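Once a threshold has been read off the scatter plot, the pruning itself is a plain filter on noisy support; a sketch, where the tuple layout is an assumption:

```python
def prune_redescriptions(reds, min_noisy_support):
    """Post-processing filter: separate redescriptions whose noisy support is
    at least the threshold from those below it (likely noise artefacts).
    reds: list of (noisy_support, noisy_jaccard) pairs."""
    kept = [r for r in reds if r[0] >= min_noisy_support]
    pruned = [r for r in reds if r[0] < min_noisy_support]
    return kept, pruned
```

Note that the filter reads only already-published noisy values, so it spends no additional privacy budget.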
Notice that this step can be applied only once the redescription mining step is complete. Although this step drastically reduces the number of noise-induced redescriptions, it is still possible to obtain redescriptions with strongly overestimated accuracy or insignificant redescriptions. However, as we demonstrate in Section 5, the redescription sets obtained after pruning are usable in practice.

The Setup
All presented algorithms are implemented in Java.1 The experiments were conducted on a server with two 64-core AMD EPYC 2 processors at 2 GHz and with 1 TB of main memory. All experiments used less than 25 GB of main memory.

Datasets
We conducted experiments on seven real-world datasets based on four different types of data. The datasets are selected to cover a variety of data sizes and types, with an emphasis on the kind of data where confidential access would be important. CMS is a family of datasets based on synthetic health insurance claims; NPAS is based on online psychological surveys; MAMMAL contains mammal occurrences and climate information; and MIMIC-III is based on hospital records. CMS, NPAS, and MIMIC-III are examples of sensitive data with different properties, while MAMMAL is a standard dataset in the redescription mining literature. All datasets are publicly available. Basic properties of the datasets are presented in Table 1; further information is available in Appendix E.1.

Baseline Methods
We compared all three differentially private algorithms to two existing general (not privacy-preserving) tree-based redescription mining algorithms, SplitT and LayeredT (Zinchenko et al, 2015; Galbrun and Miettinen, 2018a), using the implementations from the python-clired package.2 We used only tree-based methods for the comparison to be fair; the purpose of these experiments is to show the effects of differential privacy, not to compare different types of redescriptions.
Parameters Setting the parameters for differentially private algorithms requires some care. Unlike in a normal setting, with differential privacy the user cannot tune the parameters. We provide the parameters used in the experiments, as well as some general guidelines on how to set the parameters when the data is not accessible, in Appendix E.3.

Evaluation Setup
We conducted five types of experiments. First, we studied the convergence properties of the MCMC simulations used to create tree-pairs by the MCMCTreePair algorithm (see Appendix F). This is studied because sampling tree-pairs in this manner has not been previously reported in the literature. Second, we studied the differences between the quality of the redescriptions as reported by the differentially private algorithms (i.e. the noisy Jaccard and p-value) and the true values. Here, we used InTr = 1 to have the largest budget for tree creation in AltExpM and AltMCMC. Noise is required to preserve privacy, but it obviously also means that the results the user obtains will not be entirely accurate. The aim of this experiment was to assess the error introduced by adding noise.
In the next experiment, we compared the redescriptions found with differentially private algorithms to redescriptions obtained by the standard (non-differentially private) tree-based algorithms SplitT and LayeredT. This experiment measured the overall quality of differentially private redescriptions against standard redescriptions. Then, we compared the performance of differentially private algorithms among 10 runs with budget 1.0 each (we perform repeated runs to measure variation between executions), 10 differentially private runs with budget 0.1 each, and 100 differentially private runs with budget 0.01 each. We compared differentially private algorithms based on whether any redescriptions were obtained in a run, the percentage of significant redescriptions discovered, the distribution of true Jaccard index values among the obtained redescriptions, and the distribution of the absolute difference between the true and reported (noisy) Jaccard index values of the redescriptions. Finally, we tested the scalability of the proposed differentially private algorithms.

Effects of Differential Privacy on Redescription Quality
The effects of the added noise on the reported Jaccard indices for the three developed algorithms can be observed in the scatter plots in Figures 2 and 3. Here, the y-axis shows the differentially private Jaccard indices reported by the differentially private algorithms, while the x-axis shows the true Jaccard indices computed with full access to the data.
Because of the noisy counting, the redescriptions with low (noisy) support can have very misleading quality measures. Hence, we recommend that the user prunes out all redescriptions with reported support below a pre-defined threshold (see Section 4.6 for an explanation of the pruning strategy and Appendix E.3 for the thresholds we used). The redescriptions pruned this way are depicted as light grey diamonds. The redescriptions not pruned are the larger blue dots. The colour of the dot indicates whether the true (i.e. not noisy) p-value is above 0.01 (dark blue) or below it (light blue). The user, with no access to the true data, can separate the diamonds from the dots, but cannot separate the dark from the light blue dots. We can see that for all approaches, the scatter plot slowly turns into a straight diagonal line as the data size increases. This is due to the fact that noisy counts become more accurate as more data is available. Redescription accuracy on the NPAS dataset is overestimated for all approaches, which is expected for such a small dataset. As can be seen from Table 2, MCMCTreePair has the largest and most significant correlation between the noisy and true Jaccard indices on the NPAS and MAMMAL datasets, AltExpM has the highest correlation on the MIMIC-III and CMS-2-16 datasets, and AltMCMC on the CMS-2-1, CMS-2-4, and CMS-2-8 datasets. MCMCTreePair produced a smaller number of significant redescriptions on MAMMAL and on the larger datasets.

A very important observation is that the pruning strategy works. The grey diamonds to the left of the scatter plots correspond to false positives, that is, to redescriptions that differentially private algorithms return as valid redescriptions, but that have very low true Jaccard index values. On the other hand, there are very few diamonds towards the right, indicating that there are very few false negatives, that is, good redescriptions that are unnecessarily pruned. As the user does not have access to the true data and cannot compute the true Jaccard indices, it is of vital importance that this filtering works.
The first three rows of plots in Figures 4 and 5 show the distribution of the true Jaccard indices, with the number of redescriptions having true Jaccard index essentially zero indicated in the top-left corner. These plots make it clearer that the pruned redescriptions (light grey) are those with lower true Jaccard index values, while the non-pruned ones have higher Jaccard index values.

Comparison to Other Algorithms
The comparison of MCMCTreePair, AltExpM, and AltMCMC to SplitT and LayeredT can be made by comparing the first three rows of plots of Figures 4 and 5 to the last row. To make a fair comparison, we use only redescriptions with high enough support (non-filtered) for the differentially private algorithms. Filtering for SplitT and LayeredT was performed by removing all redescriptions with true support smaller than the minimum support of any redescription retained for the MCMCTreePair algorithm.
As can be seen from Figures 4 and 5, differentially private algorithms mostly return a smaller number of redescriptions with the highest accuracy (Jaccard index in the interval [0.8, 1.0]), and MCMCTreePair returns fewer redescriptions overall (as noted earlier). One reason why differentially private algorithms might find fewer highly accurate redescriptions, besides the obvious reason related to the introduced noise, is the maximum support constraint, which prevents the algorithms from creating redescriptions with (noisy) support larger than 80 % of the dataset size. This bound is meant to prevent the creation of tautological statements and redescriptions describing almost all entities contained in the dataset. While the same bound is used in non-private algorithms, the added noise can lead the private algorithms to erroneously prune some redescriptions. These experiments demonstrate that when the proposed differentially private algorithms report good redescriptions with high quality and high-enough support, the user can be rather confident that she will find good results also from the true data, should she get a permission to study it.

Evaluation of the Properties of the Different Algorithms
Here we provide an in-depth comparative analysis of the different algorithms. We will see that AltMCMC and AltExpM perform very well and significantly outperform MCMCTreePair on medium-sized and large datasets when a very large budget is available, or on large datasets with a smaller budget. The MCMCTreePair algorithm has the best performance on small datasets, or when a small budget is available on small and medium-sized datasets. The performance of AltMCMC and AltExpM increases with the increase of the data size.
The AltExpM and AltMCMC algorithms have a higher dependence on the initialization procedure than MCMCTreePair. When used in such a way that only one initial attribute is selected to start the alternations (InTr = 1, the setting giving the largest budget for tree construction), a poor choice of the initial attribute might produce an extremely bad or even empty redescription set. Since an analysis of the distribution of the attribute values is required to reduce the risk that this happens, these algorithms require either significantly pre-processed data (rarely available in real-world scenarios) or some more involved strategy to choose the initial attribute. Our experimental results show that such critical failure of AltExpM and AltMCMC mostly happens with a probability less than 0.1 (for more details, see Appendix F).
In the absence of a good strategy to choose the initial attribute, one solution is to randomly select a larger number of initial attributes and reduce the number of alternations. Thus, we also test the performance of these two algorithms in a setting at the opposite end of the spectrum, where InTr = k and RMIter = 1, to maximize the chances of choosing one good initial attribute, while keeping the tree-construction budget similar to the previous setting. In this setting, k different initial attributes are chosen and for each one a pair of trees is created from which redescriptions are generated. The probability that algorithms with this parameter setting will choose a good initial candidate is now equal to that of MCMCTreePair (thus we call this parameter setting stable with respect to the choice of initial attributes). AltMCMC and AltExpM allow different trade-offs between the InTr and RMIter parameters; further study of these trade-offs is left for future work.
The statistical significance of differences in the pooled distributions between approaches is measured using the Mann-Whitney U test (Mann and Whitney, 1947). As can be seen from Table 3, when an initial budget of 1.0 is used, the AltMCMC and AltExpM algorithms have the largest percentage of significant redescriptions on all datasets. This changes when 10 runs with budget 0.1 are used, where MCMCTreePair has the best results on NPAS, MAMMAL, MIMIC-III, and CMS-2-1, but AltMCMC and AltExpM produce the largest percentage of significant redescriptions on CMS-2-4, CMS-2-8, and CMS-2-16. When 100 runs with an initial budget of 0.01 are used, the AltExpM algorithm produces the highest percentage of significant redescriptions out of all filtered redescriptions on all datasets.
A better insight into the overall performance of the presented methodologies can be obtained from Table 4. The methodology with the highest median true accuracy of filtered redescriptions in a pooled set is compared to all other approaches. The pooled sets contain redescriptions produced in all runs by a given algorithm with a chosen initial budget.
These results demonstrate that MCMCTreePair outperforms all other methodologies for some combinations of dataset and initial budget. Table 3 reports the percentage of significant redescriptions (out of all remaining redescriptions after filtering) produced by each approach on the 7 datasets. Together, Tables 3 and 4 show that every presented approach has its merits and can be effectively used under different circumstances. The AltMCMC and AltExpM algorithms are best used with high initial budgets and on medium-sized and large datasets, whereas the MCMCTreePair algorithm should be used on smaller datasets and with smaller initial budgets. The stable parameter setting often significantly reduces the accuracy of redescriptions produced by the AltMCMC and AltExpM algorithms, but it averts the undesirable outcome where no redescriptions are produced.
Table 5 shows the statistical significance of the absolute difference between noisy and true redescription accuracy. It demonstrates that MCMCTreePair has the lowest difference in the majority of cases on the NPAS, MAMMAL, MIMIC-III, CMS-2-1, and CMS-2-4 datasets, whereas AltMCMC and, in most cases, AltExpM have the lowest difference on the larger CMS-2-8 and CMS-2-16 datasets.
The distributions of redescription accuracy and absolute difference between noisy and real redescription accuracy corresponding to the results presented in Tables 4 and 5 are provided in Appendix F.

Scalability
To test scalability, we ran all proposed differentially private algorithms on the CMS-2-k datasets for k = {1, 2, 4, 8, 16}. We ran each algorithm 10 times on each dataset. AltMCMC and AltExpM were run with both tested parameter settings (InTr = 1, RMIter = k and InTr = k, RMIter = 1). We report the total CPU time in seconds required to perform redescription mining in Figure 6. The time required to load the input data is excluded to eliminate the effects of I/O and caching.
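The timing protocol can be sketched as follows; `load_data` and `run_mining` are hypothetical stand-ins for the actual I/O and algorithm entry points. The timer brackets only the mining call, so data loading and caching stay outside the measured interval, and `time.process_time` reports CPU time rather than wall-clock time.

```python
import time

def timed_run(load_data, run_mining):
    """Measure CPU time of the mining phase only, excluding data loading.

    `load_data` and `run_mining` are illustrative stand-ins for the
    actual I/O routine and mining algorithm.
    """
    data = load_data()               # I/O and caching effects excluded
    start = time.process_time()      # CPU time, not wall-clock time
    result = run_mining(data)
    cpu_seconds = time.process_time() - start
    return result, cpu_seconds
```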
The algorithms scaled relatively well with data size (even single-threaded) and never used more than 25 GB of main memory. Our experiments show that stabilization increases the execution times of the algorithms. This is a consequence of the fact that a larger number of trees need to be built and that the initial binning of an attribute needs to be repeated multiple times. AltMCMC has the lowest execution times, whereas the stabilized version of AltExpM has the largest. The variation in execution times increases for all approaches as data size grows. The increase is, expectedly, much larger for the approaches based on sampling, since the involved random choices of attributes can drastically change the structure of the tree or tree-pair being built, potentially leading to rebuilding larger parts of these trees (see Appendix F for more details).
As can be seen from Figure 6, all algorithms finish in less than 10 hours on the largest dataset, containing 115 200 entities.

Discussion
The performed experimental evaluation shows that the presented differentially private redescription mining algorithms discover useful redescriptions on datasets of various sizes. Due to the introduced noise, all presented differentially private approaches produce a number of redescriptions with very inaccurate estimates of the Jaccard index and p-value. Yet, as demonstrated, these can be detected and removed with a simple filter on the noisy support size. The differentially private redescription mining algorithms discover a comparable number of redescriptions to the non-differentially private approaches on medium-sized and large datasets, whereas this number is smaller on smaller datasets. The overall accuracy of the differentially private approaches is also reduced compared to non-differentially private approaches, due to the noise required to ensure the privacy of individual records.
The AltExpM and AltMCMC differentially private redescription mining algorithms perform very well on datasets of medium and large size if: a) the maximal budget is used to perform the experiment (thus only one experiment overall can be performed) and b) the data is carefully filtered and pre-processed to ensure successful initialization. With suitably adjusted parameters, requirement b) can be alleviated or completely removed at the cost of a significant drop in overall accuracy. With the adjusted parameters, AltExpM and AltMCMC perform very similarly to the MCMCTreePair algorithm.
If the dataset is relatively small, or several experiments must be performed on the data, the experimental results show increasing benefits from using the MCMCTreePair algorithm.

Conclusions
Differential privacy offers strong guarantees to protect the privacy of the data of every individual covered by a database. Unfortunately, these guarantees come at the cost of limited availability of information about the dataset and reduced accuracy of the obtained results.
Despite this, such systems allow performing exploratory data analysis on privacy-sensitive data with the aim of generating new hypotheses. Preliminary results can be obtained quickly, safely, and cost-effectively, which can inform the decision to request full access to the data (involving privacy agreements, potential data-use fees, etc.).
Pattern-based data mining methods, such as redescription mining, are especially sensitive to the noise required by differential privacy. Designing effective differentially private pattern mining algorithms is even harder than designing, say, differentially private classification methods. Nonetheless, our experiments show that the proposed differentially private redescription mining algorithms work: they can produce high-quality redescriptions, many of which are statistically significant.
The proposed methods cannot be used on arbitrary data without care, however. The input data might require pre-processing (by the data owner), algorithmic parameters need to be set, and, most importantly, noisy support filtering must be performed to achieve reasonably good performance.
Approaches based on the MCMC sampling of decision trees and the exponential mechanism (AltMCMC and AltExpM) have the potential of returning very accurate sets of redescriptions, with over 90 % of the reported redescriptions (after noisy support filtering) being in fact significant. However, they are highly dependent on the initialization strategy or data pre-processing (which entirely depends on the data provider). Stabilizing these approaches so that more initial attributes are chosen has a negative impact on the accuracy and significance of the produced redescriptions, as well as on the absolute difference between their noisy and true accuracies.

The MCMCTreePair algorithm produces overall fewer redescriptions, and these have lower accuracy compared to the results of AltExpM and AltMCMC when a budget of 1.0 is used. On the other hand, it is more stable, makes smaller errors in estimating true redescription accuracy, and is especially suited for use on smaller datasets or with smaller budgets, both common constraints in many real-world applications.
Tree-based algorithms are only one family of redescription mining algorithms. An interesting topic for future research is to study whether other types of algorithms could work equally well as, or even better than, the proposed tree-based differentially private redescription mining algorithms.

The algorithm accepts the MCMC transition from a pair of trees $(T_L, T_R)$ to a pair $(T_L', T_R')$ with probability $\min\{1, z'/z\}$, where $z = \exp\left(\frac{\varepsilon \cdot \mathit{score}(T_L, T_R)}{2 \cdot \Delta \mathit{score}}\right)$ and $z' = \exp\left(\frac{\varepsilon \cdot \mathit{score}(T_L', T_R')}{2 \cdot \Delta \mathit{score}}\right)$. This follows the idea described above; thus, upon convergence, the tree-pair $(T_L, T_R)$ is chosen with probability proportional to $\exp\left(\frac{\varepsilon \cdot \mathit{score}(T_L, T_R)}{2 \cdot \Delta \mathit{score}}\right)$.
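The acceptance rule above can be sketched as a standard Metropolis step. In this illustrative sketch (not the paper's implementation), `score_old` and `score_new` stand for the scores of the current and proposed tree pairs, and the sensitivity defaults to 1, consistent with Lemma 1; the ratio is formed in log space for numerical stability.

```python
import math
import random

def mcmc_accept(score_old, score_new, epsilon, sensitivity=1.0, rng=random):
    """Metropolis acceptance for the exponential mechanism.

    Accepts the proposed tree pair with probability min(1, z_new / z_old),
    where z = exp(epsilon * score / (2 * sensitivity)).
    """
    log_ratio = epsilon * (score_new - score_old) / (2.0 * sensitivity)
    # exp(min(0, log_ratio)) equals min(1, z_new / z_old).
    return rng.random() < math.exp(min(0.0, log_ratio))
```

Moves to higher-scoring tree pairs are always accepted; moves to lower-scoring pairs are accepted with a probability that shrinks as the budget ε grows.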

B Proof of Lemma 1
First, note that the sensitivity of counting queries is 1 (see Dwork and Roth, 2014). Recall from Equation 5 that, given a decision tree $T$ and a target variable $c$, the quality $g(n)$ of the subtree rooted in node $n$ is defined in terms of $C(n)$, the set of children of $n$, and of $\tau$, $\tau_n$, and $\tau_{n,c}$, which denote respectively the total number of entities in the dataset, the number of entities in $n$, and the number of entities of class $c$ in $n$. Since $\tau_{n,c} \ge 0$ for all $n$ and $c$, we have that $\sum_{c \in C} \tau_{n,c}^2 \le \left(\sum_{c \in C} \tau_{n,c}\right)^2$. Furthermore, since $0 \le \tau_n/\tau \le 1$ and $\sum_{n \in N_l(T)} \tau_n/\tau = 1$, the sum of the leaf scores is a convex sum and hence takes values between 0 and 1, so that $g(n) \in [0, 1]$ and, in particular, $g(\mathrm{root}(T)) = \sum_{n \in N_l(T)} g(n) \in [0, 1]$.
Therefore, the measure used to evaluate a pair of trees, defined in Equation 6, clearly also takes values in the unit interval. Thus, the sensitivity of $\mathit{score}(T_L, T_R)$ is at most 1.
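The bounds used in the proof can be sanity-checked numerically. The sketch below assumes a Gini-style leaf score of the form $(\tau_n/\tau) \cdot \sum_c (\tau_{n,c}/\tau_n)^2$, which is consistent with the bounds derived above but is an assumption about the exact form of Equation 5; the toy counts are illustrative.

```python
def leaf_score(class_counts, total):
    """Weighted Gini-style purity of a leaf:
    (tau_n / tau) * sum_c (tau_{n,c} / tau_n)^2, which lies in [0, tau_n/tau].
    """
    tau_n = sum(class_counts)
    return (tau_n / total) * sum((c / tau_n) ** 2 for c in class_counts)

# Toy tree with three leaves; class counts per leaf, 100 entities in total.
leaves = [[30, 10], [5, 25], [20, 10]]
total = sum(sum(leaf) for leaf in leaves)

# g(root) is a convex combination of leaf purities, hence lies in [0, 1].
tree_score = sum(leaf_score(leaf, total) for leaf in leaves)
```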

C Negated Queries in the Presence of Missing Values
When using heuristic noisy counts, $|\mathrm{supp}(\neg q_m)| \approx \sum_{l \in I_L \setminus \{m\}} |\mathrm{supp}(n_l)|$, every counted entity truly belongs to the support of the query; however, the counting process potentially leaves out some entities. Since the negations of leaf nodes can be highly overlapping, computing these counts exactly requires increasing the budget by a factor of $2^d$, which results in prohibitive additional noise. This is illustrated in Figure 7. Using the proposed heuristic, setting the Laplace parameter $\lambda = 160$ (regardless of the tree depth) results in a distribution where the noise is tightly concentrated around the origin. Using exact counting with tree depth 4, we have to set the Laplace parameter to $\lambda = 2720$, resulting in a much less concentrated distribution and ultimately significantly larger noise. Using exact counting with tree depth $d = 6$ and Laplace parameter $\lambda = 10\,400$ results in an extremely wide distribution, causing a prohibitive amount of noise.
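A minimal sketch of the heuristic count, assuming per-leaf support counts are available; `noisy_count` implements the Laplace mechanism (sampled as a difference of two exponentials), and the leaf-support dictionary layout is illustrative.

```python
import random

def noisy_count(true_count, lam, rng):
    """Laplace mechanism for a counting query: adds Laplace(0, lam) noise,
    sampled as the difference of two exponentials with mean lam."""
    return true_count + rng.expovariate(1.0 / lam) - rng.expovariate(1.0 / lam)

def heuristic_negation_support(leaf_supports, m, lam, rng):
    """Estimate |supp(not q_m)| as the sum of the noisy supports of all
    other leaves, instead of spending a 2^d-fold budget on exact counting."""
    return sum(noisy_count(s, lam, rng)
               for leaf, s in leaf_supports.items() if leaf != m)
```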
It is important to notice that the support size of the negated query depends on the sum of the support sizes of all other leaves of the tree (each of which carries added noise with a standard deviation of $\sqrt{2} \cdot \lambda$). Thus, the noise accumulates as $\sigma_{total} = \sum_{l \in I_L \setminus \{m\}} \sigma_l$, and we can assume the very conservative bound $\sigma_{total} \le (2^d - 1) \cdot \sqrt{2} \cdot \lambda$. The Laplace distribution used to produce noise is centred around 0 (which is the expected value), thus we can more realistically assume that the expected introduced noise will be closer to $\sigma_{total} \le ((2^d - 1) \cdot \sqrt{2} \cdot \lambda)/2$. Given $\lambda = 160$ (as above, when negations are computed heuristically), for $d = 4$ we have $\sigma_{total} \le 1810$ and for $d = 6$ it follows that $\sigma_{total} \le 7241$. Our simulation experiments show that, out of 1 000 000 samples of size $2^4 - 1 = 15$ generated from Laplace(0, 160), only 40 741 (4.1 %) had the absolute value of their sum larger than 1810, and only 65 had a sum larger than $\sqrt{2} \cdot 2720$. The same experiment for samples of size $2^6 - 1 = 63$ showed that only 89 samples (0.009 %) had the absolute value of their sum larger than 7241, of which none had the absolute value of their sum larger than $\sqrt{2} \cdot 10\,400$. Thus, we conclude that the use of the proposed heuristic solution is justified in differentially private redescription mining for small and medium-sized data and for data with a small or medium amount of missing values.
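The simulation justifying the heuristic can be reproduced in a few lines; this sketch uses fewer trials than the 1 000 000 reported above, for speed.

```python
import random

def tail_fraction(n_terms, lam, threshold, trials=20000, seed=1):
    """Fraction of trials in which the absolute sum of n_terms independent
    Laplace(0, lam) draws exceeds threshold; mirrors the simulation used to
    justify the heuristic negation counts."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Laplace(0, lam) sampled as a difference of two exponentials.
        s = sum(rng.expovariate(1.0 / lam) - rng.expovariate(1.0 / lam)
                for _ in range(n_terms))
        if abs(s) > threshold:
            hits += 1
    return hits / trials
```

With `n_terms = 15` and `lam = 160`, only a few per cent of the sums exceed 1810, in line with the 4.1 % reported above.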

D Asymptotic Time Complexity
We use $D$ to denote the input data, $\mathit{attrs}(D)$ the set of attributes of the input data, $S_p$ the set of splitting points (attribute-value pairs) in the data, $\mathit{MCIter}$ the number of MCMC iterations, and $\mathit{MCIter}_{algo}$ the number of MCMC iterations in which the root node of a tree has been changed during execution of the algorithm $algo$ ($0 \le \mathit{MCIter}_{algo} \le \mathit{MCIter}$). The time complexity of tree construction with the exponential mechanism is $O(|D| \cdot (|\mathit{attrs}(D)| + |S_p|))$. The first step is to compute all possible splitting points, which requires iterating over the whole dataset for all attributes. Tree construction, at each level, includes iterating over all splitting points which contain only non-redundant attribute values. For all levels except the root node, the splitting point is evaluated on a subset of the data $D' \subseteq D$. The maximum number of initial trials $\mathit{InTr}$ is at most the number of attributes in the data, whereas $\mathit{RMIter}$ can be an arbitrary number (although it is typically a constant); thus, the worst-case time complexities of the approaches are obtained by multiplying the respective tree or tree-pair construction costs by $\mathit{InTr} \cdot \mathit{RMIter}$, for AltExpM, AltMCMC, and MCMCTreePair respectively. In practice, the initial number of trials is typically set much smaller than the number of attributes, in order to spend the available budget in a rational manner.
As can be seen from these analyses, the AltExpM algorithm has the highest worst-case computational complexity. Relating the average time complexities is more difficult, since they depend on the actual data. However, for large data, it is safe to assume that $\mathit{MCIter}_{algo} \ll |S_p|$ and $(\mathit{MCIter}_{algo} + |\mathit{attrs}(D)|) \ll |S_p|$. The MCMCTreePair algorithm has worse computational complexity than AltMCMC, since attributes from both data tables need to be used at each trial. The worst case for the AltExpM algorithm occurs when all attributes have distinct values for all entities in the dataset; then, creating a root node requires computation quadratic in the number of entities in the dataset for each attribute of the current view. This is not required in the MCMC-based approaches, since split points are chosen at random at each MCMC iteration.

E.1 Dataset Properties and Pre-Processing
We prepared the CMS dataset(s) based on another publicly available dataset, namely the Data Entrepreneurs Synthetic Public Use Files (DE-SynPUF)³ released by the Centers for Medicare and Medicaid Services (CMS). It consists of realistic synthetic records of beneficiaries and claims information for three consecutive years (2008-2010). An entity in the dataset is a (synthetic) beneficiary. As the left-hand side table, we collected the personal information (birth year, sex, race, etc.) and diagnoses received by the beneficiaries over the three years. ICD-9 diagnosis codes were mapped to the second level of the categorization scheme from the Clinical Classifications Software.⁴ The right-hand side table contains the yearly total amounts in different reimbursement, deductible, and co-insurance components, relevant to inpatient, outpatient, carrier, and prescription drug event claims. As a result, one side of the data is mostly Boolean (indicators of different diagnoses) while the other is fully numeric. The original data is divided into 20 samples. We used Sample 2, which we split further into 16 segments based on the first digit of the hexadecimal identifier assigned to each beneficiary. Each segment contains roughly 7200 beneficiaries. We can construct datasets of different sizes by merging segments together. We denote by CMS-2-k the dataset containing the first k segments from Sample 2. For instance, CMS-2-4 contains beneficiaries from Sample 2 whose identifier starts with '0', '1', '2', or '3'.
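The segmentation scheme can be sketched as follows; the record layout (an `"id"` key holding the hexadecimal beneficiary identifier) is an assumption for illustration, not the actual DE-SynPUF schema.

```python
def segment_of(beneficiary_id):
    """Segment index (0-15) derived from the first hexadecimal digit of the
    (synthetic) beneficiary identifier."""
    return int(beneficiary_id[0], 16)

def cms_2_k(records, k):
    """Build the CMS-2-k dataset: keep beneficiaries from the first k
    segments, i.e. those whose identifier starts with one of the first k
    hexadecimal digits."""
    return [r for r in records if segment_of(r["id"]) < k]
```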
NPAS is a publicly available⁵ collection of answers to an online psychological assessment questionnaire. One side of the data contains background information about the respondents (age, education level, etc.) and the other their answers to the questions on a 5-point Likert scale. We removed all answers that contained missing values, since the SplitT and LayeredT baseline algorithms cannot handle them. NPAS is significantly smaller than the CMS datasets, and it also differs from the others in that it contains categorical variables. We use it in part to evaluate the impact of noise on smaller datasets.
The MIMIC-III dataset⁶ is a prime example of health data where patient privacy is extremely important. It contains de-identified health information about patients who were treated at the Beth Israel Deaconess Medical Center (Boston, MA, US). Using this data, we constructed a dataset containing two views: the diagnoses view and the lab events view. The aim when mining this dataset is to find out what kind of lab events preceded and followed various diagnoses in the studied patients.
The dataset contains 808 Boolean attributes in the diagnoses view (1 indicating a positive diagnosis) and 260 ternary attributes in the lab events view. The ternary attributes distinguish between normal lab test results (value 0), abnormal lab test results (value 1), and lab tests that were not performed (value −1) for a given patient. With 46 065 entities, it is a medium-sized dataset.
The MIMIC-III, CMS, and NPAS datasets are examples of datasets that contain potentially sensitive information. Our last dataset, MAMMAL, is a collection of records about the distribution of land mammal species (Mitchell-Jones et al, 1999) and climate (Hijmans et al, 2005) in Europe. This information is hardly sensitive, but we include this dataset as a point of reference, because it has been used in experiments in several redescription mining articles (see, e.g., Galbrun and Miettinen, 2012a,b; Zinchenko et al, 2015; Kalofolias et al, 2016; Mihelčić et al, 2017; Galbrun and Miettinen, 2018a).

E.2 Filtering NPAS Data
NPAS is a relatively small dataset (just 17 and 47 attributes for 1221 entities) containing attributes of mixed type in the first view and numeric attributes in the second view. In order to obtain any useful redescriptions despite the noise introduced by the differential privacy mechanisms, we need to make sure the input data is properly prepared. Since this step can be performed on the side of the data provider, pre-processing and filtering the data does not break any privacy constraints, and the resulting redescriptions remain differentially private.
As a first filtering step, we removed all attributes containing only one value. Such attributes are useless in every data mining task, but are especially harmful in a scenario where differential privacy needs to be preserved. Using such attributes not only wastes budget but also causes the creation of many redescriptions with empty support, which are neither detected nor removed due to the introduced noise.
As a second step, we merged some values of categorical attributes. Similarly to useless attributes, categorical values that occur very rarely can also cause the creation of many redescriptions of very poor quality. For a given attribute, we merged categorical values occurring in fewer than 50 entities. As a consequence, the new category, category_new, must be interpreted as category_old1 ∨ category_old2.
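The merging step can be sketched as follows, assuming the values of one categorical attribute are given as a list; the merged label is a stand-in for the actual category_old1 ∨ category_old2 naming.

```python
from collections import Counter

def merge_rare_categories(values, min_count=50, merged_label="merged"):
    """Replace categorical values occurring in fewer than min_count entities
    by one merged category, to be read as the disjunction of the old values."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else merged_label for v in values]
```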
Redescriptions with smaller support are more affected by the introduced noise. This is expected, since the level of noise introduced depends on the algorithm parameters and not on the data size. For this reason, it is substantially harder to obtain a good differentially private redescription set on datasets containing fewer than 2000 entities.

E.3 Parameters Used in Experiments
The maximum tree depth d mostly affects the complexity of the resulting redescriptions and depends on the application. We used a maximum depth of 4 for all experiments. This depth allows creating interpretable redescriptions (with queries containing ≤ 4 attributes) and keeps the level of noise required to obtain negated queries at reasonable levels. All algorithms use the Gini coefficient as the splitting criterion.
The minimum support of the redescriptions should be set small enough to find all interesting results, and depends somewhat on the data size. In the differentially private setting, where information about the data is unavailable, the preferable setting is ≥ 10 (since this will remove redescriptions with very small or very inaccurate support). If one knows that the used data is large (regardless of the real number of entities), this threshold can be increased to 100, which will remove an even larger number of redescriptions that are very sensitive to noise. The noisy counting also means that the Jaccard index computation can be very inaccurate for redescriptions with very small support. Hence, we mined redescriptions using the same relatively low minimum support for all algorithms (the only knowledge we used is whether the data is small or medium/large, which does not give out any sensitive information) and then pruned away those redescriptions returned by MCMCTreePair that had too small a (noisy) support. The minimum support for the redescriptions was 10 for MAMMAL and NPAS and 100 for MIMIC-III and CMS. In post-processing, we removed all redescriptions from MCMCTreePair that had noisy support smaller than 2000 for MIMIC-III, 1000 for CMS, 500 for MAMMAL, and 200 for NPAS. Obtaining a proper post-processing filtering threshold is explained in Section 4.6.
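The post-processing filter can be sketched as follows; the redescription record layout (a `"noisy_support"` key) is an assumption for illustration.

```python
# Dataset-specific thresholds on the noisy support, as used to post-process
# the MCMCTreePair output in the experiments.
NOISY_SUPPORT_THRESHOLD = {"MIMIC-III": 2000, "CMS": 1000,
                           "MAMMAL": 500, "NPAS": 200}

def filter_by_noisy_support(redescriptions, dataset):
    """Drop redescriptions whose noisy support falls below the
    dataset-specific threshold."""
    threshold = NOISY_SUPPORT_THRESHOLD[dataset]
    return [r for r in redescriptions if r["noisy_support"] >= threshold]
```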
All algorithmic parameters used to perform the experiments are listed in Table 6. AltMCMC and AltExpM have an additional parameter, RMIter. For these two algorithms we perform two experiments: the first uses InTr = 1 and RMIter equal to the number of initial trials from Table 6, and the second uses InTr as reported in Table 6 and RMIter = 1. The second setting ensures an equal probability of selecting a satisfactory initial attribute for all approaches.
One should use a high number of MCMC iterations (the MCIter parameter) to ensure that the obtained tree-pairs appropriately mimic the probability distribution of the exponential mechanism. We used 10 000 MCMC iterations, a relatively high number for shallow trees (depth ≤ 4). Determining the exact number of iterations for general trees requires knowledge about the depth of the tree, which depends on data size, and experimental evaluation (Bai et al, 2017), neither of which is possible in the differentially private setting. The MCMC iteration termination variance threshold σ terminates the MCMC iterations if the variance of the MCMC score in the previous k iterations is smaller than this threshold (there is little change in the score, thus we may consider the process to have converged).

F.1 Convergence Properties of MCMC Iterations Inside MCMCTreePair
The MCMC iterations inside the MCMCTreePair algorithm are repeated until they reach a predefined iteration threshold (100 000 in this experiment, 10 000 in all other experiments) or until the variance of the last 500 scores falls below a predefined variance threshold (0.0005 in this experiment, 0.005 in all other experiments, i.e. 0.05 % and 0.5 % of the tree-pair quality score range, respectively). If either the iteration or the variance threshold is reached, the MCMC iterations are terminated and the last obtained pair of trees is returned.
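The stopping rule can be sketched as follows, with the default thresholds used in most experiments; the function name is illustrative, not the paper's implementation.

```python
from statistics import pvariance

def should_terminate(scores, window=500, var_threshold=0.005, max_iter=10000):
    """Stopping rule for the MCMC iterations: stop after max_iter scores, or
    earlier once the variance of the last `window` scores drops below
    var_threshold (little change in score, so the chain is treated as
    converged)."""
    if len(scores) >= max_iter:
        return True
    return len(scores) >= window and pvariance(scores[-window:]) < var_threshold
```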
Here, we inspect the distribution of tree-pair scores achieved during the MCMC iterations with ε = 10 and ε = 0.0001. Intuitively, using a larger budget should result in more accurate tree-pairs. Indeed, this happens, but it manifests as a higher probability of obtaining a tree-pair with a higher score rather than as much larger scores on average. This occurs because the values of the used score lie in the unit interval, and the exponential mechanism gives fairly high probabilities even to tree-pairs of very low quality. As a result, during tree-pair construction good split points will generally be chosen, but bad choices will occasionally be incorporated as well. This results in an almost periodic behaviour of the quality score (see Figure 8).
Using a larger budget indeed increases the overall score of the tree-pairs. The histogram in Figure 8 (center) indicates slightly better behaviour for ε = 10, and indeed the average scores (± standard deviation) are 0.668 ± 0.060 for ε = 10 and 0.657 ± 0.058 for ε = 0.0001. Using a one-sided Mann-Whitney test, we determined that the scores achieved using the higher budget are significantly larger (p < 2 · 10⁻¹⁶) than the scores achieved using the lower budget. Thus, indeed, it is more probable to obtain a good solution using a higher budget. This is reflected in the accuracy and the number of produced redescriptions (compare Figures 9 and 11).

F.2 Comparison of Differentially Private Redescription Mining Algorithms
The percentages of successful executions (i.e., runs returning at least one redescription) for all differentially private approaches are presented in Tables 7 and 8. As can be seen from these results, AltMCMC and AltExpM fail with probability up to 0.13 when 100 runs are used (which should provide relatively stable statistics).

Parallel boxplots in Figure 9 show pooled distributions of redescription accuracy for filtered, significant redescriptions obtained across 10 executions with budget 1.0. The regular setting denotes InTr = 1 and RMIter = k, while the stable setting (s) denotes InTr = k and RMIter = 1. The second setting is called stable, since with these parameters AltMCMC and AltExpM have a substantially higher probability of choosing at least one initial attribute that yields the creation of redescriptions. We also present two different executions of MCMCTreePair, with the budget redistribution weight ω = 0.1 and ω = 0.5 (balanced, b). Distributions of the absolute difference between noisy and real redescription accuracy (|J_noisy(R) − J_real(R)|) in these settings can be seen in Figure 10 (lower numbers are better, since they denote a smaller deviation between the noisy estimate and the real accuracy of the produced redescriptions).

Perhaps the most noticeable observation from Figures 9 and 10 is the huge drop in redescription accuracy that occurs after stabilization of the alternation-based approaches. Expectedly, as the initial budget is lowered, the overall redescription accuracy decreases, whereas the absolute difference between noisy and real redescription accuracy increases (see Figures 11, 12, 13, and 14). This is the result of the increased noise in all algorithms; however, as can be seen from the results, MCMCTreePair is the most resilient to the added noise of the three presented approaches.

F.3 Execution Time Analyses
Algorithm execution times are presented in Figure 15.
As can be seen from Figure 15, the AltMCMC algorithm has the smallest execution times, followed by AltExpM (with execution times similar to those of MCMCTreePair). As data size increases, so does the variability in the execution times of the MCMC-based approaches. The AltExpM execution time across different runs changes only slightly on larger data.
As already explained, the main cause of variability is that, depending on the initial randomly chosen attributes and split points, larger parts of the trees may need to be rebuilt.

Figure 3:
Figure 2: Impact of differential privacy on the quality of redescriptions for the three proposed differentially private algorithms. Reported vs. true Jaccard indices for AltMCMC (top row), AltExpM (middle row), and MCMCTreePair (bottom row). Light gray diamonds are redescriptions that are filtered out due to small noisy support. Dark blue dots are insignificant redescriptions and lighter blue dots are significant.

Figure 5:
Figure 4: Impact of differential privacy on the quality of redescriptions and comparison to other algorithms using the NPAS, MAMMAL, and MIMIC-III datasets. Distribution of true Jaccard indices for AltMCMC (top row), AltExpM (second row), and MCMCTreePair (third row) and for the other non-differentially private algorithms (bottom row). In the top three rows, light grey bars correspond to redescriptions that are pruned due to low noisy support, dark blue bars to statistically insignificant redescriptions, and light blue bars to significant redescriptions. For the bottom row, the greyed bars indicate insignificant redescriptions and the coloured ones significant redescriptions. The corresponding numbers are given at the top. Where relevant, the numbers in the top-left corner of bar plots indicate the total number of redescriptions whose true Jaccard index value is essentially 0.

Figure 6: Runtime vs. data size. Average runtime in total CPU seconds across 10 runs.

Figure 7: Laplace noise distributions. The expected Laplace noise distribution used to compute base query support sizes when 40 iterations are used and trees of maximal depth 4 or 6 are built. Left: illustration with the full y-range. Right: zoomed-in view. The λ parameter, which determines the standard deviation √2 · λ of this distribution, equals 4 · 40 = 160 when negations are computed heuristically (in the presence of missing values), 4 · 40 · 17 = 2720 when they are computed exactly using trees of maximal depth 4 (if equal budget distribution is used), and 4 · 40 · 65 = 10 400 when they are computed exactly using trees of maximal depth 6.
When MCMC sampling is used, the time complexity is $O(|D| \cdot (\mathit{MCIter}_{AltMCMC} + 1 + |\mathit{attrs}(D)|))$. The first step in tree construction is to create the splitting points. The next step is to create a random tree of depth d, which can be done in time $O(|D|)$. Then, the MCMC iterations iteratively change the input tree; iterating over all entities of the dataset is required in only $\mathit{MCIter}_{AltMCMC}$ iterations. The time complexity of the tree-pair construction in MCMCTreePair is $O(|D| \cdot (\mathit{MCIter}_{MCMCTreePair} + 1 + |\mathit{attrs}(D_L \cup D_R)|))$. Constructing tree-pairs necessitates creating all splitting points for both data tables. First, a random pair of trees of depth d is created (doable in $O(|D|)$ time); next, the MCMC iterations iteratively change the input pair of trees. Iterating over all entities of the dataset is required in only $\mathit{MCIter}_{MCMCTreePair}$ iterations. Redescription generation has an average time complexity of $O(|D|)$ and a worst-case time complexity of $O(|D|^2)$. Since InTr and RMIter are constants for the average-case analysis, the overall average time complexities of the presented redescription mining algorithms equal the complexity of the underlying tree or tree-pair construction: $O(|D| \cdot (|\mathit{attrs}(D)| + |S_p|))$ for AltExpM, $O(|D| \cdot (\mathit{MCIter}_{AltMCMC} + 1 + |\mathit{attrs}(D)|))$ for AltMCMC, and $O(|D| \cdot (\mathit{MCIter}_{MCMCTreePair} + 1 + |\mathit{attrs}(D_L \cup D_R)|))$ for MCMCTreePair.

Figure 8: Value of MCMC scores across MCMC iterations. The scores achieved during the MCMC iterations of tree-pair construction of the MCMCTreePair algorithm, using the same initial targets with budgets ε = 10 (left) and ε = 0.0001 (right), for the first, middle, and last 1000 iterations of 100 000 total iterations. Corresponding histograms of the scores (center).

Figure 15: Runtime vs. data size. Runtime in total CPU seconds. The shaded area indicates one standard deviation above and below the average.

Table 1: Properties of the datasets.

Table 2: Spearman correlation coefficient (ρ) and statistical significance (p) of the correlation between noisy and real Jaccard index for pruned redescriptions reported by different approaches.
and C2_16 - CMS-2-16. Approaches with an initial budget of 1 and 0.1 are run 10 times. Approaches with an initial budget of 0.01 are run 100 times. Percentages are computed on the pooled redescription sets. The best results with initial budgets of 1, 0.1, and 0.01 are highlighted. "S" denotes the stable parameter setting and "b" the balanced redistribution weight.

initial budget on the NPAS dataset. Due to the small size of this dataset, the difference is not always significant when a smaller initial budget is used. As data size increases, AltMCMC and AltExpM have significantly better accuracy with a budget of 1.0, but already with a budget of 0.1, MCMCTreePair outperforms the other approaches on all datasets up to CMS-2-8. On relatively large datasets, namely CMS-2-8 and CMS-2-16, AltMCMC and AltExpM significantly outperform MCMCTreePair if they are not stabilized, whereas MCMCTreePair outperforms the other algorithms with lower initial budgets (0.1 or 0.01) when the other approaches are stabilized.

Table 7: Number of executions (out of 10) returning at least one redescription when InTr = 1 and initial budget ε = 0.1.

Table 8: Number of executions (out of 100) returning at least one redescription when InTr = 1 and initial budget ε = 0.01.