A framework for dependency estimation in heterogeneous data streams

Estimating dependencies from data is a fundamental task of Knowledge Discovery. Identifying the relevant variables leads to a better understanding of data and improves both the runtime and the outcomes of downstream Data Mining tasks. Dependency estimation from static numerical data has received much attention. However, real-world data often occurs as heterogeneous data streams: On the one hand, data is collected online and is virtually infinite. On the other hand, the various components of a stream may be of different types, e.g., numerical, ordinal or categorical. For this setting, we propose Monte Carlo Dependency Estimation (MCDE), a framework that quantifies multivariate dependency as the average statistical discrepancy between marginal and conditional distributions, via Monte Carlo simulations. MCDE handles heterogeneity by leveraging three statistical tests: the Mann–Whitney U, the Kolmogorov–Smirnov and the Chi-Squared test. We demonstrate that MCDE goes beyond the state of the art regarding dependency estimation by meeting a broad set of requirements. Finally, we show with a real-world use case that MCDE can discover useful patterns in heterogeneous data streams.


Introduction
The discovery of relationships between attributes is fundamental to many Data Mining applications, e.g., Feature Selection [32], Clustering [31] or Outlier Detection [21], and it is a prominent topic in the database community [6,18,46].Identifying groups of dependent features is critical to deal with high dimensionality and helps to filter out the features irrelevant for the task at hand.One typically estimates the 'relevance' of a feature by estimating its 'dependence' on a given target to predict.To do so, one often leverages well-known 'dependency estimators' such as the Pearson correlation coefficient, or Mutual Information [37].
Most dependency estimators only deal with numerical data, and one assumes that all relevant observations are available during estimation.In the real world, however, data often consists of an open-ended, ever-evolving stream of measurements or indicators of various types, e.g., numerical, ordinal, or categorical observations.Such data sources are known as Heterogeneous Data Streams (H-DS) [45].Knowledge Discovery in H-DS is known as one of the most challenging problems in Data Mining [8].The streaming nature of the data constrains system design in several ways, laid out by Domingos and Hulten [9] as follows: -Efficiency (C1) The system must spend a short constant time and a constant amount of memory to process each incoming record.-Single Scan (C2) The system may perform at most one scan over the data since access to past observations is often unavailable or impractical.-Adaptation (C3) Whenever the data distribution changes over time-a phenomenon that is known as 'concept drift' [1]-the system must adapt, e.g., by forgetting outdated information.-Anytime (C4) The system must be available at any point in time, with an output ideally equivalent (or nearly identical) to the one of a non-streaming system, operating without the streaming constraints.
Most approaches designed for streams can only handle numerical data.Since data may be of other types as well, heterogeneity is on our list of constraints: • Heterogeneity (C5) The system must handle not only numerical types but ideally all data types such as strings, categories, ordinal values.
That being said, and orthogonally to the streaming setting, modern dependency estimators have their own desirable features: -Multivariate (F1) Bivariate measures only apply to two entities (i.e., variables, vectors).Estimating the dependency between more than two entities is useful as well, but existing attempts to generalise bivariate measures lack efficiency or effectiveness [43].-General-purpose (F2) Dependency estimators should not restrict to specific types of dependencies.Otherwise, they may miss relevant attribute relationships.Existing multivariate estimators are typically limited to, say, monotonous or functional dependencies.-Intuitive (F3) A method is intuitive if its parameters are easy to set, i.e., users understand their impact on the estimation.Existing solutions tend to have unintuitive parameters, and the suggestion of 'good' parameter values happens (or does not happen) at the discretion of the inventors.-Non-parametric (F4) Since real data can exhibit virtually any kind of distribution, it is not reasonable to use measures relying on parametric assumptions.The risk is to miss relevant effects systematically.-Interpretable (F5) The results of dependency estimators should be interpretable.In particular, the returned estimate should have a maximum and a minimum, so that one can interpret and compare two given estimates.-Sensitive (F6) Dependency estimation is not only about detecting the existence of a relationship, but also about quantifying its strength.Data points generally are observations sampled from a potentially noisy process.The same dependency should get a higher score when observed with more observations, as the size of the observed effect-the 'effect size'-is larger.-Robust (F7) Real-world data may be of poor quality.Measuring devices often have limited precision, so that values are rounded or trimmed, leading to points with the same values.It is also common to discretise attributes, for a more compact representation.Such artefacts can have a negative influence on the estimation.Thus, estimators need to be robust against duplicates and imprecision.
To our knowledge, no existing solution satisfies all these requirements.In this paper, we propose an approach that fulfils them all.

Contributions
We present Monte Carlo Dependency Estimation (MCDE), a framework which satisfies both the constraints of heterogeneous data streams and the desirable features of dependency estimation.Over a given time window, MCDE estimates the dependency of an attribute set as the average discrepancy between marginal and conditional distributions, via Monte Carlo (MC) simulations.In each MC simulation, MCDE applies a condition on each attribute.Then a statistical test quantifies the discrepancy between the marginal and conditional distributions.We determine a lower bound for the quality of our estimates, which only depends on the number of MC simulations.Such bound allows users to trade estimation accuracy for a computational advantage.
We explore three instantiations of MCDE, i.e., three new dependency measures, dubbed Mann-Whitney-P (MWP), Kolmogorov-Smirnov-P (KSP) and Chi-Squared-P (CSP), which base on the corresponding statistical test.We show that using them in combination allows dealing with heterogeneous data.We describe their implementation and compare them in our experiments.
We introduce index structures for MWP, KSP, and CSP, to speed up contrast estimation.Our indexes support insertion/deletion operations for efficient estimation in streaming settings, e.g., over a sliding window.
We feature a use case against real-world data from Bioliq Ⓡ , a pyrolitic power plant [33], and show how one can leverage MCDE to discover interesting and useful patterns.We release our source code and experiments on GitHub,1 with documentation to ensure reproducibility.
This work extends our previous study [14].A first novelty is that we propose instantiations based on further statistical tests, i.e., KSP and CSP.Next, we explicitly address the H-DS setting: The streaming index operations and our real-world use case are new in this article.
Outline Section 3 reviews the related work.Section 4 describes MCDE and its instantiations as MWP, KSP and CSP and introduces our index operations for the streaming setting.Section 5 evaluates our approach and presents our use case.Section 6 concludes.We refer the reader to [14] for formal proofs and comparison with other methods.

Related work
Estimating correlation has been of interest for more than a century.Many bivariate measures exist, e.g., [35,38].Some of them also target at quantifying the association between two vectors which are possibly multivariate [2,17,25,39].However, they can only quantify the dependency between two entities-not between several ones (F1).They also may have other drawbacks.The Pearson correlation coefficient, for instance, is parametric (F4), targets at linear dependencies (F2) and is only applicable to numerical data (C5).
There are attempts to extend bivariate dependency measures to the multivariate case.Schmid and Schmidt [36] propose an extension of Spearman's to multivariate data, but it is limited to monotonous relationships (F2).Several authors also propose multivariate extensions of Mutual Information [40].For example, Interaction Information [27] quantifies the 'synergy' or 'redundancy' in a set of variables.Similarly, Total Correlation [44] quantifies the total amount of information.However, information-theoretic measures are difficult to estimate, as they require knowledge about the underlying probability distributions.Density estimation methods, either based on kernels, histograms or local densities, all require to set unintuitive parameters (F3) and may be computationally expensive (C1).Next, with many attributes, density estimation becomes meaningless due to the curse of dimensionality [3].Information-theoretic measures also are difficult to interpret (F5), since they usually correspond to a number of bits or nats, which is theoretically unbounded.
More recently, Cumulative Mutual Information [31], Multivariate Maximal Correlation [30], and Universal Dependency Score [28] were proposed as multivariate dependency measures.They are remotely related to concepts from information theory, as they rely on the so-called Cumulative Entropy [7].However, these measures are computationally expensive (C1) and unintuitive (F3).They also are difficult to interpret, because their theoretical maximum and minimum vary with the number of attributes (F5).

3
Distributed and Parallel Databases (2021) 39:415-444 Another approach, High Contrast Subspaces (HiCS) [21], is somewhat similar to ours.It uses subspace slicing as a heuristic to quantify the potential of subspaces to contain outliers.Nevertheless, HiCS only addresses static numerical data, and is not suitable as a dependency estimator.
The current state of the art to handle heterogeneity (C5) is to rely on discretisation, using methods such as the one proposed in [12].Then one can compute an information-theoretic measure, as in [29].However, any discretisation inevitably results in an information loss and may not work as dimensionality increases.
A new line of work focuses on estimating Mutual Information on numerical data streams.MISE [22] is a data summarisation technique to estimate Mutual Information over arbitrary time windows.Vollmer et al. [42] provide dynamic data structures to maintain Mutual Information over a sliding window.Vollmer and Böhm [41] extend this method to propose an anytime estimator for Mutual Information.However, the resulting estimates inherit the qualities and caveats of Mutual Information.

The MCDE framework
Dependency estimation determines to which extent a relationship differs from randomness.In this spirit, MCDE quantifies dependence, i.e., an extent of independence violation, based on marginal and conditional distributions.

Preliminaries
An H-DS is a set of attributes/dimensions/variables D = {s 1 , … , s d } and an open list of observations B = ( 1 , 2 , … ) , where i = ⟨x ij ⟩ j∈{1,…,d} is a vector of values with d attributes, and we see an attribute s j = (x 1j , x 2j , … ) as an open list of values.Since the stream is virtually infinite, we use the sliding window model: At any time t > 1 , we only keep the latest w observations, W t = t−w , … , t .We assume, without loss of generality, that observations arrive at equidistant time steps.Note that one could easily adapt our methods to other summarisation techniques, such as the landmark window or reservoir sampling [16].
We call a subspace S a projection of the current window W on d ′ attributes, with S ⊆ D and d ′ ≤ d .To formalise our framework, we treat an attribute s i ∈ D as a random variable X s i .We also make the distinction between numerical, ordinal and categorical attributes: -We say that s i is of numerical type ( s i ∈ Num ) if one can see X s i as a continuous variable on a given interval.-We say that s i is of ordinal type ( s i ∈ Ord ) if one can see X s i as a discrete vari- able, i.e., it can take a finite number of ordered values.-We say that s i is of categorical type ( s i ∈ Cat ) if one can see X s i as a categorical variable, with a fixed number of nominal categories.
Naturally, knowing whether a given attribute is of numerical, ordinal or categorical type requires domain knowledge.Typically, ordinal attributes have many tying values, while values from a numerical attribute are unique, given enough precision.On the other hand, values from categorical attributes might not be numeric and do not have any meaningfully ordering.p(X) is the joint probability distribution function (pdf) of a random vector X = ⟨ X s i ⟩ s i ∈S , and p(X) denotes the empirical estimation of this distribution.We use p s i (X) and ps i (X) for the marginal pdf and its estimation for each variable s i .P(S) is the power set of S, i.e., the set of all attribute subsets.For any subset S � ∈ P(S) , its random vector is X S � = ⟨ X s i ⟩ s i ∈S � , and its complement random vector is In our algorithms, ' ⊕ ' and ' ∧ ' stand for concatenation and element-wise logical conjunction.

Quantifying dependency via contrast
A set of variables is independent or uncorrelated if and only if all variables are pairwise mutually independent.By treating the attributes of a subspace as random variables, we can define the independence assumption of a subspace: Definition 1 (Independence Assumption) The independence assumption A of a sub- space S holds if and only if the random variables {X s i ∶ s i ∈ S} are mutually inde- pendent, i.e.: Under the independence assumption, the joint distribution of subspace S is expected to be equal to the product of its marginal distributions.We can define a degree of dependency based on the degree to which A does not hold: Definition 2 (Degree of Dependency) The degree of dependency D of a subspace S is the discrepancy, abbreviated as 'disc', between the observed joint distribution p o (X) and the expected joint distribution p e (X): The discrepancy is a random variable.While one can estimate it between two probability distributions, for instance, using the Kullback-Leibler divergence, this is not trivial here because p o (X) and p e (X) are a priori unknown.We work around this as follows: Lemma 1 The independence assumption A of subspace S states that the joint distri- bution for all S ′ ⊂ S is equal to its conditional distribution on S ⧵ S ′ : (1) Distributed and Parallel Databases (2021) 39:415-444 See [14] for proofs of this and the following lemmas.
Lemma 1 provides an alternative definition of A .However, it still has issues: First, multivariate density estimation is required to estimate p(X S � ) and p(X S � |X S � ) with |S ′ | ≥ 1 .Second, even if one could estimate p(X S � ) and p(X S � |X S � ) , estimating the densities for all S � ∈ P(S) is intractable.We instead relax the problem by consid- ering only subspaces with |S � | = 1 , i.e., we only look at the marginal distribution of single variables.

Definition 3 (Relaxed Independence Assumption)
The relaxed independence assumption A * of a subspace S states that the marginal distribution p s i (X) of each variable s i ∈ S equals p s i (X|X s i ) , i.e., the conditional distribution of s i : Lemma 2 (Independence Assumption Relaxation) A(S) ⇒ A * (S) , i.e., we can relax A into A * for any subspace S.
Loosely speaking, the relaxed independence assumption holds if and only if the values of all variables but s i do not reveal any information on s i .
Next, A(S) ⇒ A * (S) , then ¬A * (S) ⇒ ¬A(S) , i.e., showing that A * does not hold is sufficient but not necessary to show that A does not hold.Thus, we can define a relaxed degree of dependency D * of a subspace S, namely the discrepancy disc of the observed marginal distribution p o s i (X) and the expected one p e s i (X) .Under the relaxed independence assumption A * , we have p e s i (X) = p o s i (X|X s i ) .We define D * as the expected value [.] of this discrepancy:

Definition 4 (Relaxed Degree of Dependency)
This definition includes a whole class of dependency estimators, e.g., [21], which aim at quantifying the so-called contrast of a subspace.D * -or contrast-is a vari- ant of D which is much easier to estimate: First, it relies on the comparison of mar- ginal against conditional distributions, i.e., multivariate density estimation is not required.Second, the number of degrees of freedom of A * (S) increases linearly with |S|, but exponentially for A(S) .Thus, estimating D * instead of D is in line with the strict efficiency requirements of data streams.
By definition, D * does not take the dependency between multivariate subsets into account, but only of each variable versus all others.However, we argue that this relaxation is not problematic, and it even supports interpretability.In real-world scenarios, the detection of dependency is only interesting as long as we can observe effects w.r.t. the marginal and conditional distributions: one is typically looking for interpretable influences of particular variables on the system and vice versa [19].

Estimating conditional distributions
The difficulty when estimating D * is estimating the conditional distributions, because the underlying data distributions are unknown.As proposed in [21], one can simulate conditional distributions by applying a set of conditions to S, in a process called subspace slicing.We handle heterogeneity by differentiating between numerical, ordinal and categorical attributes: where l j , u j is a continuous interval, l j … u j is a discrete interval, and v j ∶ v j ∈ s j is a set of values of s j .w ′ < w is the number of observations per condition.We call s i the reference attribute, the only attribute without a condition.We write that k ∈ c i if k fulfils all the conditions in c i .We define c i as the complementary slice, i.e., it contains all observations which are not in c i .p s i |c i (X) and p s i |c i (X) denote the condi- tional distribution of the observations in the slice c i and its complement c i respec- tively.P c (S) is the set of all possible slices in subspace S.
We choose each condition in a slice randomly, but so that they contain w ′ obser- vations.Note that ordinal and categorical attributes (e.g., gender) may have many tying values.In such a case, a random condition might not precisely have w ′ ele- ments.Our solution is to take a random condition containing at least w ′ observations and remove elements from the condition until only w ′ observations remain.
We set w � = � w �S�−1 √ � with ∈ (0, 1) , so that, under the independence assump- tion, the expected share of observations in the slice equals .As a result, subspace slicing happens in a dimensionality-aware fashion.When is a constant, the expected number of observations per slice does not change between subspaces with different dimensionalities.Thus, subspace slicing is a dynamic grid-based method based on the dimensionality of the subspace.
Under the A * -assumption, the conditional distribution p s i |c i is equal to the mar- ginal distribution p s i , for any attribute s i and slice c i .For brevity, we omit '(X)' in p s i (X) and p s i |c j (X) in the following.

Discrepancy estimation
In reality, one only has a limited number of observations, i.e., one only has access to empirical distributions.A solution is to use a statistical test T : However, the number of observations is finite, and the observations that we use to estimate ps i |c i are part of the ones used to estimate ps i so far.This is problematic, as statistical tests assume the samples to be distinct.Plus, when ≈ 1 , ps i |c i converges to ps i , i.e., the two populations are nearly the same.Conversely, ≈ 0 yields spu- rious effects, since the sample from ps i |c i then is small.We solve the problem by observing that p s i |c i and p s i |c i , the conditional distribution from the complimentary slice c i , are equal under A * .
Theorem 1 (A * and Complementary Conditions) Hence, one can evaluate A * by looking at the discrepancies between the condi- tional distribution and its complementary conditional distribution.When doing so, the samples obtained from both distributions are distinct.
Our dynamic slicing scheme is defined based on a parameter , the expected share of observations in the slice c i .Thus, the expected share of observations in c i equals 1 − .As a result, we set = 0.5 so that = .This choice is judicious for statistical testing, as having samples of equal size leads to higher statistical stability, and we get rid of parameter .
We illustrate slicing in heterogeneous subspaces in Fig. 1, with an exemplary numerical and categorical subspace in the left half and a heterogeneous subspace in the right half.The black lines show a random slice.The points in dark blue are in c i , and the points in light orange are in c i .Figure 1a represents a numerical lin- ear dependency.We can see from the histograms that, after slicing, the distribution of the points in each sample are very different.Figure 1b depicts the absolute frequencies of observing various symptoms {U … Z} in different groups of patients {A … F} .Since there is no ordering within symptoms and patient groups, slicing consists of selecting categories at random, here {B, C, E} .By comparing the abso- lute frequencies after slicing, we can determine whether there is a statistical association between groups and symptoms.Naturally, the statistical test that we use to estimate the discrepancy between p s i |c i and p s i |c i might differ depending on the type of the reference attribute, as we will discuss later. (5) Different attribute types can also be part of the same subspace, as we show in Fig. 1c and d.We graph the height from a sample of individuals of two sexes.When we slice on the x-axis, the slice is a numerical interval.On the y-axis, in turn, the slice is a category drawn at random.
Intuitively, ordinal attributes share features from both numerical and categorical attributes: There exists an ordering between values, but typically also a large number of tied values.In this case, we recommend to use a similar slicing methodology as for numerical attributes, by selecting a discrete interval (see Definition 5), and a statistical test that is robust to tying values.

Properties of contrast
A statistical test T B 1 , B 2 between two samples B 1 and B 2 typically yields a p-value, which one uses to determine the statistical significance of a hypothesis.Conversely, p = 1 − p is known as the confidence level.The rationale behind estimating the degree of dependency D * is to yield values quantifying the independence violation.We define contrast, abbreviated as C , as the expected value of the confidence level of a statistical test T between the samples from the conditional distributions for all the possible attributes s i and slices c i : where T yields p-values, and B(c i ), B(c i ) are the samples resulting from slicing.We draw the conditions in c i randomly and independently, w.r.t.any reference attribute s i in subspace S. By definition, T ∼ U[0, 1] when the two samples are independent, and T ≈ 1 as the evidence against independence increases.The properties of C follow: 1. C converges to 1 as the dependency in S increases. 2. C converges to 0.5 when S is independent, since T ∼ U[0, 1]. 3. C is bounded between 0 and 1.

Monte Carlo approximation
It is impossible to compute C exactly; one would need to know the distribution of B(c i ) and B(c i ) for every slice.Instead, we approximate C via Monte Carlo (MC) simulations, with M iterations.In each iteration, we choose the reference attribute s i and slice c i at random.We define the approximated contrast Ĉ: where c i m ∼ P c (S) means that we draw c i randomly from P c (S) in iteration m.
Interestingly, we can bound the quality of the approximation.From Hoeffding's inequality [20], we derive a bound on Ĉ w.r.t.C , which decreases exponentially with increasing M: where M is the number of MC iterations, and 0 This bound is very useful.For instance, when M = 200 , the probability of Ĉ to deviate more than 0.1 from its expected value is less than 2e −4 ≈ 0.04 , and this bound decreases exponentially with M. Thus, one can adjust the computational requirements of Ĉ given the available resources, the desired quality level, or the rate of arrival of new observations in a stream.In other words, users can set M intuitively, as it leads to an expected quality level and vice versa.Observe that M is our only parameter.

Instantiating MCDE
Finally, one must instantiate a suitable statistical test as T .Ideally, this test is non- parametric (F4) and suitable for the type of the reference attribute (numerical, ordinal, categorical).To facilitate meaningful experiments, we investigate instantiations of MCDE based on three well-known non-parametric tests: the Kolmogorov-Smirnov, the Mann-Whitney U and the Chi-Squared test.We call the respective instantiations KSP, MWP and CSP.
The Kolmogorov-Smirnov test assumes the data to be continuous, i.e., it should be adequate for numerical attributes.The Mann-Whitney U test explicitly handles tying values.Thus it might work well with ordinal attributes.Lastly, the Chi-Squared test bases on frequencies from a finite number of categories, i.e., we hypothesise it to be suitable for categorical attributes.( 9) slice ← Slice(I, r) 5: result ← result + Test(I, slice, r) 6: return (result/M)∈ (0, 1) Algorithm 1 summarises the general idea behind MCDE for any arbitrary subspace S = {s 1 , … , s d } of dimensionality |S| = d .In practice, we can significantly improve the efficiency of slicing operations, which require the values of each attribute to be ordered, with an index structure (Line 1).Afterwards, for M iterations, we slice the data (Line 4) and carry out the statistical test (Line 5).The final estimate is the average of the test outcomes over M iterations.In what follows, we present the specifics of constructindex, slice and test for each instantiation of MCDE.

Instantiation as Mann-Whitney-P (MWP)
We first consider the instantiation of T as a two-sided Mann-Whitney U test [26].An advantage of this test is that it does not assume the data to be continuous, as it operates on ranks.So it is robust and applicable to virtually any kind of numeric or ordinal measurements.

Marginal restriction
In a nutshell, the Mann-Whitney U test compares the difference between the median of the two samples.However, it is known that the ability of this test to detect dependency-the so-called 'statistical power'-declines in the case of unequal variance of the two samples [11,47].To alleviate this issue, we include an additional step into the slicing process.It restricts the domain of the reference attribute s i to a share of observations.Formally, we define the marginal restriction as follows: Definition 8 (Marginal Restriction) A marginal restriction is a condition on the reference attribute s i , i.e., an interval We illustrate in Fig. 2 the impact of the marginal restriction.We show in the left half a circular dependency and in the other half a linear dependency.Two grey lines show a marginal restriction (in Fig. 2b, d), and two vertical dashed lines stand for the median of each sample.After slicing, both samples have highly unequal variance (see Fig. 2a, c).However, in Fig. 2a, the median of both distributions are nearly equal.Thus, this dependency remains undetected.The marginal restriction solves this problem, as we see in Fig. 2b.However, there is almost no difference between Fig. 2c and d.Intuitively, the marginal restriction 'breaks the symmetry' 1 3 Distributed and Parallel Databases (2021) 39:415-444 between both distributions.Because of that, the MWP estimator with marginal restriction has higher statistical power against certain dependency types.(l i , xi ) ← sort (r i , x i ) by s i in ascending order, break ties randomly 4: Algorithm 3 shows the slicing process.We successively mask the row numbers based on a random condition for all but one reference attribute s r .Additionally, we ensure that the condition boundaries do not split any tying values and that each condition has w ′ observations.The algorithm returns a slice , i.e., a list of w Boolean values, so we write slice ∈ ℤ w 2 .

Implementation details
Algorithm 3 MWP-Slice(I : {I 1 , . . .,I d }, r ∈ {1, . . ., d}) Algorithm 4 implements the statistical test based on our index.We determine a restriction [start, end] on s r and sum the adjusted ranks of the observations in the slice.Since the ranks in this subset may not start from 0, we adjust the sum of the ranks R 1 (Line 10).Then we compute a correction term (Line 13) using the cumula- tive correction b r to adjust for ties (Line 14).Φ 1∕2 is the cumulative distribution function of the half-Gaussian distribution.

Instantiation as Kolmogorov-Smirnov-P (KSP)
We now describe another instantiation of MCDE, which uses the two-sample Kolmogorov-Smirnov (KS) test as a statistical test for T .The (two-sample) KS test is non-parametric and tests the equality of two continuous one-dimensional probability distributions.It is adequate in case the reference attribute is numerical.However, the KS test has less power in the presence of ties [24].So it may not work well with ordinal attributes.
In a nutshell, the two-sample Kolmogorov-Smirnov statistic D is the supremum of the absolute differences of the empirical cumulative distribution functions F 1 (x) and F 2 (x) of Samples 1 and 2 with n 1 and n 2 elements: HiCS [21] employed this test statistic to quantify the contrast of a subspace.However, to comply with the MCDE framework, one must first derive the corresponding p-value.The p-values are not trivial to obtain, because the distribution of D does not have any known closed form, and the time required for an exact computation increases with n 1 and n 2 in particular.
We approximate the p-values using the asymptotic Kolmogorov-Smirnov distri- bution proposed in [10]: Empirically, we found that summing up the first 1000 terms of the expansion provides enough accuracy for our application, without much impact on the execution time.Using this approximation is common practice within most modern statistical software, such as R [34].
Algorithm 5 is the pseudo-code for the index construction.The difference to MWP is that we do not need to adjust ranks or precompute tie correction, because the Kolmogorov-Smirnov test assumes no tying values.The resulting data structure contains the indexes l i and the values x i of each attribute s i , ordered by s i in ascending order.
Algorithm 5 KSP-ConstructIndex (l i , xi ) ← sort (r i , x i ) by s i in ascending order, break ties randomly 4: Similarly, Algorithm 6 is responsible for slicing but does not require any further step to handle ties.Algorithms 2 and 5, as well as Algorithms 3 and 6 behave in the same way whenever the data does not have ties.Algorithm 7 implements the KS test.We compute the statistic D, i.e., the maximal difference of the two empirical cumulative distribution functions in Lines 7-10.Then we approximate the p-value with Eq. 12 in Line 12.

Instantiation as Chi-Squared-P (CSP)
The Chi-Squared test, also known as Pearson's Chi-Squared test, perhaps is the most famous non-parametric statistical test.In short, it determines whether there is a significant difference between the expected frequencies and the observed frequencies of a set of categories.
For a reference variable s i ∈ Cat with categories A = {a 1 , … , a |A| } , we sketch the contingency table w.r.t. the two samples B(c i ) and B(c i ) in Table 1, where o i j is the absolute frequency of Category a j in Sample i, and we have: where e j i = o i ⋅ o j ∕w is the expected absolute frequency.Under independence, Q follows the 2 distribution with cumulative distribution function where k = |A| − 1 is the number of degrees of freedom.Thus, 2 k (Q) leads to the p-value that we use for CSP.Similarly to the other instantiations, we improve the execution time by constructing an index.It contains the position of each occurrence of a categorical value binned into its respective category.Algorithm 8 is our pseudo-code for its construction.We can construct the index in linear time with a single pass over each attribute, as it does not require any sorting.
Algorithm 8 CSP-ConstructIndex Define I i as a mapping of categories to positions ⊂ {0, . . ., w − 1}, counts ∈ N + 3: Algorithm 9 is our slicing procedure for CSP.The main difference to MWP and KSP is that the values of the index do not have any meaningful ordering.Thus, slicing consists of selecting a random set of categories per attribute.The algorithm ensures that exactly w ′ observations are part of each condition.( 14) Finally, Algorithm 10 shows how to compute the Chi-Squared statistic, based on the information from the index and a subspace slice.

Heterogeneity
For simplicity, we have described KSP, MWP and CSP, assuming a homogeneous data set, being numerical, ordinal and categorical respectively.Each of our algorithms treats the attributes of a subspace independently.So we can easily extend Algorithm 1 to the heterogeneous setting, as we show in Algorithm 11.We construct the index of each attribute depending on its type (Lines 1-3) and use the corresponding slicing methodology (Lines 7-9).The resulting slice is the element-wise conjunction for each type (Line 10).The type of the reference 1 3 Distributed and Parallel Databases (2021) 39:415-444 attribute determines which test we should use (Lines 11-13).Independently from the underlying statistical test, the p-values have the properties described in Sect. 4.2.4.So the final MCDE score is the average of the p-values over each iteration.

Adaptation to the streaming setting
To deal with streams, we adopt the well-known sliding window model, i.e., we only consider the w most recent observation.A naive way to support this model is to recompute the index at the arrival of each new observation.Instead, we propose efficient insertion and deletion operations for our indexes.
Furthermore, to maintain a dependency estimate over time, we propose to perform a number M of MC iterations periodically and report the exponential moving average: where is the so-called decaying factor, and Δ is the step size.
We update the MWP index in Algorithm 12 in two steps: step 1: insert/delete and step 2: refresh.Our algorithm maintains two data structures: a queue, which determines for each new point the value of the point to delete in the current window, in a first-in-first-out fashion, and a variant of our static index which supports binary search.In the first step, we store the values for each attribute in a queue, in chronological order.Then we find the positions where to insert the new point and where to delete the oldest point in the index via binary search.In the second step, we recompute the adjusted ranks and cumulative correction as in Algorithm 2.
Using a data structure like a binary tree, step 1 only has logarithmic complexity, while step 2 has linear complexity.Besides this, one must perform step 1 for each new point, but step 2 only once before slicing.So when re-estimating contrast only every Δ-th step, we perform step 1 for each point, but step 2 lazily.As a result, we can update the index in O(d ⋅ log(w)) in step 1 for each new observation and post- pone step 2, which is in O(d ⋅ w) , to contrast estimation.Updating the KSP index ( 16) is somewhat simpler because KSP does not handle tying values.The CSP index is unsorted and thus step 1 only requires constant time.We summarise the complexity of each step in Table 2 and refer the interested reader to our source code. 2or efficiency, Algorithm 12 simultaneously inserts and deletes observations.Note that one could also perform the insert and delete operations via two independent methods, i.e., effectively handling time-based windows, in which observations may arrive at arbitrary time steps, or in batches.

Experiments
To evaluate our dependency estimators, i.e., MWP, KSP and CSP, we look at the scores they produce w.r.t. an assortment of dependencies.See Fig. 3; we scale the dependencies to [0, 1].For each dependency, we repeatedly draw n objects with d dimensions, to which we add Gaussian noise with standard deviation , which we call noise level.Intuitively, noise-free dependencies should lead to higher scores than noisier ones.We also show that MCDE is robust and handles heterogeneity by simulating numerical, ordinal, and categorical attributes.To simulate ordinal attributes, we discretise numerical attributes into a number Ω of values from 1 to 20.To simu- late categorical attributes, we randomly permute the discretised values, to mimic the absence of an order.
Similarly to other studies [23,28,35], we compute the statistical power, defined as follows: Definition 9 (Power) The power of an estimator E w.r.t.dependency O with , n and d is the probability of the score of E to be larger than a -th percentile of the scores w.r.t. the independent subspace I: n×d is a random instantiation of a subspace as dependency O with noise level , which has n objects and d dimensions.{x} P stands for the -th percentile of the set {x} , i.e., a value v so that % of the values in {x} are smaller than v.
The attributes of the independent subspace I are i.One can interpret 'power' as the probability to correctly reject the independence hypothesis with % confidence.I.e., the power quantifies how well a dependency measure can differentiate between the independence I and a given dependency O with noise level .For our experiments, we set = 95 , n = 1000 .We let the noise vary linearly from 0 to 1, with 30 distinct levels.We consider dependencies with dimensionality d from 2 to 20.
In our figures, O Ω denotes each dependency, where O stands for the depend- ency type (e.g., L stands for 'Linear'), and Ω is the discretisation level, i.e., the number of distinct values; O means that the attributes are numerical.d=x indicates the dimensionality, and O * Ω indicates that the attributes are categorical, with a number Ω of nominal values.

Specifics of attribute types
We first observe how MCDE handles numerical, ordinal, and categorical attributes.Figure 4 displays the empirical distribution of MWP, KSP and CSP iterations w.r.t. a 2-dimensional independent (I) and a linearly dependent (L) space.We visualise each distribution as a histogram from 10 000 independent iterations.The vertical dashed lines show the means of the distributions, and we display a scatter plot illustrating the respective scenario.According to Fig. 4a, MWP, KSP and CSP values are uniformly distributed in the case of an independent subspace, with a few exceptions: First, the CSP values are all close to 0 in the case of numerical attributes.In this setting, each point is unique, i.e., the Chi-Squared test treats each observation as a distinct category, and thus does not have power.We do the same observation with L (in Fig. 4b).Second, we see that the values of MWP and CSP are also 0 with I 1 , which corresponds to the desired behaviour.Every observation in the subspace is equal, so contrast is undefined.KSP assumes that values are continuous, and thus does not handle this case.
Also, we can see from Fig. 4b that the KSP values are generally closer to 1 with L and L 10 , which indicates more power than MWP and CSP.However, we can see that the CSP distribution does not change between L 10 and L * 10 , while the mean of the MWP and KSP distribution decreases significantly.Thus, CSP detects dependency from categorical attributes better than MWP/KSP.

Statistical power of MWP, KSP and CSP
We first look at the statistical power of MWP, KSP and CSP with confidence level = 0.95 against a linear dependency of increasing dimensionality d, discretisa- tion level Ω , and noise level .From Fig. 5, we see that MWP without marginal restriction does not work well in high-dimensional and highly discretised spaces.The marginal restriction alleviates this problem to some extent, in particular for numerical subspaces.In fact, as dimensionality increases, it becomes more and more likely that the points selected in the slice are 'in the centre' of the distribution.The mean of the points in the slice and outside of the slice become nearly equal, leading to the low power of the Mann-Whitney U test, and thus a slight performance decrease of MWP.This calls for further research on the MWP slicing scheme, or alternatives to the Mann-Whitney U test.Nonetheless, the results indicate that MWP with marginal restriction works well against numerical attributes.
Next, we see that KSP has high power in every case, although slightly decreasing with Ω .CSP does not work with numerical spaces but has more power in discrete spaces.CSP works best with categorical attributes.
We now compare the power of MWP, KSP, and CSP against the assortment of dependencies from Fig. 3. CSP does not apply to numerical attributes.Thus, for comparability, we discretise the values with Ω = 10 .We can see from Fig. 6 that KSP consistently has more power than MWP.CSP generally has less power than KSP but can detect categorical dependencies.
To summarise, our experiments show that MWP has a slight performance decrease in high-dimensional discrete spaces.KSP seems to perform better, but its statistical power decreases with discretisation.Overall, we recommend to use KSP against both numerical and ordinal data but to use CSP for categorical data.MWP still is a valid alternative with numerical attributes.

Scalability
We evaluate the scalability of index construction for each approach, by increasing the size of the sliding window w from 10 2 to 10 5 in an independent space I with three dimensions.The red line ('Construction') is the average time for creating the index with window size w (Algorithms 2/5/8).The other lines show the average time to insert a new point into the window.
In Fig. 7, we can see that the construction time of each index increases almost linearly with increasing window size w.The KSP index is less expensive to update than MWP, regarding step 2 in particular.The first step of the CSP index update is very efficient, as it requires more or less constant time, cf.Table 2.We can see that only performing step 1 in our update operations, while delaying step 2 to the contrast estimation step, reduces the execution time by up to 3 orders of magnitude, compared to standard index construction.So we can significantly speed up the monitoring of MCDE contrast using the index update operations.
In Fig. 8, we compare the execution time of contrast estimation for MWP, KSP, and CSP, with increasing window size w.We can see that the three approaches have a comparable execution time.KSP is slightly slower for small window sizes because the p -values are more expensive to obtain than with the other approaches.However, as the window size increases, KSP and MWP have the same execution time.CSP contrast estimation appears to be slightly slower as the window size increases but does scale as well.

Deployment to the streaming setting
We monitor contrast in a subspace in which the dependency gradually evolves.We simulate this setting by concatenating 100 three-dimensional linear dependencies with 1000 observations each and a level of noise linearly increasing from 0 to 1.We use the approach described in Sect.4.6.2,Eq. 16 and instantiate MCDE with MWP.We estimate the dependency over a sliding window of size w = 1000 and with a decaying factor = 0.99.
We let the step size Δ vary from 1 to 1000 and the number of iterations M from 1 to 500.We compare each configuration to a baseline, which is the most expensive configuration ( Δ = 1 , M = 500 ), without the benefit of our update operations.When Δ > 1 , we simply set the current contrast estimate to the value from the latest esti- mation.We define the following measures: -The Absolute Error is the average absolute difference between the values obtained with the tested configuration and the values from the baseline.-The Relative Time is the ratio of the time required by the tested configuration over the time required by the baseline.-The Index Speedup is the ratio between the time required by the tested configuration without/with our index update operations.
We can see from Fig. 9 that the absolute error decreases with Δ , while the relative execution time increases.The speedup obtained by our index operations is mainly responsible for this gain of efficiency.As we increase the number of iterations M, the errors tend to decrease, but at the same time, contrast estimation dominates the overall execution time.In such cases, we see less benefit from our efficient insert/ delete operations.We identify the configuration M = 50 , Δ = 50 as a sweet spot: For an absolute error as small as ≈ 0.01 , the computational burden is reduced by up to 100 times, with a consistent index speedup of 3. We mark this configuration with a star * .

Pattern discovery
We now apply MCDE to a real-world multivariate time series.We collected the data during a 4-day production campaign at Bioliq, a pyrolysis plant in the surroundings of Karlsruhe [33].It contains one measurement per second, i.e., 345 600 observations, from a selection of 20 physical sensors in various components of the plant.We monitored the evolution of dependency between each sensor pair with w = 1000, Δ = 50 , M = 50 , as just explained.We obtained the evolution of depend- ency between the 20 sensors using a single CPU core in about 2 hours.Note that it would be easy to shorten the computation time significantly by parallel processing.
We have presented the results of our monitoring technique to the plant operators.They have identified several patterns which they deemed 'interesting', i.e., patterns yielding insights that could help with plant operation.As we see, the dependency between these two measures changes significantly over time.Some of the changes, marked in the figure from 1 to 4, appear to represent different stages in the production process.A better understanding of the dynamics of the physical measures involved in the reaction will help the plant owners to operate smoothly and efficiently.

Discussion
Our experiments show that MCDE fulfils all H-DS requirements and has the desirable features of a framework for dependency estimation.
First, we can see that MCDE is efficient (C1), and, thanks to the index update operations, one can use it in combination with the sliding window model to mine data streams in a single scan (C2) and adapt (C3) the contrast estimation over time, taking concept drift into account.Second, one can also reduce or increase the number of MC iterations M to trade between accuracy and computation time, in an anytime (C4) fashion.So our method also is intuitive to use (F3) and interpretable (F5).The different instantiations of MCDE allow dealing with various attribute types within the same data stream (C5).Last, MCDE is non-parametric (F4) by design.Our experiments against an assortment of dependencies show that it is multivariate (F1), general-purpose (F2) and robust (F7).MCDE is sensitive (F6) because estimates are the average of statistical p-values.Our previous work [15] has shown the superiority of MWP over other methods.

Conclusions
In this paper, we have proposed a framework to estimate multivariate dependency in heterogeneous data streams.It fulfils all requirements one would expect from a state-of-the-art dependency estimator.MCDE provides high statistical power on a large panel of dependencies while being very efficient.Furthermore, we introduced index operations for the streaming setting and illustrated the benefits of our framework against a real-world use case.
Future work could focus on improving our monitoring scheme: While updating the MCDE score via an exponential moving average appears to be a valid option, MCDE could benefit from a more flexible update mechanism, using for instance a sliding window of adaptive size [4].Finally, we have only considered three instantiations of MCDE; it would be interesting to study the integration of further statistical tests, e.g., see [13] and [5].

3
Distributed and Parallel Databases (2021) 39:415-444 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material.If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.To view a copy of this licence, visit http://creat iveco mmons .org/licenses/by/4.0/.

( 8 )
C(S) ≡ c i ∈P c (S) T B(c i ), B(c i ) Heterogeneous ii.

Algorithm 2
is the pseudo-code for the index construction.The index I is a one- dimensional structure containing the adjusted ranks and tying value corrections for each attribute.It consists of |S| = d elements {I 1 , … , I d } , where I i is an array of 4-tuples [(l 1i , x1i , a 1i , b 1i ), … , (l wi , xwi , a wi , b wi )] ordered by s i in ascending order.In this index, l i are the row numbers, xi are the sorted values of s i , a i are the adjusted ranks and b i the accumulated correction terms.I i [j] stands for the j-th tuple of I i , and l ji , xji , a ji , b ji are its components.Algorithm 2 MWP-ConstructIndex 1: for i = 1 to d do 2: r i ← [0, . . ., w − 1] 3:

Fig. 5
Fig. 5 Power against continuous and discrete linear distributions (d from 2 to 20)

Fig. 7 8
Fig. 7 Time required for index construction and update w.r.t.window size w

1 3
Figure 10 displays one of these patterns.It is the result of monitoring two sensors, namely the pressure at the reactor input, and the oxygen concentration at the output.As we see, the dependency between these two measures changes significantly over time.Some of the changes, marked in the figure from 1 to 4, appear to represent different stages in the production process.A better understanding of the dynamics of the physical measures involved in the reaction will help the plant owners to operate smoothly and efficiently.