Correcting statistical models via empirical distribution functions

We consider the two-sample homogeneity problem where the information contained in two samples is used to test the equality of the underlying distributions. In cases where one sample is simulated by a procedure modelling the data generating process of another observed sample, a mere rejection of the null hypothesis is unsatisfactory. Instead, the data analyst would like to know how the simulation can be improved. Based on the popular Kolmogorov–Smirnov test and a general mixture model, we propose an algorithm that determines an appropriate correction distribution function. Complementing the simulation sample by a given proportion of observations sampled from this distribution reduces the Kolmogorov–Smirnov distance between the modified and the observed sample. Therefore, the correction distribution indicates possible improvements to the current simulation process. We prove our algorithm to run in linear time when applied to sorted samples. We further illustrate its intuitive results on simulated as well as on real data sets from astrophysics and bioinformatics.


Introduction
Our work focuses on developing a new statistical method to assess the quality of simulations and to provide useful information that can help to improve simulation procedures. Since often arbitrary amounts of simulated data can be generated, the algorithm is mainly designed for applications in different domains of research dealing with medium to large data sets, but it is also applicable for small sample sizes. In order to give a clear motivation to the problem under study, we consider the gamma ray detectors MAGIC-I and MAGIC-II. These telescopes are located at the Roque de los Muchachos on the Canary Island La Palma. The interested reader is referred to Cortina et al. (2009) and The MAGIC Collaboration (2014) for detailed information. The telescopes make use of a mirror surface of over 200 square metres to measure atmospheric signals induced by the interaction of high energetic photons, called gamma rays, with the atmosphere. Since they do not have an electric charge, gamma rays do not interact with magnetic fields and thus provide useful information about their sources which the physicists want to explore. However, these are not the only particles generating such atmospheric signals. For each gamma ray in the measurements there are about 1000 observations of so called background events, which are not of interest in the given context. The background events mainly consist of protons, but also contain heavier hadrons and electrons. Classification algorithms based on characteristics of the measured signals could be applied in order to distinguish between the background and the gamma particles. Unfortunately, these methods cannot be trained on real data, because it is not labelled. Therefore, simulation procedures for gamma rays as well as for protons have been constructed based on models of particle propagation and improved in several steps. The main program that is used to generate such simulations is CORSIKA (Heck et al. 1998).
Clearly, it is of major importance to compare the simulated proton samples with the observed data, which is the aim of our algorithm. On the one hand, suitable artificial background data is crucial for the classification analysis. Hence, variables with low agreement of generated and real data must be identified. On the other hand, small deviations between simulations and real data can be caused by gamma ray signals. If one assumes to have a reasonable simulation, variables with comparably high discrepancies can be quite helpful in the upcoming classification task.
A typical statistical approach to check the similarity of two data sets is the application of a nonparametric two-sample test like for example the two-sample Kolmogorov-Smirnov test. However, a mere rejection of the null hypothesis is not satisfying in practice. If the simulations are seriously inadequate, the data analyst wants to quantify the issues by identifying the regions with too many or insufficiently many observations, respectively. Such information can then be used to update COR-SIKA or related methods using more suitable hyperparameters or even give rise to the inclusion of additional simulation steps in the particle simulation. If the discrepancies between the samples potentially stem from the gamma ray signals, their quantification is necessary as well, allowing to assess and validate the gamma ray simulations.
In the present work, we develop a novel approach allowing to gain additional insight into the discrepancies between an observed and a simulated sample based on the twosample Kolmogorov-Smirnov test. Note that while the application aims at designing a reasonable simulation procedure, our contribution helps to improve an existing simulation procedure which is based on prior domain specific knowledge. We therefore assume that such a simulation procedure exists a priori. In order to compare the data sets, we make use of a mixture model linking the distributions of the observed and the simulated samples by a third distribution. The latter represents all discrepancies between the first two and will be called correction distribution hereafter. We propose an algorithm that determines an empirical distribution function of this correction distribution and a mixing proportion in the mixture model. Both are determined such that the resulting mixture with the simulated data does not lead to another rejection by the Kolmogorov-Smirnov test when compared to the observed data. Note that the algorithm in principle does not aim at statistical testing and thus the type I and type II error after modification of the simulated sample are not investigated. The method rather makes use of the quantiles of the Kolmogorov-Smirnov distribution to obtain intuitive bounds on the distance between empirical distribution functions. The algorithm does not construct a mixture fitting the observed data perfectly, but leads to a reasonably close approximation taking the sample variance into account. The amount of closeness can be regulated by the critical value K α or equivalently by the significance level α and may be adjusted for a given application. For the sake of brevity, we formulate the problem focusing on the improvement of a simulation procedure in the following. However, the method can also be applied to approximate the distribution of subgroups in the data as argued for the gamma ray signals above.
There already exist various semi-and nonparametric suggestions on mixture models in the literature. They focus on the estimation of the densities and the number of components in a mixture model of a given single sample. In this situation the authors often have to work in a multidimensional setting that allows to impose more information about the structure. As shown by Hall et al. (2005), the quantities in a reasonable mixture model with two nonparametric components are not identifiable for one-and two-dimensional problems even under certain independence assumptions. The methods often rely on adjusted versions of the EM algorithm (Pilla and Lindsay 2001) or a Newton method (Wang 2010;Schellhase and Kauermann 2012) and in addition can make use of appropriate data transformations (Hettmansperger and Thomas 2000). There also exist several nonparametric approaches to problems involving multiple samples and finite mixture models, as for example proposed by Kolossiatis et al. (2013). However, to the authors' knowledge, there is no literature addressing the two sample problem outlined above in the context of mixture models.
We close this gap by proposing a correction of one sample to resemble another sample based on the corresponding empirical distribution functions. This allows us to derive simple, monotone and asymptotically distribution free confidence bands. Working with nonparametric density estimators may seem more intuitive, especially since many of them (e.g. Schellhase and Kauermann 2012) attain the appealing form of convex combinations of suitable basis functions: The natural idea to solve our problem in this setting is forcing the density approximation of the simulated sample to be part of the basis for the estimation on the observed sample. One thus first estimates the density on the simulated sample. In a second step this estimatorĝ is fixed as one of the basis functions, say f 1 =ĝ, for the estimation of the density on the observed sample. The main problem with this approach is that in general one cannot guarantee that the corresponding coefficient a 1 properly reflects the importance of the simulated data. Often a 1 will be way too small if the remaining basis functions are not chosen appropriately, because these also contribute to the region of values modelled by f 1 . In our setting this corresponds to discarding the simulation almost completely, which is not desirable. Since choosing the remaining basis functions in an adequate way is nontrivial, straightforward application of density based approaches are not satisfactory in our application. In addition, the optimisation of the coefficients and potentially the basis functions themselves can be computationally costly, especially in the medium to large sample cases we aim for.
The remainder of this paper is structured as follows: In Sect. 2 we formalise the problem and propose a mixture model related to the nonparametric two-sample problem. We introduce several desirable properties of the model parameters and formulate two optimisation problems allowing to identify them. In Sect. 3 we present an algorithm to solve the problems introduced in the second section and provide intuitive explanations of the main ideas of each step of our algorithm. The proofs of correctness and linear running time are conducted in Sect. 4. In Sect. 5 our procedure is applied to real and simulated data and the results are illustrated. Section 6 concludes with a summary and an outlook on possible future work.

Problem definition
In this section we introduce our basic notations and consider a general mixture model based on nonparametric estimators. Within this model, all discrepancies between the distributions of the observed and the simulated data, respectively, are represented by a third distribution. Our task will be to asses this so called correction distribution in order to identify the aforementioned discrepancies. To this end, the model is transferred to an empirical equivalent. Thereafter, several constraints on parameters of the empirical model are motivated allowing to identify them properly.
We observe the sample x 1 , . . . , x n 1 ∈ R stemming from an unknown continuous distribution F. The underlying data generating process is modelled by a simulation procedure represented by the distribution G. To evaluate the quality of the simulation, consider n 2 simulated observations y 1 , . . . , y n 2 drawn independently from G. If the simulation procedure works well, G resembles F and thus the samples are similar.
To test the equality of F and G using the two-sample Kolmogorov-Smirnov test denote the empirical distribution functions of the samples by F e and G e , respectively, and let N = n 1 ·n 2 n 1 +n 2 . Setting M = R the null hypothesis H 0 : F = G is rejected if the test statistic exceeds an appropriately chosen critical value K α .
In order to consider the procedure from a different perspective, we define an upper boundary function U setting U (x) = min(1, F e (x) for all x ∈ M, and in analogy define a lower boundary function L by called the confidence band. As argued above, we are interested in the regions of undersampling respectively oversampling, i.e., the regions where G e violates L or U .
To model the problem described above, we work with the fairly general twocomponent mixture model where the so-called mixture proportion or shrinkage factors measures the degree of agreement of F and G while the distribution H represents all dissimilarities between F and G. Since F is fully described by G,s and H , we are interested in identifying s and H , because these quantities contain all relevant information for an appropriate modification of G.
It is clear that the choices = 0 and H = F solves the above equation. However, this solution is not of interest in our setting, because the data analyst wants to correct and not to discard the current simulation, which is often based on expert knowledge. This may give more insight into the data generating process itself and is thus preferable. In the other extreme case,s = 1, the simulation is correct and H is irrelevant. However, for anys ∈ (0, 1) the corresponding H is unique and demixing F, that is estimating s and H , provides useful information for improving the simulation.
Since the distributions F and G are not available in practice, we replace the corresponding distribution functions by the standard empirical estimators F e and G e . Combining the Kolmogorov-Smirnov distance with the above mixture model, we propose to identify a shrinkage factor s ∈ [0, 1] and a correction function H such that the function that is a superset of the set of all distribution functions on R. Obviously, neither s nor H are unique in this situation. Hence, in the following we set some additional constraints and describe the problem in more detail allowing us to determine reasonable solutions.
Since we work with empirical distribution functions, all derived quantities are characterized by their values on the joint sample x 1 , . . . , x n 1 , y 1 , . . . , y n 2 . Therefore, instead of considering all functions H ∈ M, we restrict ourselves to those which may be discontinuous only on Z = z 1 , . . . , z n 1 +n 2 consisting of the ordered joint sample. We denote this set of functions by M * ⊂ M. This restriction is not very strong regardless of the sample sizes, since it does not make a difference for the Kolmogorov-Smirnov distance whether we stick to the original data or add observations on intermediate values. In addition, the choice of the value for an added observation would be arbitrary between two of the given data points with respect to our distance. Thus, we focus on the original observations which has the additional advantage of lower computational cost. Also, the sample sizes in applications we aim for are comparably large so that the restricted set is often quite dense.
Motivated by the fact that the data analyst is interested in making as little changes as possible concerning the current simulation, we can make the mixture proportion s in the model identifiable by choosing s maximally such that the mixture F fits the observed data. This directly implies a minimal weight 1 − s for the correction function H. We thus formulate Problem 1: Note that for s * = K α √ N and H * = 1 1−s * · L the property s * · G e + (1 − s * ) · H * ∈ B holds. Thus, the optimal value of s, called s opt in the following, is always greater than 0. Hence, the simulated data is always properly included in the mixture.
After Problem 1 is solved, the resulting mixture F = s opt · G e + (1 − s opt ) · H lies in B. Since this does not imply the property lim x→∞ F(x) = 1, the function H could be an improper distribution function. Therefore, there might exist several choices of H solving Problem 1 given s opt . However, consider the pointwise minimal function H min defined by ∀z ∈ Z : H min (z) = min H(z). Hereby, the minimization is taken over the non-empty compact set of all functions H ∈ M * satisfying F min = s opt · G e + (1 − s opt ) · H ∈ B. The function H min is clearly unique. To find a reasonable distribution function H, we propose to first identify and then enlarge H min , that is to construct H such that H ≥ H min , in an adequate way, so that the final mixture is a proper empirical distribution function lying in B.
Before we formulate this enlargement as an optimisation problem, we want to point out that, quite intuitively, H min should not be enlarged for small z ∈ Z . In particular, if F min intersects the upper boundary U , adding mass before the maximal value z ∈ Z where F min (z) is equal to U (z), that is, z meq = max z∈Z {z|F min (z) = U (z)}, leads to violations of U in z meq . Note that in case of such an intersection, the global Kolmogorov-Smirnov distance on M = R between the final mixture and F e will be the radius of the confidence band, regardless of the enlargement of H min . However, on subsets of R the distance can be improved if H min is enlarged appropriately. Hence, we propose to identify z norm , the smallest value after z meq such that adding mass after z norm minimises the Kolmogorov-Smirnov distance restricted to the set M norm = {z ∈ Z |z ≥ z norm }. We then add the probability mass in such a way that the minimal distance D {z≥z norm } is attained. If there is no intersection between F min (z) and U (z), we set z meq = min {Z } and proceed in the same way. Using the notations introduced above, finding a suitable distribution function H for a given value of s opt can be formalised in Problem 2: An optimal solution to Problem 2 is called H opt . The corresponding final mixture is denoted by Note that, even with these constraints, the solution to the problem of identifying F ∈ B may not be unique. Although the shrinkage factor s opt is unique by its maximality property, there may be several optimal enlargements of H min equally appropriate in the sense of the restricted Kolmogorov-Smirnov distance.

The algorithm
In this section we propose an algorithm for solving Problems 1 and 2 introduced in Sect. 2. At first, the main procedure is described. All subsequent subroutines called within the main algorithm are explained in more detail hereafter. Pseudocode is provided in order to illustrate the algorithm and its subroutines.

The main algorithm
Algorithm 1 is our main procedure to solve Problems 1 and 2. It requires two sorted sample vectors x ∈ R n 1 , y ∈ R n 2 and a significance level α. At first, it calculates the empirical distribution functions F e and G e of the samples and determines the critical value K α at level α. In fact, K α is the α-quantile of the distribution of Durbin 1973). For the typical significance levels α 1 = 0.05 and α 2 = 0.01 the critical values are K α 1 = 1.358 and K α 2 = 1.628, respectively. The values s and F, candidates for the shrinkage factor s opt and the final mixture F opt , are initialised and the lower bound for performing a binary search is set to s * (cf. the description of Problem 1). The upper and lower boundary functions of the confidence band around F e , U respectively L, are computed next. These steps can be considered as preprocessing and are carried out in the lines 1 and 2. The two-sample Kolmogorov-Smirnov test does not reject the null hypothesis of equal distributions if the relation L ≤ G e ≤ U holds. In this case the empirical distribution functions resemble each other well enough and the algorithm stops in line 4. If the test rejects the null hypothesis, the algorithm carries out certain steps to determine an optimal mixture within the confidence band. To solve Problem 1, the following operations are applied iteratively in the main loop in lines 5-11: a candidate F lying above the upper boundary somewhere has to be shrunk, that is, multiplied by a factor from the interval (0, 1), in order to correct the violation of U . This problem is addressed in line 7 in the so called Shrink-Down algorithm. On the contrary, a candidate falling below the lower boundary L must receive additional probability mass in appropriate regions. This is taken into account in line 9 by calling the Push-Up algorithm. The two operations are applied whenever necessary in the presented order. However, since they have opposite effects, some data situations require multiple executions of the Shrink-Down and the Push-Up step. Iteration of these steps generates a decreasing sequence of upper bounds to s opt . The well-known binary search technique embedded in the demixing algorithm in line 10 takes another approach by bounding s opt from below and above. It is connected with the Shrink-Down and Push-Up step by using the current shrinkage factor s learned from them as an upper bound to s opt . In return, the binary search updates s and F, which are then passed to the Shrink-Down and Push-Up steps. The lower bound for the optimal shrinkage factor, l b , is updated by the binary search itself.

Algorithm 1: Demixing-Algorithm
Input : Sorted observations x ∈ R n 1 , y ∈ R n 2 , significance level α Output: Optimal shrinkage factor s opt , optimal correcting function H opt ∈ M * Once the main loop is terminated, the optimal shrinkage factor s opt and the corresponding minimal correction function H min introduced on page 6 are determined and thus Problem 1 is solved. The normalisation step in line 12 takes care of Problem 2 returning an optimal mixture F opt . This allows to identify a reasonable correction function H opt in line 13 by rearranging Eq. (2), which is returned afterwards together with the optimal shrinkage factor s opt .
In the remainder of this section the subroutines of the main algorithm are described in detail.

The Shrink-Down algorithm
This procedure is applied whenever a candidate F exceeds the upper boundary U at some point. Following the mixture model (1), it is intuitive to solve this problem by computing the maximal shrinkage value s d ∈ (0, 1) such that s d · F does not violate U any more. In other words, F is shrunk down. The maximal shrinkage factor to achieve this is where we set a 0 = ∞ for every a > 0. The Shrink-Down subroutine presented in Algorithm 1 calculates this factor in line 1. Then, the total shrinkage and the candidate function F are updated accordingly and are eventually returned.

Algorithm 2: Shrink-Down
Input : Current values of F and s Output: Updated values F and s

The Push-Up algorithm
The Push-Up step presented in Algorithm 3 is carried out whenever the current candidate F violates the lower boundary of the confidence band, L. In order to increase the values of the mixture in the problematic regions, probability mass must be added there. Note that F may lie below L before the smallest value z ∈ Z where F(z) equals U (z), called z eq = min z∈Z {z|U (z) = F(z)}, as well as after that point. However, these two cases have a crucial difference. Adding probability mass before z eq leads to a new violation of the upper boundary U in z eq , while adding mass after z eq does not imply this problem. In order to distinguish between these cases, the algorithm first identifies z eq in line 1. This value z eq equals max(Z ) after initialisation with F = G e because F(max(Z )) = G e (max(Z )) = 1 = U (max(Z )) holds. As we argue later in Lemma 4, z eq is also well defined after modifications of F.
If there are violations of L before z eq , a shrinkage is necessary. Thus, keeping in mind Problem 1, the maximal shrinkage factor s u must be identified, so that the residuals to L before z eq do not exceed the residual to U in z eq after shrinking. Otherwise, adding appropriate probability mass will cause a violation of U in z eq . More formally, the shrinkage factor F (z eq )−F (z) } using basic arithmetic transformations of the constraint and F(z eq ) = U (z eq ). After s u is determined in line 3, the shrinkage factor s as well as F are updated.
In order to shift the current candidate F appropriately, the positive residuals d(z) = max {0, L(z) − F(z)} to L are computed for all z ∈ Z . These are the minimal amounts which must be added to F so that the lower boundary L is no longer violated. The residuals d are added to the current correction term F −s ·G e and the sum is minimally monotonised, cf. line 7. The result, denoted by H, is added to s · G e yielding the new candidate mixture F.

Algorithm 3: Push-Up
Input : Current values of F and s Output: Updated values F and s 1 z eq ← min

The Binary search algorithm
The binary search step presented in Algorithm 4 is called at the end of every iteration in the main loop. Its input consists of l b and u b , the current lower respectively upper bound for s opt . While l b is derived from previous binary search steps, u b is set to the current value of s. The algorithm computes the average of the given bounds in line 1. Using this candidate, the minimum monotone step function H b is computed such that F b = s b · G e + H b ≥ L holds, cf. lines 2 and 3. This is done in analogy to the corresponding lines in the Push-Up step.
If F b violates the upper boundary U , then, by minimality of H b , no monotone step function for the shrinkage factor s b can exist such that the corresponding mixture lies within the confidence band B. Therefore, as implied by the monotonicity property proved in Lemma 1 below, it holds that s > s b > s opt . Thus, in this case the algorithm updates s to s b as a new upper bound for s opt and sets the current mixture candidate to F b in lines 6 and 7. Otherwise, again by Lemma 1, the relation s opt ≥ s b > l b must hold, since there exists a monotone step function for the shrinkage factor s b leading to a mixture in B. Thus, s b is a better lower bound to s opt so that l b is updated to s b , while all other quantities are kept.

Algorithm 4: BinSearch
Input : l b and u b , current lower and upper bounds on s opt Output: Updated values F , s and l b

The normalisation step
As we show in Theorem 1 below, Problem 1 is solved when the loop of Algorithm 1 (lines 5-11) stops. At this point, the current value of s is the optimal shrinkage factor s opt , while the current mixture is F = s opt · G e + (1 − s opt ) · H min and lies within the confidence band. However, as pointed out in the description of Problem 2, F may not be a proper distribution function since lim x→∞ F(x) < 1 may hold. This deficiency is corrected by the normalisation step presented in Algorithm 5.

Algorithm 5: Normalise
To check whether F must be enlarged, the algorithm computes z meq , the maximal value z ∈ Z where F(z) equals U (z). When there is no intersection of F and U , which can happen if the last candidate mixture was proposed by the binary search, the algorithms sets z meq = min(Z ). If z meq = max(Z ) is satisfied, the property F(max(Z )) = F(z meq ) = U (z meq ) = U (max(Z )) = 1 holds, so no further corrections are necessary and F is returned.
Otherwise, as stated in the motivation to Problem 2, adding any probability mass before z meq would lead to a violation of U in z meq . Since s opt is already determined, such a violation cannot be repaired by further shrinking as in the Push-Up step. Thus, probability mass has to be added after z meq . In fact, the region where mass should be added can be restricted even further. Therefore, we denote by z norm the smallest value z ∈ Z such that z > z meq and F(z) < F e (z) holds. Since adding mass between z meq and z norm cannot reduce the Kolmogorov-Smirnov distance between F e and F, we focus on all z ≥ z norm in the following.
Hence, the deviations d(z) = F e (z) − F(z) are computed for all z ≥ z norm in line 7. Deviations above the remaining mass 1 − F(max(Z )) are decreased to this value. Hereafter, the algorithm checks whether the maximal increase of F above F e , the maximum of all negative deviations −d(z), is greater or equal to the imposed maximal decrease of F below F e , namely 1 − F(max(Z )). As long as this is the case, adding probability mass will not decrease the Kolmogorov-Smirnov distance. Hence, in line 9 the algorithm determines the last position where the above property holds and updates z norm to be greater than this position. This yields the set M norm = {z ∈ Z |z ≥ z norm }, where a reduction of the Kolmogorov-Smirnov distance is possible. At the latest, M norm is the last region where F lies below F e .
In order to compute a distribution function H norm , which has to be added to the remaining region M norm , the residuals d are considered on this set. Determining H norm there may be seen as an L ∞ isotonic regression problem. Since we work in the setting of distribution functions, a monotone function should be constructed, which fits the residuals d(z) best in the sense of the L ∞ -norm. Unweighted isotonic regression problems under the L ∞ norm can be efficiently solved in linear time for sorted samples. This can be achieved by a simple approach, which is referred to as Basic by Stout (2012). This method is applied in line 11 of Algorithm 5. For each residual it computes the maximum of all previous values and the minimum of all subsequent values and determines the regression value as the average of these two quantities.
Note that a solution to the isotonic regression may be negative, but the distribution function H opt must be nonnegative. However, as we will prove in Lemma 5, setting all of its negative values to 0 results in an optimal solution to the isotonic regression problem constraint to nonnegativity. Since no correction is applied before z norm , H norm is set to 0 before z norm in line 12. Finally, F is updated and returned.

Analysis of the algorithms
In this section theoretical results for the algorithms of Sect. 3 are provided. Among other things, we prove a monotonicity property allowing to apply the binary search technique to Problem 1 and demonstrate that the Shrink-Down and Push-Up step always lead to upper bounds on s opt proving their correctness. While the first part of this section deals with the correctness of our algorithm, the second one presents its runtime analysis.
We introduce the essential notations used repeatedly in our proofs. The shrinkage factor of G e , the correction function and the mixture candidate after the kth iteration of the main loop of Algorithm 1 (lines 5-11) are denoted by s k , H k and F k = s k ·G e +H k , respectively. In order to initialise them, we set s 0 = 1, H 0 = 0 and F 0 = G e . Let s d,k denote the update of the shrinkage factor determined in the Shrink-Down step in the kth iteration. Whenever this update is not computed, we set s d,k = 1. The update of the shrinkage factor determined in the Push-Up step of the kth iteration is called s u,k and treated in the same way.

Correctness of the algorithm
Our first result, mainly proving the correctness of the binary search step, shows that the property of lying within the confidence band is monotone in s. In other words, for any s > s opt a corresponding mixture violates a boundary of B, while for s ≤ s opt it is always possible to find a mixture lying in B.
Proof First we recall the definition of Problem 1 from page 6: where M * denotes the set of all nondecreasing, nonnegative step functions varying on Z only and converging to 0 for x → −∞. Now we introduce an alternative characterization of s opt by Problem A: Before we proceed with proving the proposition, we show the equivalence of Problem 1 and Problem A. For this sake, choose an arbitrary s ∈ [0, 1] such that (3) holds. Then for all z ∈ Z it follows that s · G e (z) ≤ U (z) − (1 − s) · H(z) ≤ U (z) by nonnegativity of (1−s)·H, which proves that the inequality (4a) holds. Furthermore, choose z < z from Z arbitrarily. Then z ) follows by monotonicity of H. Thus (4b) is also respected.
For the other direction let s ∈ [0, 1] respect constraints (4a) and (4b). From (4a) it is clear that s · G e (z) never exceeds the upper boundary. From (4b) we know that correcting any deficiency to the lower boundary L is possible without violating the upper boundary U on subsequent positions. Thus choosing will result in a mixture within the confidence band. This means that (3) holds.
We now make use of the above equivalence of Problem 1 and Problem A to prove the proposition: For s ∈ (s opt , 1] the property s ·G e +(1−s)·H / ∈ B immediately follows by definition of s opt . So let s ∈ [0, s opt ] be arbitrarily chosen and note that constraints (4a) and (4b) are respected for s opt . From this we deduce that both conditions must also hold for s since ∀z ∈ Z : s · G e (z) ≤ s opt · G e (z) ≤ U (z) and furthermore for all z , z ∈ Z with z < z it follows .
Since, as argued before, constraints (4a) and (4b) are equivalent to constraint (3), there exists an H for which the mixture s · G e + (1 − s) · H lies in B, which completes the proof.
In the next lemma, the correction function H k computed in the kth iteration of the main loop for the shrinkage factor s k is considered. We prove that H k is indeed the minimal function in M * resolving violations of the lower boundary L. This result contributes to the correctness of our construction of H min and is used in the subsequent proofs.
Lemma 2 H k is the pointwise minimal function among all H ∈ M * satisfying s k · G e + H ≥ L.
Proof Let H k,min ∈ M * be the minimal function with the property s k · G e +H k,min ≥ L. To prove the result we must thus show H k = H k,min . Now, the correction function H k is either computed in the binary search step or in the Push-Up step. In the first case, the residuals between s k · G e and the lower boundary L are determined and then minimally monotonised, cf. lines 2 and 3 of Algorithm 4. This monotonisation is performed by considering the maximum of preceding values and is therefore minimal. Hence, this procedure must yield H k,min . In the remainder of this proof we thus treat the second case, namely the computation of H k in the Push-Up step.
Following the lines 6-7 in Algorithm 3, we denote by d k the positive deficiencies to L after the shrinking in the Push-Up step of iteration k, i.e., d k = max(0, L − s d,k · s u,k · F k−1 ). SettingF k = s d,k · s u,k · F k−1 + d k , the correction function H k can be expressed as H k = mon F k − s k · G e . Thereby, mon( f ) denotes the minimal monotone function such that mon( f ) ≥ f . This monotonisation is performed analogically to the one in the binary search step by considering the maximum of preceding values. Note that the monotonising operator is itself monotone, that is, mon( f 1 ) ≤ mon( f 2 ) holds if f 1 ≤ f 2 . We show the proposition by induction: Base case k = 1: By assumption, H 1 is computed in the Push-Up step, so s 1 = s d,1 · s u,1 holds. In addition, F 0 is defined by F 0 = G e . Hence, d 1 = max(0, L − s d,1 · s u,1 ·F 0 ) = max(0, L −s 1 · G e ) ≤ H 1,min must hold, since the last inequality holds by definition of H 1,min . Because ofF 1 = s 1 ·G e +d 1 we obtain H 1 = mon(F 1 −s 1 ·G e ) = mon(d 1 ) ≤ mon(H 1,min ) = H 1,min , where the inequality follows by the monotonicity of the monotonising operator. Thus H 1 ≤ H 1,min is established. To prove the other inequality, note that H 1 ∈ M * and H 1 = mon(d 1 ) ≥ d 1 . Hence, H 1,min ≤ H 1 follows by the definition of H 1,min . Altogether, we get H 1,min = H 1 .
Inductive step k − 1 ⇒ k: The shrink updates s d,k and s u,k are bounded by 1 by construction and thus the inequality s k ≤ s d,k · s u,k · s k−1 ≤ s k−1 holds. Hence, the shrinkage factor s k does not increase in k and therefore the corresponding minimal correction function H k,min does not decrease in k. Consequently, we get H k,min ≥ H k−1,min ≥ s d,k ·s u,k ·H k−1,min . The correctness of the (k −1)-th step assumed by the induction principle yields H k−1,min = H k−1 resulting in H k,min ≥ s d,k · s u,k · H k−1 . Now, rewriting s d,k · s u,k · F k−1 to s k · G e + s d,k · s u,k · H k−1 allows to interpret d k as the minimal function which must be added to s k · G e + s d,k · s u,k · H k−1 so that the lower boundary L of the confidence band is not violated any more. Together with H k,min ≥ s d,k · s u,k · H k−1 established above this implies s d,k · s u,k · H k−1 + d k ≤ H k,min . Since in addition d k is by construction minimally chosen such that Applying the monotonising operator and exploiting its monotonicity this implies and therefore H k ≤ H k,min . To prove H k ≥ H k,min , note that H k is a function in M * . The inequalities (5) imply L ≤ s k · G e +H k . So, by definition of H k,min , H k ≥ H k,min follows and thus overall H k = H k,min holds.
The next result shows that the Shrink-Down step always leads to overall shrinkage factors not lower than s opt and therefore may be used as an improved upper bound for s opt in the binary search procedure.

Lemma 3 If s k > s opt then s d,k+1 · s k ≥ s opt .
Proof The proposition is trivial for s d,k+1 = 1 so in the following s d,k+1 < 1 is assumed. This means that the (k + 1)th Shrink-Down step is not skipped but executed. So F k must lie above the upper boundary U for some values.
Together with the definition of s d,k+1 , this ensures the existence of a z eq ∈ Z such that s d,k+1 · F k (z eq ) = U (z eq ) holds. In the following we consider the two possible cases for the correction function H k = F k − s k · G e : Case H k (z eq ) = 0: Using the definition of z eq and F k , we deduce where the last inequality follows since 0 < s d,k+1 < 1, 0 < s k ≤ 1 and 0 < G e (z eq ). The latter is satisfied, because otherwise 0 = G e (z eq ) and H k (z eq ) = 0 immediately imply 0 = U (z eq ), which is a contradiction to the positivity of U . The calculations above show that the function G e lies above the upper boundary U in z eq before any shrinking. However, the first Shrink-Down step would solve this problem and because of H k (z eq ) = 0 there cannot be a new violation of U in z eq in subsequent steps. Hence, k = 0 and consequently s k = 1 must hold. We have that s k · s d,k+1 = s d,1 ≥ s opt holds in this case, since s d,1 is by construction the maximal shrinkage factor for avoiding violations of U before adding any correction function.
Case H k (z eq ) > 0: LetH ∈ M * be the minimal function one must add to s d,k+1 · s k · G e in order to correct violations of the lower boundary L. Due to s d,k+1 ≤ 1 we get s d,k+1 · s k · G e ≤ s k · G e and thusH ≥ H k holds by minimality of H k shown in Lemma 2. This allows to prove Thus, s d,k+1 · s k · G e +H violates the upper boundary of the confidence band and thus does not lie in B. By minimality ofH Lemma 1 yields s d,k+1 · s k > s opt , which completes the proof.
The following proposition concerns the additional shrinkage performed in the Push-Up step. Similarly to Lemma 3, it states that a Push-Up step cannot lead to factors below s opt and therefore yields the correctness of using the overall shrinkage factor to improve the upper bound on s opt .
Proof The statement is immediately given for s u,k+1 = 1. It is also clear in case of k = 0 by construction of the shrink update s u,1 . So in the following let s u,k+1 < 1 and k ≥ 1 hold. We prove the proposition by contradiction so assume Consider the preceding candidate F k . F k / ∈ B must hold, because otherwise the algorithm would have stopped after k steps. Furthermore F k ≥ L is guaranteed by construction of the Push-Up and binary search steps. Therefore, F k must violate the upper boundary U in the assumed case k ≥ 1. Thus, a Shrink-Down step was executed before the current Push-Up step. Hence, the point is well defined, as pointed out in the description of the Push-Up step. The assumption s u,k+1 < 1 implies that a Push-Up step is carried out and ∃z ∈ Z : z < z eq . By definition of z eq , each z < z eq satisfies s d,k+1 · F k (z) < U (z) ≤ U (z eq ) and hence we deduce that Now consider the point By the definition of Also consider z = min{argmax z≤z eq H k (z)}. Using the minimal property of H k proved in Lemma 2, for k ≥ 1 one can deduce F k (z ) = L(z ), which implies z < z eq . For all z ≤ z we obtain where the first inequality holds because of F k ≥ L by construction of F k . Combining this result with the already mentioned fact that z < z eq holds, we get z ≥ z . Together with the monotonicity of H k and the definition of z we deduce We now combine (6), (7), (8) and (9) to prove the proposition. By Lemma 3 and s opt ≥ s * > 0 shown on page 6 the inequality s d,k+1 · s k > 0 holds. Thus, we can define s u2 = s opt s d,k+1 ·s k and inequality (6) implies which allows us to show

So we have
Hence we get = 0 by (9) ) ≤ H opt (z eq ) + s opt · G e (z eq ) where we also used s opt · G e + H opt ≥ L, which holds by definition of H opt . Thus, the upper boundary U is violated for s opt , which contradicts its definition, so that the proposition follows.
The next result justifies the way we correct a solution to the unconstrained isotonic regression problem in line 11 of Algorithm 5. To be more precise, we show that setting its negative values to zero leads to the same L ∞ -error as in the constrained problem and therefore yields an optimal solution to the latter. Keep in mind that the unconstrained isotonic regression problem is solved by the Basic approach (Stout 2012), which computes the maximum of all previous values and the minimum of all subsequent values for each observation and determines the regression value as the average of these two quantities. Proof We show the statement considering the two distinct cases min(x) ≥ 0 and min(x) < 0 consecutively. At first, assume that min(x) ≥ 0 holds. Then, by construction of x L , we can deduce x L ≥ 0. Thus, x L0 is equal to x L and, as a nonnegative vector, satisfies L ∞ (x, x Lc ) ≤ L ∞ (x, x L0 ). Since introducing constraints to a problem cannot lead to a better value of the objective function in the optimum, it must hold that L ∞ (x, x Lc ) ≥ L ∞ (x, x L ) = L ∞ (x, x L0 ). Together this yields the result restricted to the case min(x) ≥ 0.
We now consider the case min(x) < 0. Since the negative values of x L set to zero in x L0 result in a maximal deviation of − min(x) to x, we get L ∞ (x, x L0 ) = max(L ∞ (x, x L ), − min(x)). Also, min(x) < 0 and x Lc ≥ 0 imply L ∞ (x, x Lc ) ≥ − min(x), so that we deduce where the last step follows, because a constraint problem cannot lead to a solution with a better value of the objective function compared to the corresponding unconstrained problem. Thus, follows from the definition of x Lc , since x L0 ≥ 0. Both together yield the result restricted to the case min(x) < 0, which completes the proof.
Using the above results we prove the correctness of our algorithm in the following theorem.
Theorem 1 Algorithm 1 returns s opt and a corresponding solution H opt both optimal in the sense of Problems 1 and 2, respectively.
Proof Lemma 1 shows that for s > s opt no mixture can lie within the confidence band B while for s ≤ s opt there always exists a mixture lying in B. By the monotonicity of this property the binary search step converges to s opt . Lemmas 3 and 4 allow to update the upper bound of the binary search by the values of the shrinkage factor after each Shrink-Down and Push-Up step. Hence, these steps further reduce the range of possible candidates for s opt , while never excluding s opt and therefore the correct s opt is still determined. Lemma 2 implies, that the correcting function H k after termination of the main loop of Algorithm 1 is the function H min introduced on page 6, which is required for solving Problem 2. Having found the set M norm in the lines 5-10 of Algorithm 5, we use Lemma 5 to see that the corrected solution to the unconstrained L ∞ isotonic regression problem is an optimal solution to the constrained problem. Thus, it is a valid solution H opt , which completes the proof.

Runtime analysis
For the runtime analysis we introduce a precision parameter ε. Note that ε never appears in our pseudo code or actual implementation. Instead, think of it as the machine precision, which might depend on the physical architecture, operating system or programming environment. Note that the main loop of Algorithm 1 in lines 5 to 11 runs until the mixture F is in the confidence band up to an additive deviation of ε. In other words, the loop stops when for all z ∈ Z the property L(z) − ε ≤ F(z) ≤ U (z) + ε holds. In the following theorem we prove that this condition is met after a constant number of iterations yielding an overall running time linear in the input size and logarithmic in 1 ε . Note that we exclude the Ω(n log n) time needed for computing the cumulative distribution functions by assuming sorted input data, but rather focus on the linear running time of the actual analysis. iterations the size of the interval decreases to at most 2 − log 2 2 ε < ε 2 . So, after log 2 2 ε iterations every value between the upper and lower boundary lies within additive precision ε 2 to s opt . Consider an s ∈ [s opt − ε 2 , s opt + ε 2 ] and let H s ∈ M * be the minimal function such that s · G e + (1 − s) · H s ≥ L holds. Using s ≥ s opt − ε 2 we see that s · G e ≥ s opt − ε 2 · G e = s opt · G e − ε 2 · G e ≥ s opt · G e − ε holds. The property s · G e ≥ s opt · G e − ε 2 implies (1 − s) · H s ≤ (1 − s opt ) · H opt + ε 2 and we deduce because s opt · G e + (1 − s opt ) · H opt ≤ U holds by definition of s opt and H opt . An analogous argument shows s · G e + (1 − s) · H s ≥ L − ε. Thus, the stopping criterion L − ε ≤ F ≤ U + ε is met after log 2 2 ε iterations and the result follows.

Application
In this section we evaluate the algorithm by applying it to simulated and real data sets. We compare the running time of our algorithm to alternative procedures on artificial data, investigate its capability to estimate the disagreement of the distributions for finite Gaussian mixtures and examine its performance in case of false rejections of the null hypothesis. Furthermore, the algorithm is illustrated on astrophysical data and on spectrometry data from a biological domain. All methods conducted in this section are carried out using the significance level α = 0.05.
For our empirical evaluation we implemented all algorithms using the statistical software R (R Core Team 2013), version 2.15.1-gcc4.3.5. To run the experiments in a batch and to distribute the computations to the cores of our computer the R-package BatchExperiments by Bischl et al. (2013) was used. The computations were conducted on a 3.00GHz Intel Xeon E5450 machine with 15GB of available RAM running a SuSE EL 11 SP0 Linux distribution.
Note that our algorithm represents the determined correction distribution by its cumulative distribution function. However, since densities and the first two moments allow to capture the main features of a distribution more intuitively, we present our results via estimated densities and empirical moments rather than using the determined correction distribution functions itself. To attain those, we first determine the empirical density function corresponding to the determined correction distribution function by considering consecutive differences of H opt . We then generate 10,000 artificial observations from this density using weighted sampling. Finally, the standard kernel density estimator as well as the empirical mean and the empirical variance are computed on this artificial data. Note that in applications this approach is not mandatory, since the determined distribution function contains all relevant information available. Therefore, improving simulations based on this distribution function directly is perfectly fine in practice and there is not absolutely necessary to introduce the density and moment estimation, which we conduct for the purpose of presentation only. Thus, the kernel density estimation and the moment estimation are not regarded as part of our method. To asses their effect the interested reader is referred to Serfling (1980) and Devroye and Gyrfi (1985).

Performance and runtime evaluation
In order to evaluate the algorithm output we first consider the popular setting of finite Gaussian mixtures. We generate equally sized dataset pairs for each of the sample sizes n 1 = n 2 = 100, 500, 1000, 5000, 10,000, 50,000, 100,000. In every data set pair one sample is drawn from a standard Gaussian distribution. The other also consists of observations from the standard Gaussian distribution to a fraction of s = 0.3. The remaining fraction of 0.7 of the observations stem from a second Gaussian distribution with mean 3 and standard deviation 1. Our demixing algorithm is therefore supposed to notice the different distributions of the samples, estimate a mixing proportion of about 0.3 and recommend a correction distribution with a mean near 3 and a standard deviation near 1.
We also investigate another data case for the same sample sizes and n 1 = n 2 , in the following referred to as the zero mixing data scenario. It is more specific and resembles some of the situations encountered in our real data application. Instead of mixing two Gaussian distributions, we set the constant value 0 for 70 % of the observations in each mixed data set. The remaining 30 % are sampled from the Gaussian distribution with mean 3 and standard deviation 1. The corresponding second sample representing the simulated data consist of observations of the same Gaussian distribution entirely.
In this setting the method should again determine a shrinkage value s opt around 0.3 and propose a correction distribution putting most of its probability mass at 0. Both scenarios are replicated 1000 times for each of the sample sizes, respectively.
We study the running time of Algorithm 1 by comparing it to a simpler demixing approach. The algorithm, called binary search procedure in the following, determines the optimal shrinkage factor s opt relying only on the binary search. In contrast to Algorithm 1, the Shrink-Down and Push-Up steps are not conducted. Keep in mind that both steps are in principle not necessary to obtain the correct solutions to Problem 1 and 2 but are supposed to accelerate the computation. Thus, the determined s opt and H opt are identical for both methods, but the running times differ. Table 1 shows the results for both data cases described above. In the Gaussian mixture setting we list the determined shrinkage factors s opt as well as the mean and standard deviation of samples of size 10,000 drawn from the determined correction distribution H opt for each sample pair. The second half of the table corresponds to the zero mixing data scenario. In addition to the determined shrinkage factors s opt we give the probability which is assigned to the value 0 by the determined distribution functions H opt . The corresponding running times for both algorithms are presented in Fig. 1 for the Gaussian mixture case in seconds. Thereby, the time needed for precomputing the empirical distribution functions is not included. For the sake of presentation we do not show the running times for the two largest sample sizes n 1 = n 2 = 50,000 and n 1 = n 2 = 100,000. These were 1.18 and 2.47 s, respectively, for Algorithm 1 and 6.2 and 12.33 s, respectively, for the binary search procedure. We also omit the running times for the zero mixture case, which are essentially the same as in the Gaussian mixture case. All results are averages over 1000 repetitions.
The results for both simulations suggest that demixing leads to an overestimation of the expected mixing proportion 0.3, which decreases as the sample size increases. This is not surprising, since by definition s opt is the maximal shrinkage factor such that the corresponding mixture lies in the confidence band. Therefore, as the sample size grows, the radius of the confidence band becomes smaller and hence s opt converges towards s. The estimated mean and standard deviation in the Gaussian mixture case behave similarly approaching 3 and 1, respectively. In the zero mixing data scenario even for small sample sizes an overwhelming majority of the probability mass in H opt is assigned to the value 0. This is correct, since by construction the differences between the sample pairs are caused by the zero values only. Thus, the correction distributions proposed by the methods reflect the discrepancies between the sample pairs quite well. The running time for both algorithms grows linearly in the sample size, given sorted input data, and is by a factor of approximately 6 smaller for Algorithm 1 than for the binary search procedure. This holds for both data cases and shows that the Shrink-Down and Push-Up steps lead to large savings in computation time and are therefore very valuable for large data sets.
In Fig. 2 we illustrate the output in the Gaussian mixture case for n 1 = n 2 = 1000. In the upper row, kernel density estimations of the provided samples are presented. Demixing the samples using Algorithm 1 leads to the shrinkage factor s opt = 0.39, which is a reasonable approximation of the true mixture proportion s = 0.3. Using the approach explained above we generate a third sample with 10,000 observations from the correction distribution characterised by H opt . Its mean 3.3 and standard deviation 0.81 are also close to the desired values 3 and 1, respectively. The corresponding kernel density estimation shown on the right in the lower row is almost symmetrical and unimodal. Hence, the correction distribution represents the deviation between the underlying distributions of the first and the second sample quite well. The final mixture distribution proposed by Algorithm 1, which is the sum of the weighted distribution of the second sample and the weighted correction, is given by the corresponding estimated density in the lower left corner. The curve resembles the one of the first sample as desired.

Estimated shrinkage factors under the null hypothesis
Under the null hypothesis H 0 , the analysed samples stem from the same distribution. In this situation, the Kolmogorov-Smirnov test rejects by mistake in about an α-fraction of the cases, where α is the predefined significance level. In these cases, a reasonable procedure comparing the samples in the mixture framework should recognise their similarity. Thus, a shrinkage factor near 1 is desirable after a false rejection of the null hypothesis.
To check the performance of our method under H 0 , dataset pairs are generated for the sample sizes n 1 = n 2 = 100, 500, 1000, 5000, 10,000. All samples stem from the standard Gaussian distribution. Other distributions like the exponential and the tdistribution were also considered and led to comparable results. For each sample size, dataset pairs are simulated until the Kolmogorov-Smirnov test rejects in 1000 cases. These 1000 dataset pairs are passed to Algorithm 1, which determines corresponding shrinkage factors. These are presented via boxplots in Fig. 3. All of them are less than 1 by construction. As the results show even for small sample sizes the majority of shrinkage values are greater than 0.9. Increasing the sample size further reduces the amount of small shrinkage values. Thus, our method performs as desired: If no modifications are actually necessary, the algorithm proposes to perform none or only small modifications to the current samples.

Application to real data from astrophysics
We now illustrate our Algorithm 1 applying it to the astrophysical problem mentioned in the introduction. More precisely, we consider simulated proton data and compare it to observations recorded by the gamma ray detectors MAGIC-I and MAGIC-II. The latter are almost completely induced by protons. Both datasets consist of 5 000 observations and contain 54 continuous attributes that we work with. Among other features, these variables mainly include characteristics of the recorded atmospheric signals and their reconstructed trajectory and are identical for both data sets. The results of our method allow to determine attributes which differ the most for simulated protons and observed data and to quantify their discrepancies. This information can subsequently be used to improve the background simulation.
The Kolmogorov-Smirnov test comparing the real data and the simulation rejects the null hypothesis of identical distributions for all but two attributes. However, 37 of the 54 attributes have shrinkage factors above 0.85, which indicates a suitable proton simulation overall. The upper row of Fig. 4 provides kernel density estimations for the observed and simulated data for the attribute Length1, which describes the length of an ellipse fitted to an atmospheric signal measured by the MAGIC-1 detector. The Kolmogorov-Smirnov test for Length1 rejects the null hypothesis of equal distributions of observed and simulated data leading to a comparably low shrinkage factor of 0.75. Therefore, the simulation of this variable might be inadequate and the corresponding simulation steps seem to be worth inspecting in more detail. In the lower right corner a kernel density estimation is presented for the correction distribution characterised by H opt . It is based on 10 000 observations generated by the The coarse form of the density estimations for the observed data and the simulation in the upper row is quite similar showing one major peak around 25. However, there are some slight discrepancies. Compared to the real data curve, the main peak of the simulation is considerably higher. While the curve for the real data has a plateau around 90, we have a steadily falling curve for the simulation. Although these differences are not very large, it is quite unlikely that they are induced by the sample variance due to the large sample sizes. In order to verify this hypothesis, we have conducted several simulations considering kernel density estimators for a broad class of distributions using 5 000 observations in each sample. The dissimilarities in these simulations were much smaller than for the Length1 attribute supporting the conjecture that the Kolmogorov-Smirnov test correctly rejected the null hypothesis.
In order to correct the simulated sample, one should obviously generate less observations around 25 and more around 90. This is exactly what is proposed by the corrected distribution presented in the lower right corner of Fig. 4, which reflects the Length1 distribution of non-proton events. The corresponding density curve based on the estimated H opt has a peak near 25, but also another one of comparable height and greater width near 90. Therefore, it gives the region around 90 about as much weight as the one around 25, in contrast to the simulated sample.
Combining the simulated and the correction distributions in the proportions determined by the algorithm, we get the density curve of the final mixture presented in black in the lower left plot of Fig. 4. It resembles the density estimation for the observed data above it quite well. On the one hand, the height of the main peak is corrected, which is achieved by the shrinking. On the other hand, the required plateau is introduced to the mixture by the correction density.

Application to real data from bioinformatics
Algorithm 1 is applied to evaluate so called ion mobility spectrometry (IMS) measurements that help to detect volatile organic compounds in the air or in exhaled breath. Motivated by the need to process such measurements in real-time as they arrive oneby-one, it is a usual approach to find and annotate major peaks in the data. In this way the original information is summarised in a compressed representation. In an effort to automate and speed-up the computations, D' Addario et al. (2014) propose to approximate the measurements by finite mixtures of probability density functions, whose parameters are estimated using a variant of the EM algorithm. The computations are performed on a sequence of measurements leading to a two dimensional problem, where both dimensions are modelled independently by mixtures of inverse Gaussian densities.
Focusing on one of the dimensions and conditioning on the other, we are given 6000 spectrograms consisting of 12,500 data points each, stemming from 10 minutes of IMS measurement (cf. Kopczynski et al. 2012). In this data, we identify 187 groups of spectrograms belonging to the same peak models, respectively. We take samples of size 1000 generated from each spectrogram and the corresponding mixture model given from the bioinformatics algorithms. Both of these are regarded as probability density functions up to some normalising constants. In order to evaluate their models, we apply our algorithm at a significance level of five percent. Hereby we inspect the discrepancies to the samples taken from their corresponding spectrograms.
In general our algorithm suggests that the models fitted by the bioinformaticians approximate their spectrograms reasonably well, since in 152 of the 187 groups the mean shrinkage factor for the spectrograms is above 0.8. In addition, we identify some interesting groups of spectrograms. The shrinkage factors of two of these are shown in Fig. 5. Keep in mind that the spectrogram index represents the second dimension of the data we condition on. In both groups the model in the second dimension consists of a single inverse Gaussian density.
Our algorithms' results for group A suggest that the first half of the measurements are modelled quite well by the bioinformaticians' EM algorithm, but for increasing spectrogram indices the approximation is getting worse. This shows that the bioinformatics model in the second dimension is not appropriate. Instead of a single inverse Gaussian density, two components would probably lead to better approximations, since they allow to model both halves of the spectrograms with a density function, respectively. In contrast to that, the shrinkage factors for group B indicate a sufficient number of components used in the second dimension. For the spectrograms in the middle we have shrinkage factors close to one. This means that their corresponding models are close to the spectrograms. However, going to the left and right borders, the spectrograms seem to be fitted quite badly since the shrinkage factors are lower than 0.2. The two leftmost and two rightmost models are a little closer to their spectrograms with shrinkage values between 0.4 and 0.6. Taking the models of Kopczynski et al. (2012) into account this indicates that their fitted density mixture might be too wide or too narrow in the second dimension. Thus, the approximation could be substantially improved by excluding the two spectrograms on both margins, respectively, from this group and treating them by further models.
We also illustrate the method using a single spectrogram from the data set. The upper row of Fig. 6 provides a kernel density estimation for the measurement 1157 and its model. Since all four plots are given on the same scale, the two peaks in the model are more narrow and differ much more in height than the ones in the original data. In addition, the peak on the left is missing. Although it looks small in this scale, it appears noteworthy when compared to the other two. In the second row on the right a kernel estimation for the correction distribution characterised by H opt is presented. It is based on 10 000 observations generated by the sampling approach described on page 22. As expected, the correction distribution puts mass on the very right peak in order to fix the height proportions between the peaks on the right. In addition it generates the left peak missing in the model. The plot in the lower left corner shows the estimations of the modelled and the correction distribution weighted by the determined shrinkage value 0.76 and the remaining mass 0.24, respectively, as well as the kernel estimation for the final mixture, which is the sum of the weighted estimations. The proposed mixture is still somewhat narrow, but the proportions of the peak heights as well a the small peak are better represented compared to the original model.

Conclusion
This article deals with the nonparametric two sample homogeneity problem. A widelyused tool to test the equality of the distributions corresponding to the two samples is the Kolmogorov-Smirnov test. We develop an algorithm which, in case of a rejection by this test, determines how the first sample should be modified to resemble the second one in the Kolmogorov-Smirnov sense. This modification is quantified by an empirical correction distribution and the corresponding proportion determined by our method. Combining the information of the first sample with this correction in the proportion identified leads to an appropriate mixture. Its Kolmogorov-Smirnov distance to the second sample is so small that the test would not reject the hypothesis of equal distributions any more. Our method is especially of interest as an assisting tool in the task of designing an adequate simulation modelling an observed data generating process. Comparing a sample from an existing simulation based on domain specific prior knowledge and a sample of observed data, our algorithm determines a correction distribution. The information provided therein may subsequently be used to improve the current simulation. Since the procedure is completely nonparametric, it is widely applicable and in particular not only useful in the settings considered in our simulations. The algorithm proceeds in an iterative manner applying several correction steps linked with a modified binary search technique. The constructed distribution function is shown to be optimal in a reasonable sense and the running time of the algorithm is proved to be of linear order on sorted data. The Shrink-Down and Push-Up steps applied in addition to the standard binary search algorithm lead to large savings in computation time. In our experience it converges in three iterations in the majority of all cases independent of the sample sizes. The algorithm proposes none or only slight corrections in cases where both datasets stem from the same distributions. The correction distributions proposed in simulations as well as for a real data example are intuitive and adequate.
There are several possibilities to extend the presented ideas in future work. On the one hand, instead of focusing on distribution functions, a density based approach to the demixing problem could also be of interest. Working with densities is often even more intuitive than using distribution functions and there exists a broad literature on mixture models dealing with density estimation (e.g. Schellhase and Kauermann 2012). However, as pointed out in the introduction, transferring the two sample problem to the density framework is not straightforwardly achieved by applying the existing techniques and may be computationally more demanding, so that much work has to be done here. On the other hand, one could use alternative test procedures for distribution functions besides the Kolmogorov-Smirnov test to construct the confidence bands. Although the Kolmogorov-Smirnov test is quite popular, competitors like the Anderson-Darling and the Cramér von Mises test detect differences between two distributions more often in certain settings (cf. Razali and Wah 2011) and could thus lead to better demixing results. In this work we focused on the Kolmogorov-Smirnov test since the simple shape of the corresponding confidence band allows for finding an efficient algorithm solving the demixing problem. The extension to analytically more sophisticated distance measures where our proofs do not carry over in a straightforward manner is a challenging and promising open problem for future research. Another direction is to generalize our method to distributions over multi-dimensional domains based on appropriate extensions of the Kolmogorov-Smirnov test. Several multivariate versions of the test are surveyed by Lopes et al. (2007) and could serve as a starting point towards extending our method to the multivariate setting.