Abstract
Providing access to synthetic microdata in place of confidential data to protect the privacy of participants is common practice. For the synthetic data to be useful for analysis, its density function must closely approximate that of the confidential data. Hence, accurately estimating the density function from sample microdata is important. Existing kernel-based, copula-based, and machine learning methods of joint density estimation may not be viable when access to confidential microdata is restricted. Applying the multivariate moments’ problem to sample-based density estimation has long been considered impractical, owing to the computational complexity and the intractability of optimal parameter selection for the density estimate when the true joint density function is unknown. This paper introduces a generalised form of the sample moment-based density estimate, which can be used to estimate joint density functions when only the information of empirical moments is available. We demonstrate optimal parametrisation of the moment-based density estimate based solely on sample data by employing a computational strategy for parameter selection. We compare the performance of the moment-based estimate to that of existing nonparametric and parametric density estimation methods. The results show that using empirical moments can provide a reasonable, robust nonparametric approximation of a joint density function that is comparable to existing nonparametric methods. We provide an example of synthetic data generation from the moment-based density estimate and show that the resulting synthetic data provides a reasonable disclosure-protected alternative for public release.
1 Introduction
Disseminating synthetic data in place of confidential micro-level data can be an efficient way to protect the privacy of survey and clinical trial participants. How well a set of synthetic data represents the statistical information in a confidential data source depends on how closely the density function of the synthetic data matches that of the confidential data. Therefore, a reliable and robust method for estimating the underlying density function is essential to the synthetic data generation process.
Many existing nonparametric density function estimation methods, including kernel density estimation, rely heavily on local area estimation. The low-density local areas used to construct the density function estimate of a sample space may contain very few sample observations. The values of the sample observations in those areas are more easily identified or reconstructed from the constructed density function estimate, posing a greater risk to privacy and confidentiality. Data agencies can offset the risk of disclosure by defining local areas of greater size (larger bandwidth) or through perturbation, but this may adversely impact statistical utility. Methods relying on local area estimation may therefore be unsuitable when confidential data is involved. Existing parametric maximum likelihood and mixture distribution approaches often lack sufficient robustness to estimate atypical distributions effectively. Researchers tend to favour kernel estimators, maximum likelihood estimators, or, increasingly, nonparametric machine learning techniques (Izenman 1991; Wang and Scott 2019). While highly effective in varying contexts, these methods have limitations, especially when dealing with atypical or incomplete data. Moreover, when a data agency restricts access to micro-level data, data-driven nonparametric approaches become untenable.
This paper proposes a computational approach for estimating a joint density function of continuous data. The estimated density function is expressed as a finite linear combination of orthogonal polynomial functions and relies only on sample moments. Without knowledge of the joint density function underlying the continuous data, this paper develops a feasible computational strategy that searches for the appropriate maximum order of polynomials used in the estimated density function. We show that the moment-based density approach achieves a compromise between nonparametric and parametric methods, offering a nonparametric global estimate that is sufficiently robust. We also provide an example application of moment-based density estimation in producing synthetic continuous microdata from confidential microdata for disclosure protection.
Although confidential data is our primary motivation, the application of moment-based density estimation is not limited to such use cases. Moment-based density estimation has interesting applications in imputation for missing data more broadly and in the estimation of continuous density functions from tabular data. Furthermore, there have been recent developments in applying moment-based density approximations in stochastic models for finance (Stoyanov 2016) and image reconstruction (Mustapha and Dimitrakopoulos 2010).
Constructing unique polynomial-based series expansions for continuous functions is common practice in a range of areas of mathematics, physics, engineering, and statistics (Munkhammar et al. 2017). Polynomial-based expansions have been used to approximate probability density functions on both closed and infinite domains (Elderton and Johnson 1969; Solomon and Stephens 1978; Kendall et al. 1987; Reid 1988; Provost 2005). Known more commonly as the moments’ problem, this area of study dates back as far as the 19th century (Gram 1883). The advent of greater computational power has increased the usage of polynomial expansions. Currently, however, polynomial-based expansions are rare in multivariate density estimation applications compared to their use in univariate estimation. As noted by Kleiber and Stoyanov (2013), results for the multivariate case tend to be scattered throughout specialised journals and focus on an analytical perspective.
This paper focuses on distributions with bounded marginal domains to ensure determinacy. Determinacy, and more specifically moment determinacy, refers to the existence of a unique moment-based expansion (Kleiber and Stoyanov 2013). This paper requires no specific choice of the associated reference joint density (weight function). The restriction to a bounded domain is not necessary if moment determinacy can be assured, and all results presented in this paper can be generalised to unbounded domains in certain situations. Unbounded reference joint densities shown in the literature to have moment determinacy include the normal and gamma distributions (Provost 2005). Furthermore, as demonstrated in Lin (2014), which applies a univariate form of moment-based density estimation to recover information from perturbed data, assuming a compact domain can yield a reasonable estimate of density functions that exist on an unbounded domain, provided the sample size is sufficiently large.
This paper is structured in six sections. Section 2 provides the analytical results required to construct the momentbased density estimate. Section 3 considers a measure of goodness of fit for the momentbased density estimate and outlines a samplebased strategy for parameter selection. In Sect. 4, we perform a simulation study to compare the performance of the momentbased density estimation method with existing joint density estimation methods. In Sect. 5, we outline a process of simulating synthetic microdata from the momentbased density estimate and demonstrate this application to hypothetical confidential data. We conclude our paper in Sect. 6.
2 The multivariate moment-based density estimate
Throughout the literature, researchers have mainly focused on the usefulness and effectiveness of particular orthogonal polynomial bases for estimating certain distributions, selecting classical polynomial families as the basis for the expansion (Gram 1883; Edgeworth 1905; Charlier 1914; Szegő 1939; Cramer 1946; Alexits 1961; Kendall et al. 1987; Provost 2005). These classical polynomials are usually chosen based on the resemblance of their weight functions (the derivative of the positive Borel measure with respect to which the polynomials are orthogonal) to the canonical distributions. Orthogonal polynomials for which a density approximation has been derived in the literature include the Hermite (Edgeworth and Gram–Charlier expansions), Legendre, Jacobi, and Laguerre polynomials (Gram 1883; Edgeworth 1905; Charlier 1914; Szegő 1939; Cramer 1946; Alexits 1961; Kendall et al. 1987; Provost 2005). They have weight functions resembling the normal, uniform, beta, and gamma distributions, respectively. These polynomials can all be used to provide moment-based expressions of certain density functions and are eigenfunctions of a hypergeometric-type second-order differential equation (Sánchez-Ruiz and Dehesa 1998).
Multivariate orthogonal polynomial expansions, particularly for the Hermite, Jacobi, and Laguerre polynomials, have been considered in the literature (Xu 1997; Griffiths and Spano 2011). Joint density expressions for quasi-multivariate normal distributions have also been developed using a Hermite polynomial-based expansion (Mehler 1866; Withers and Nadarajah 2010). This section derives a generalised form of the multivariate density estimate and the fundamental mathematical results required to apply this density estimate in practice. This form covers many existing approaches and allows for greater flexibility in the choice of reference density and the resulting orthogonal polynomials used.
2.1 The orthogonal polynomial basis
Initially, we only consider the orthonormal basis for the univariate case in a one-dimensional space \(\Omega = [a,b]\). Later this basis is expanded to the N-dimensional compact space \({\Omega }^{(N)} = \prod _{k=1}^N [a_k, b_k]\). Let \( f_{\nu }\) denote the probability density function of a random variable \(\nu \) with domain the closed interval \(\Omega = [a,b]\). We refer to \( f_{\nu }\) as the reference distribution function; this function will serve as the weight function of the orthonormal basis.
2.1.1 Univariate polynomial basis
Regarding the function \(f_{\nu }\), the results of Sánchez-Ruiz and Dehesa (1998) can be summarised in the following theorem.
Theorem 1
If there exist polynomial functions \(\sigma (x)\) and \(\tau (x)\) of degree no greater than 2 and 1, with coefficients \(\phi _0,\phi _1,\phi _2\) and \(\psi _0,\psi _1\) respectively, for \(f_{\nu }\), where,
$$\begin{aligned} \sigma (x) = \phi _0 + \phi _1 x + \phi _2 x^2 \quad \text {and} \quad \tau (x) = \psi _0 + \psi _1 x, \end{aligned}$$(1)
such that
$$\begin{aligned} \frac{d}{dx} \left[ \sigma (x) f_{\nu }(x) \right] = \tau (x) f_{\nu }(x), \end{aligned}$$(2)
with boundary conditions, for \(i = 0,1,\ldots \)
$$\begin{aligned} \lim _{x \rightarrow a^{+}} \sigma (x) x^{i} f_{\nu }(x) = \lim _{x \rightarrow b^{-}} \sigma (x) x^{i} f_{\nu }(x) = 0, \end{aligned}$$(3)
then there exists a set of eigenfunction solutions \(\{P_n(x)\}_{n \in \mathbb {N}_0}\), where \(\mathbb {N}_0\) denotes the set of all natural numbers including 0, of the hypergeometric-type differential equation,
$$\begin{aligned} \sigma (x) y''(x) + \tau (x) y'(x) = \lambda y(x). \end{aligned}$$
\(P_n(x)\) denotes the nth eigenfunction solution corresponding to \(\lambda _n = n\psi _{1} + n(n-1)\phi _{2}\) and has the following properties:

(i)
$$\begin{aligned} \lambda _n P_n(x)f_{\nu }(x) = \frac{d}{dx} \left[ \sigma (x)P_n'(x)f_{\nu }(x) \right] . \end{aligned}$$

(ii)
\(P_n(x)\) is a polynomial of degree n with Rodrigues’ formula of the form
$$\begin{aligned} P_{n}(x) = \frac{B_{n}}{f_{\nu }(x)} \frac{d^{n}}{dx^{n}} \left[ (\sigma (x))^{n} f_{\nu }(x) \right] , \end{aligned}$$where \(B_{n} \in \mathbb {R}\) is the normalising constant in the Rodrigues’ formula and can be chosen arbitrarily in order to show consistency with classical orthogonal polynomial families.

(iii)
Given \(B_n\) of the Rodrigues’ formula is defined as follows,
$$\begin{aligned}{} & {} B_0 = 1\quad \text {and} \nonumber \\{} & {} B_{n} = \sqrt{ \frac{(-1)^n \prod _{i=0}^{n-1} \left[ \frac{1}{ \psi _1 + (n + i - 1)\phi _{2} }\right] }{ n! \int _{a}^{b} (\sigma (x))^{n}f_{\nu }(x) dx} }, \quad \text {for}\; n = 1,2,\dots , \nonumber \\ \end{aligned}$$(4)then,
$$\begin{aligned} \int _{a}^{b} P_{n}(x) P_{m}(x) f_{\nu }(x) dx = \delta _{n,m}, \end{aligned}$$where \(\delta _{n,m}\) denotes the Kronecker delta, and \(n, m \in \mathbb {N}_0\). Hence \(\{P_n(x)\}_{n \in \mathbb {N}_0}\) are orthogonal polynomials.
Theorem 1 shows that, provided an appropriate quadratic polynomial \(\sigma (x)\) and an appropriate linear polynomial \(\tau (x)\) can be found for a given reference density \(f_\nu \), we can obtain the aforementioned set of polynomials. Theorem 1 makes no statement about the domain of the underlying distribution and can therefore be applied to reference distributions defined on both bounded and unbounded domains. Many of the standard probability densities used in statistics meet this criterion as a reference density. Table 1 provides examples of \(\sigma \) and \(\tau \) functions for some of these densities and the connection of the resulting polynomials to the classical orthogonal polynomial families.
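As a concrete illustration (ours, not part of the paper's derivation), the following sketch constructs the orthonormal polynomials for a uniform reference density on \([0,1]\), which correspond to normalised shifted Legendre polynomials, and checks orthonormality numerically; all function names are our own.

```python
import numpy as np
from numpy.polynomial import legendre as L

def orthonormal_poly(n):
    """Orthonormal polynomial of degree n with respect to the uniform
    reference density on [0, 1]: P_n(x) = sqrt(2n + 1) * L_n(2x - 1),
    where L_n is the classical Legendre polynomial."""
    c = np.zeros(n + 1)
    c[n] = 1.0
    return lambda x: np.sqrt(2 * n + 1) * L.legval(2 * np.asarray(x) - 1, c)

# Numerical check of orthonormality: int_0^1 P_n(x) P_m(x) f_nu(x) dx = delta_{n,m}
u, w = np.polynomial.legendre.leggauss(50)   # Gauss-Legendre nodes on [-1, 1]
x, wx = (u + 1) / 2, w / 2                   # rescaled to [0, 1]
p2, p3 = orthonormal_poly(2), orthonormal_poly(3)
print(np.isclose(np.sum(wx * p2(x) * p3(x)), 0.0))  # True (orthogonality)
print(np.isclose(np.sum(wx * p3(x) * p3(x)), 1.0))  # True (normalisation)
```

Any sufficiently accurate quadrature rule would do here; Gauss–Legendre is convenient because it integrates these polynomial products exactly.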
In the following, we list some properties of the orthogonal polynomials without specifying the distribution of the reference function \(f_{\nu }\). These properties will benefit the computational approach for momentbased joint density function estimation later developed in this paper.
Property 1
Regarding the polynomial function \(P_n(x)=\sum _{i=0}^{n}a_{n,i}x^i\), the coefficient \(a_{n,i}\) for fixed \(n \in \mathbb {N}_0\) is given by,
$$\begin{aligned} a_{n,n} = \kappa _{n} \quad \text {and} \quad a_{n,i} = \frac{ (i+1)\left( \psi _0 + i\phi _1 \right) a_{n,i+1} + (i+1)(i+2)\phi _0 \, a_{n,i+2} }{ \lambda _n - i\psi _1 - i(i-1)\phi _2 }, \quad i = n-1,\dots ,0, \end{aligned}$$(5)
where \(a_{n,n+1} = 0\), \(\kappa _{n} = B_{n} \prod _{i=0}^{n-1} \left[ \psi _1 + (n + i - 1)\phi _{2} \right] \) and \(\kappa _{0} =1\).
We note that \(\sigma (x)\) and \(\tau (x)\) are polynomials of degree no greater than 2 and 1, with coefficients \(\phi _0,\phi _1,\phi _2\) and \(\psi _0,\psi _1\) respectively, as in (1).
Property 1 provides the Frobenius solutions to the polynomial coefficients for each of our polynomial terms. We note that this solution also applies to reference densities on the unbounded domain and can be used to derive the coefficients of all the classical polynomial expressions given in Table 1.
Denote by \(\mu _{\nu }\) a measure on (\(\Omega ,\mathscr {F}\)) with \( \mu _{\nu }(A) = \int _{A} f_{\nu }(x) dx\) for \(A \in \mathscr {F} \), where \(\mathscr {F}\) is a \(\sigma \)-field on \(\Omega \).
Property 2

(i)
The integral of polynomial \(P_n(x)\) with respect to the measure function \(\mu _\nu \) is
$$\begin{aligned} \int P_n(x) d\mu _\nu&= \int P_n(x) f_{\nu }(x) dx \nonumber \\&={\left\{ \begin{array}{ll} \frac{\sigma (x)}{\lambda _n}P_n'(x)f_{\nu }(x) + c, &{} n = 1 ,2, \dots \\ F_{\nu }(x) + c, &{} n =0 \end{array}\right. }, \end{aligned}$$(6)where \(c \in \mathbb {R}\) is some constant and \(F_{\nu }\) is the cumulative distribution function of \(f_\nu \).

(ii)
$$\begin{aligned} \lim _{t\rightarrow a} \sigma (t)P_n'(t)f_{\nu }(t) = 0. \end{aligned}$$
Property 2 expresses the indefinite integral of the orthogonal polynomials with respect to the reference density. This expression is essential in the practical application of the density estimate, as it allows for the analytical derivation of the distribution functions associated with the multivariate density estimate. However, due to length constraints, we do not give the expressions of all the associated distribution functions in this paper. Property 2 will be used in Sect. 5 to derive the expression of the moment-based conditional distribution function estimate. One of the key benefits of the polynomial-based approach to density estimation is the analytical flexibility and applicability of the resulting density estimate. Provided the reference density is continuous and differentiable, then so too is the resulting moment-based density expression, which is therefore more easily applied in inversion and optimisation problems.
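Property 2(i) can be checked numerically. The sketch below is our own illustration, assuming a uniform reference density on \([0,1]\), for which \(\sigma (x) = x(1-x)\), \(\psi _1 = -2\), \(\phi _2 = -1\), and hence \(\lambda _n = -n(n+1)\); it compares the definite integral of \(P_n\) over \([0, x_0]\) against the closed form from Property 2, whose antiderivative vanishes at the lower boundary by Property 2(ii).

```python
import numpy as np
from numpy.polynomial import legendre as L

n = 3
c = np.zeros(n + 1)
c[n] = 1.0

def P(x):   # orthonormal polynomial w.r.t. the uniform density on [0, 1]
    return np.sqrt(2 * n + 1) * L.legval(2 * x - 1, c)

def dP(x):  # its derivative (chain rule: inner derivative is 2)
    return 2 * np.sqrt(2 * n + 1) * L.legval(2 * x - 1, L.legder(c))

sigma = lambda x: x * (1 - x)        # sigma(x) for the uniform reference density
lam = n * (-2) + n * (n - 1) * (-1)  # lambda_n = n*psi_1 + n(n-1)*phi_2 = -n(n+1)

# Property 2(i): int_0^{x0} P_n(t) f_nu(t) dt = sigma(x0) P_n'(x0) f_nu(x0) / lambda_n
x0 = 0.7
u, w = np.polynomial.legendre.leggauss(60)
t, wt = (u + 1) / 2 * x0, w * x0 / 2          # quadrature on [0, x0]
lhs = np.sum(wt * P(t))
rhs = sigma(x0) * dP(x0) / lam
print(np.isclose(lhs, rhs))   # True
```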
Property 3
(Szökefalvi-Nagy 1965) \(\{P_n\}_{n \in \mathbb {N}_0}\) forms a complete orthogonal basis in the \(L_2\) space \(L_2(\Omega ,\mu _\nu )\).
Therefore, the polynomials \(\{P_n\}_{n \in \mathbb {N}_0}\) can serve as basis functions when expanding any continuous function in the \(L_2\) space \(L_2(\Omega ,\mu _\nu )\). Given this property, we extend the complete orthogonal basis to a multivariate space in the following section.
2.1.2 A multivariate polynomial basis
Consider an N-dimensional compact space given by \({\Omega ^{(N)}} = \prod _{k=1}^N [a_k,b_k]\). From this point on, we consider orthogonal polynomial bases on this space.
Let \( f_{\nu } = \prod _{k=1}^N f_{\nu _k} \), where \(f_{\nu _k}\) is the probability density function of a random variable \(\nu _k\) with domain the closed interval \(\Omega _k = [a_k,b_k]\), \(k=1,\dots ,N\), and the random variables \(\{\nu _k\}_{k=1,\dots ,N}\) are mutually independent. We refer to \( f_{\nu }\) as the reference joint density function; this function will serve as the multivariate weight function in our orthonormal basis. We note that \(f_{\nu }: {\Omega ^{(N)}} \rightarrow \mathbb {R}\) is the joint density function of the random vector \(\nu =(\nu _1, \ldots , \nu _N) \).
Of interest is the \(L_2\) space \(L_2({\Omega ^{(N)}},\mu _{\nu })\) with measure function \(\mu _{\nu }\) defined by
$$\begin{aligned} \mu _{\nu }(A) = \int _{A} f_{\nu }(\textbf{x}) \, d\textbf{x}, \quad A \in \mathscr {F}, \end{aligned}$$(7)
where \(\mathscr {F}\) is the \(\sigma \)-field of \(\Omega ^{(N)}\). By taking the product of the univariate orthogonal polynomials described in Sect. 2.1.1 we obtain a set of multivariate orthogonal polynomials defined on \({\Omega ^{(N)}} = \prod _{k=1}^N [a_k,b_k]\). These multivariate polynomials are given as follows.
Definition 1
Let \(\{P_{k, n_k}\}_{n_k \in \mathbb {N}_0}\) denote the set of hypergeometric-type polynomials given by the density function \(f_{\nu _k}\), \( k=1, \dots , N\). The multivariate hypergeometric-type polynomial \(P_{n_1,\dots ,n_N}(\textbf{x})\) associated with \(f_{\nu }\) is given as the product of N hypergeometric-type polynomials \(P_{k,n_k}\) of degree \(n_k\) for each \( k = 1,\dots ,N\), that is
$$\begin{aligned} P_{n_1,\dots ,n_N}(\textbf{x}) = \prod _{k=1}^{N} P_{k,n_k}(x_k) = \prod _{k=1}^{N} \frac{B_{k,n_k}}{f_{\nu _k}(x_k)} \frac{d^{n_k}}{dx_k^{n_k}} \left[ (\sigma _k(x_k))^{n_k} f_{\nu _k}(x_k) \right] , \end{aligned}$$(8)
where \(\sigma _k\) and \(\tau _k\) refer to the polynomials described in (2) for the kth hypergeometric-type polynomial, and \(B_{k,n_k}\) refers to the kth normalising constant described in (1) and (4).
We can subsequently leverage the construction of the multivariate polynomial as a product of marginal univariate polynomials to decompose the multivariate polynomial into its marginal components. In doing so, the completeness and orthogonality of the univariate marginal polynomials ensure that the multivariate polynomials defined in Definition 1 are themselves orthogonal and complete in the N-dimensional space. As such, these multivariate polynomials form a complete orthonormal polynomial basis for the N-dimensional \(L_2\) product space \(L_2(\Omega ^{(N)},\mu _{\nu })\). This statement is given more formally in Theorem 2.
Theorem 2
The multivariate hypergeometrictype polynomials are orthonormal on the compact space \({\Omega ^{(N)}} = \prod _{k=1}^N \Omega _k\) and complete in the \(L_2\) space \(L_2({\Omega ^{(N)}},\mu _{\nu })\).
A detailed proof of Theorem 2 is given in “Appendix A”.
2.2 Multivariate momentbased density function
Let \(\textbf{X} = (X_1,\dots ,X_N)\) be a random vector with joint density function \(f_\textbf{X}\) on \({\Omega ^{(N)}}\). The joint density function \(f_\textbf{X}\) can be uniquely expanded into a polynomial expression, multiplied by the reference density \(f_\nu (\textbf{x})\), by use of the orthonormal basis functions \(\{ P_{n_1,\dots , n_N}(\textbf{x}) \}_{n_1,\dots ,n_N \in \mathbb {N}_0} \) on the compact space \(\Omega ^{(N)}\). The resulting density expression has polynomial coefficients that are functions of the underlying moments of the distribution. This form of the joint density is referred to in this paper as the multivariate moment-based density (MBD) expression.
We begin with the statement of the theorem outlining the multivariate momentbased density expression.
Theorem 3
The Multivariate Moment-Based Density.
Let \(\textbf{X} = (X_1,\dots ,X_N)\) be a random vector with joint density function \(f_\textbf{X}\) on space \({\Omega ^{(N)}} = \prod _{k=1}^N [a_k, b_k]\). Assume \(\textbf{X}\) has finite moments \(\int _{\Omega ^{(N)}} x_1^{i_1} \dots x_N^{i_N}f_{\textbf{X}}(\textbf{x}) dx_1\dots dx_N \) for all \(i_1,\dots ,i_N \in \mathbb {N}_0\), and
\(f_{\textbf{X}} / f_{\nu } \in L_2({\Omega ^{(N)}},\mu _{\nu })\). Then \(f_{\textbf{X}}\) can be expressed as
$$\begin{aligned} f_{\textbf{X}}(\textbf{x}) = f_{\nu }(\textbf{x}) \sum _{n_1=0}^{\infty } \dots \sum _{n_N=0}^{\infty } C_{n_1,\dots ,n_N} P_{n_1,\dots ,n_N}(\textbf{x}), \end{aligned}$$(9)
where the coefficient terms \(C_{n_1,\dots ,n_N}\) are given by
$$\begin{aligned} C_{n_1,\dots ,n_N} = \sum _{i_1=0}^{n_1} \dots \sum _{i_N=0}^{n_N} a_{n_1,\dots ,n_N ; i_1, \dots , i_N} \, \mathbb {E}_{\textbf{X}}\left[ X_1^{i_1} \cdots X_N^{i_N} \right] , \end{aligned}$$(10)
where \(a_{n_1,\dots ,n_N ; i_1, \dots , i_N} = \prod _{k=1}^N a_{n_k,i_k}\) and \(a_{n_k,i_k}\) is the \(i_k\)th coefficient of the \(n_k\)th polynomial \(P_{k,n_k}(x_k)\).
Proof of Theorem 3 can be found in “Appendix B”.
Remark
We note \(\sum _{n_1=0}^\infty \dots \sum _{n_N=0}^\infty C_{n_1,\dots ,n_N}^2 = \mathbb {E}_{\textbf{X}} \left[ \frac{f_{\textbf{X}}(X_1,\dots ,X_N)}{f_{\nu }(X_1,\dots ,X_N)} \right] < \infty \) as \(f_{\textbf{X}} / f_{\nu } \in L_2({\Omega ^{(N)}},\mu _{\nu }).\) It also follows that \(C_{n_1,\dots ,n_N} \rightarrow 0\) as \(n_1,\dots ,n_N \rightarrow \infty \).
The expansion given in Theorem 3 relates to a joint density defined on a compact space. We imposed this restriction because, on unbounded spaces, we cannot guarantee moment determinacy. The polynomials given by (8) and applied in Theorem 3 may still be extended to unbounded spaces, such as in the case of \(f_{\nu _k}\), for any \(k=1,2,\dots ,N\), being defined by a normal (Hamburger case) or gamma (Stieltjes case) density. For these cases, moment determinacy can be assured by the completeness of the Hermite and Laguerre polynomial families (Provost 2005).
In the examples of these types of polynomials in the literature, it is often assumed that all marginal densities of the joint density being approximated share a common distributional family. Here we make no such assumption; we only require that the domain of our multivariate polynomials is consistent with that of the joint density, that \(f_{\textbf{X}} / f_{\nu } \in L_2({\Omega ^{(N)}},\mu _{\nu })\), and that all moments are finite, i.e. \(\mathbb {E}_{\textbf{X}} \left[ X_1^{i_1} \cdots X_N^{i_N}\right] < \infty \) for all \(i_1, \ldots , i_N\). This enables the correct estimation of joint densities whose variables have different domains, a common occurrence in real-world data.
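As an illustration of the coefficient expression in Theorem 3 in the univariate case (our own sketch, again assuming a uniform reference density on \([0,1]\) and a toy Beta-distributed sample), the coefficient computed as a linear function of raw moments, \(C_n = \sum _i a_{n,i}\mathbb {E}[X^i]\), agrees with the direct sample mean of \(P_n(X)\):

```python
import numpy as np
from numpy.polynomial import legendre as L
from numpy.polynomial import Polynomial

def monomial_coeffs(n):
    """Monomial coefficients a_{n,i} of the orthonormal polynomial
    P_n(x) = sqrt(2n + 1) * L_n(2x - 1) w.r.t. the uniform density on [0, 1]."""
    c = np.zeros(n + 1)
    c[n] = 1.0
    q = Polynomial(L.leg2poly(c))       # L_n as a power series in u
    comp = q(Polynomial([-1.0, 2.0]))   # substitute u = 2x - 1 (polynomial composition)
    return np.sqrt(2 * n + 1) * comp.coef

rng = np.random.default_rng(3)
x = rng.beta(2, 5, size=1000)           # toy sample on [0, 1]
n = 3
a = monomial_coeffs(n)
m = np.array([np.mean(x ** i) for i in range(n + 1)])  # empirical raw moments
C_from_moments = a @ m                  # C_n as a linear function of the moments

c = np.zeros(n + 1)
c[n] = 1.0
C_direct = np.mean(np.sqrt(2 * n + 1) * L.legval(2 * x - 1, c))
print(np.isclose(C_from_moments, C_direct))  # True
```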
The hypergeometric polynomial-based expansion of a joint density function is generally an infinite sum of polynomials. It is impracticable to evaluate this infinite sum to obtain the complete information of the joint density function. Truncating the expansion at a finite number of summation terms is one way to address this issue. The resulting function is named the finite order multivariate moment-based density, which is defined as follows.
Definition 2
The Finite Order Multivariate Moment-Based Density
The finite order multivariate moment-based density approximation of max order \(\textbf{K} = (K_1,\dots ,K_N)\) is defined as
$$\begin{aligned} f_{\textbf{X};\textbf{K}}(\textbf{x}) = f_{\nu }(\textbf{x}) \sum _{n_1=0}^{K_1} \dots \sum _{n_N=0}^{K_N} C_{n_1,\dots ,n_N} P_{n_1,\dots ,n_N}(\textbf{x}). \end{aligned}$$(11)
The Weierstrass–Stone approximation theorem provides that \(f_{\textbf{X};\textbf{K}}\) converges uniformly to \(f_{\textbf{X}}\) as the order \(\textbf{K}\) increases. However, using \(f_{\textbf{X};\textbf{K}}\) as an estimate of \(f_{\textbf{X}}\) is not practicable in real-life data analysis. One issue is the evaluation of the coefficients \(\{C_{n_1, \ldots ,n_N}\}\) in \(f_{\textbf{X};\textbf{K}}\). The coefficient \(C_{n_1, \ldots ,n_N}\) is a linear function of the joint moments of \(\textbf{X}\), so evaluating the coefficients requires knowledge of the density function \(f_{\textbf{X}}\), which is unknown. The following section introduces the finite order multivariate sample moment-based density function to address this problem.
2.3 Multivariate sample momentbased density estimation
We now consider an approximation of multivariate density functions based on sample moments.
Definition 3
Multivariate Sample Moment-Based Density Approximation
Given a random sample \(\{\textbf{X}_i\}_{i=1}^M =\{(X_{1,i},X_{2,i}, \dots , X_{N,i})\}_{i=1}^M\) of \(\textbf{X} = (X_1,X_2,\dots ,X_N)\), define
$$\begin{aligned} \hat{f}_{\textbf{X};\textbf{K}}\left( \textbf{x}; \{\textbf{X}_i\}_{i=1}^M \right) = f_{\nu }(\textbf{x}) \sum _{n_1=0}^{K_1} \dots \sum _{n_N=0}^{K_N} \hat{C}_{n_1,\dots ,n_N} P_{n_1,\dots ,n_N}(\textbf{x}), \end{aligned}$$(12)
where \(\textbf{K} = (K_1,\dots ,K_N) \) and \(\hat{C}_{n_1,\dots ,n_N} = (1 / M) \sum _{j=1}^M P_{n_1,\dots ,n_N}(X_{1,j},\dots ,X_{N,j}) \).
We propose using \(\hat{f}_{\textbf{X};\textbf{K}}(\textbf{x}; \{\textbf{X}_i\}_{i=1}^M )\) as an estimator of \(f_{\textbf{X}}\) and name it the sample momentbased estimator of \(f_{\textbf{X}}\) or the momentbased estimator of \(f_{\textbf{X}}\) if there is no ambiguity.
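A minimal end-to-end sketch of this estimator for a bivariate sample (our own illustration, assuming a uniform reference joint density on \([0,1]^2\), the corresponding shifted Legendre basis, and a toy Beta-distributed sample standing in for confidential data; all names are ours):

```python
import numpy as np
from numpy.polynomial import legendre as L

def P(n, x):
    """Orthonormal polynomial of degree n w.r.t. the uniform density on [0, 1]."""
    c = np.zeros(n + 1)
    c[n] = 1.0
    return np.sqrt(2 * n + 1) * L.legval(2 * np.asarray(x) - 1, c)

def fit_mbd(sample, K):
    """Estimated coefficients C_hat[n1, n2] = (1/M) * sum_j P_{n1,n2}(X_{1,j}, X_{2,j})
    for a bivariate sample on [0, 1]^2 with a uniform reference joint density."""
    x1, x2 = sample[:, 0], sample[:, 1]
    return np.array([[np.mean(P(n1, x1) * P(n2, x2))
                      for n2 in range(K[1] + 1)] for n1 in range(K[0] + 1)])

def mbd_density(C, x1, x2):
    """Evaluate the sample moment-based density estimate at (x1, x2);
    the uniform reference joint density on [0, 1]^2 is identically 1."""
    K1, K2 = C.shape
    return sum(C[n1, n2] * P(n1, x1) * P(n2, x2)
               for n1 in range(K1) for n2 in range(K2))

rng = np.random.default_rng(0)
sample = rng.beta(2, 5, size=(5000, 2))   # toy stand-in for a confidential sample
C = fit_mbd(sample, K=(4, 4))
print(float(mbd_density(C, 0.2, 0.2)))    # estimate near the joint mode
```

Note that \(\hat{C}_{0,0} = 1\) exactly, since \(P_{0,0} \equiv 1\), which guarantees the estimate integrates to one with respect to the reference measure.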
A significant issue when estimating a multivariate joint density function is computational complexity. The use of \(\hat{f}_{\textbf{X};\textbf{K}}(\textbf{x}; \{\textbf{X}_i\}_{i=1}^M )\) as an estimate of \(f_{\textbf{X}}\) has some computational advantages.
The marginal moment-based density estimates determined by integrating the joint moment-based density estimate are consistent with those obtained by applying the estimation directly to the corresponding marginal subset of the original sample. Computationally, we note that all coefficients of the various marginal distributions are elements of the set of joint density coefficients, meaning marginal density estimates are easily computed given the joint density coefficients. An essential factor for many density estimation schemes is the ability to adjust or update the estimate based on new data. Refitting density estimates in large-sample and high-dimensional scenarios can be highly time-consuming. The moment-based density estimate is somewhat resilient to this issue, as the data agency can introduce new information into the existing model by averaging a density estimate based solely on the latest data with the original density estimate.
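The updating step described above can be sketched as follows (a minimal illustration of ours): since each estimated coefficient is a sample mean of polynomial evaluations, a sample-size-weighted average of two coefficient arrays coincides with the coefficients a refit on the pooled sample would give.

```python
import numpy as np

def update_coefficients(C_old, M_old, C_new, M_new):
    """Sample-size-weighted average of two coefficient estimates. Because each
    estimated coefficient is a sample mean of P(X_j), this weighted average
    equals the coefficient estimate from a refit on the pooled sample."""
    return (M_old * C_old + M_new * C_new) / (M_old + M_new)

# Demonstration with a stand-in for P_n evaluated at sample points:
rng = np.random.default_rng(1)
z = rng.uniform(size=1000)
C_a, C_b = z[:600].mean(), z[600:].mean()
pooled = update_coefficients(C_a, 600, C_b, 400)
print(np.isclose(pooled, z.mean()))  # True
```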
In privacy-sensitive situations, micro-level information may not be available due to the risk of information disclosure. Local density estimation methods, such as kernel density estimation, require knowledge of the frequency of observations within a particular bandwidth. Due to privacy concerns, this knowledge may be challenging to obtain in sparsely populated areas. Perturbation techniques are also increasingly being applied to privacy-sensitive statistics, and perturbation greatly reduces the statistical utility of estimates even when such information is obtainable. By using only sample moments within the density estimate expression, we can tailor our estimate to the information that is available, with sample-moment information usually being more easily accessible, especially for lower-order moments.
By the law of large numbers, \(\hat{f}_{\textbf{X};\textbf{K}}(\textbf{x}; \{\textbf{X}_i\}_{i=1}^M )\) approaches \(f_{\textbf{X};\textbf{K}} \) as M increases. At the same time, the function \(f_{\textbf{X};\textbf{K}}\) is a reasonable approximation of \(f_{\textbf{X}}\) as \(\textbf{K} \rightarrow \infty \). It is therefore sensible to propose \(\hat{f}_{\textbf{X};\textbf{K}}(\textbf{x}; \{\textbf{X}_i\}_{i=1}^M )\) as an estimator of \(f_{\textbf{X}}\). However, \(\textbf{K}\) and \(\{\textbf{X}_i\}_{i=1}^M\) might either independently or mutually impact the performance of the estimator in estimating \(f_{\textbf{X}}\). To construct a good estimator \(\hat{f}_{\textbf{X};\textbf{K}}\) in practice, the max order \(\textbf{K}\) can work as an “adjuster” while the working sample \(\{\textbf{X}_i\}_{i=1}^M\) is fixed.
In the following sections, we propose and demonstrate a computational statistics approach for determining \(\textbf{K}\) in \(\hat{f}_{\textbf{X};\textbf{K}}(\textbf{x}; \{\textbf{X}_i\}_{i=1}^M )\), given M is fixed. Under the mean square distance \(L_2(\Omega ^{(N)}, \mu _{\nu })\) criterion, we expect that with the determined \(\textbf{K}\), \(\hat{f}_{\textbf{X};\textbf{K}}\) can perform well based on the available sample information from the underlying population \(\textbf{X}\).
3 The \(\mathcal {N}_{\textbf{K}}\)-norm and max polynomial order
To quantify the quality of the resulting density estimate, we propose an averaged measure of divergence from the true joint density function, referred to as the \(\mathcal {N}_{\textbf{K}}\)-norm, which is similar to the integrated square error statistic but allows for the decomposition of the error in terms of the estimated and true coefficients given in (10).
The definition of the \(\mathcal {N}_{\textbf{K}}\)-norm of the approximation is given below.
Definition 4
\(\mathcal {N}_{\textbf{K}}\)-Norm of the Approximation.
The \(\mathcal {N}_{\textbf{K}}\)-norm of the sample \(\{\textbf{X}_i\}_{i=1}^M\), denoted \(\mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \), is defined to be the squared \(L_2({\Omega ^{(N)}},\mu _{\nu })\) norm of the difference between the sample approximation \(\hat{f}_{\textbf{X};\textbf{K}}\left( \textbf{x};\{\textbf{X}_i\}_{i=1}^M \right) \) and the true joint density \(f_{\textbf{X}}(\textbf{x})\), taken as a ratio of the reference joint density \(f_{\nu }(\textbf{x})\). \(\mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \) is expressed as
$$\begin{aligned} \mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) = \int _{\Omega ^{(N)}} \left( \frac{\hat{f}_{\textbf{X};\textbf{K}}\left( \textbf{x};\{\textbf{X}_i\}_{i=1}^M \right) - f_{\textbf{X}}(\textbf{x})}{f_{\nu }(\textbf{x})} \right) ^2 f_{\nu }(\textbf{x}) \, d\textbf{x}. \end{aligned}$$
\(\mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \) is a statistic. Using \(\mathcal {N}_{\textbf{K}}\) to measure the difference between \(\hat{f}_{\textbf{X};\textbf{K}}\) and \(f_\textbf{X}\) quantitatively differs from measuring the maximum pointwise difference between the two functions. A small value of \(\mathcal {N}_{\textbf{K}}\) cannot guarantee that the differences between \(\hat{f}_{\textbf{X};\textbf{K}}(\textbf{x})\) and \(f_\textbf{X}(\textbf{x})\) are small for all \(\textbf{x}\). It can, however, ensure that the differences between \(\hat{f}_{\textbf{X};\textbf{K}}(\textbf{x})\) and \(f_\textbf{X}(\textbf{x})\) are, on average, captured by the value of \(\mathcal {N}_{\textbf{K}}\). The difference is measured as a difference of ratios; as long as \(f_{\nu }(\textbf{x})\) is bounded away from zero, such a ratio difference reasonably depicts the difference between \(\hat{f}_{\textbf{X};\textbf{K}}(\textbf{x})\) and \(f_\textbf{X}(\textbf{x})\), particularly if \(f_{\nu }\) is a uniform density.
It is also important to note the similarity of the \(\mathcal {N}_{\textbf{K}}\)-norm to the integrated square error statistic (ISE) of a density estimate \(\hat{f}(\textbf{x})\) of \(f(\textbf{x})\), expressed below,
$$\begin{aligned} {\textit{ISE}}(\hat{f}) = \int _{\Omega ^{(N)}} \left( \hat{f}(\textbf{x}) - f(\textbf{x}) \right) ^2 d\textbf{x}. \end{aligned}$$
The point of difference between the \(\mathcal {N}_{\textbf{K}}\)norm and \({\textit{ISE}}(\hat{f})\) is that the former is a weighted squared \(L_2\)norm in the \(L_2({\Omega ^{(N)}},\mu _{\nu })\) space. In contrast, the integrated square error is measured within Euclidean space. Using this weighted measure, the expected value of the statistic, which reflects the variance and bias components of the estimate, can be expressed in terms of the corresponding coefficients in (10). This expression of \(\mathbb {E}_{\textbf{X}} \left[ \mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] \) is given in Theorem 4.
Theorem 4
The expected value of \(\mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \) is given by
$$\begin{aligned} \mathbb {E}_{\textbf{X}} \left[ \mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] = \sum _{n_1=0}^{K_1}\dots \sum _{n_N=0}^{K_N} {\textit{Var}}_{\textbf{X}}\left( \hat{C}_{n_1,\dots ,n_N}\right) + \mathcal {T}_{\textbf{K}}, \end{aligned}$$
where \( \mathcal {T}_{\textbf{K}} = \sum _{n_1=0}^\infty \dots \sum _{n_N=0}^\infty C_{n_1,\dots ,n_N}^2 - \sum _{n_1=0}^{K_1}\dots \sum _{n_N=0}^{K_N} C_{n_1,\dots ,n_N}^2 \) denotes the remainder term.
The proof of Theorem 4 is simple if we note that,
$$\begin{aligned} \mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) = \sum _{n_1=0}^{K_1}\dots \sum _{n_N=0}^{K_N} \left( \hat{C}_{n_1,\dots ,n_N} - C_{n_1,\dots ,n_N}\right) ^2 + \mathcal {T}_{\textbf{K}}, \end{aligned}$$
and that each \(\hat{C}_{n_1,\dots ,n_N}\) is an unbiased estimator of \(C_{n_1,\dots ,n_N}\).
A crucial dichotomy arises when applying the moment-based density estimation method, demonstrated in the expression for the expected value of the \(\mathcal {N}_{\textbf{K}}\)-norm. Increasing the order of the approximation will reduce the remainder term \(\mathcal {T}_{\textbf{K}}\). However, an increase in the order of approximation also inflates the aggregated variance of the coefficient estimators.
This trade-off between the remainder term and the estimator variance is at the heart of determining the optimal strategy for approximating the joint density function in practice, based on sample data. The trade-off also suggests that the usual approach to polynomial approximation in mathematics, namely taking the largest order feasible given computational and accuracy requirements, is inappropriate in the sample context. Instead, it suggests that there may be an optimal max polynomial order for the approximation.
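The trade-off can be observed empirically. In the univariate sketch below (our own illustration; the true density is known only because this is a simulation), the target Beta(2, 5) density is a degree-5 polynomial on \([0,1]\), so the truncation remainder vanishes from order 5 onwards while coefficient noise keeps accumulating, and the realised weighted error attains its minimum at a small finite max order.

```python
import numpy as np
from numpy.polynomial import legendre as L

def P(n, x):
    """Orthonormal polynomial w.r.t. the uniform reference density on [0, 1]."""
    c = np.zeros(n + 1)
    c[n] = 1.0
    return np.sqrt(2 * n + 1) * L.legval(2 * x - 1, c)

true_pdf = lambda x: 30 * x * (1 - x) ** 4   # Beta(2, 5); known here only for illustration
rng = np.random.default_rng(2)
sample = rng.beta(2, 5, size=200)

u, w = np.polynomial.legendre.leggauss(80)
x, wx = (u + 1) / 2, w / 2                   # quadrature nodes/weights on [0, 1]

def weighted_ise(K):
    """Realised N_K-type error: weighted L2 distance between the sample
    moment-based estimate of max order K and the true density (f_nu = 1)."""
    C = np.array([np.mean(P(n, sample)) for n in range(K + 1)])
    fhat = sum(C[n] * P(n, x) for n in range(K + 1))
    return np.sum(wx * (fhat - true_pdf(x)) ** 2)

errs = [weighted_ise(K) for K in range(1, 16)]
best_K = 1 + int(np.argmin(errs))
print(best_K)   # finite and small; never above 5 here, as each order beyond 5 only adds noise
```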
We provide the following definition of an optimal max polynomial order.
Definition 5
An Optimal Max Polynomial Order.
An optimal max polynomial order, denoted \(\textbf{K}^{*} = (K_1^{*},\dots ,K_N^{*})\), is defined as a vector of max polynomial orders in the approximation such that
$$\begin{aligned} \mathbb {E}_{\textbf{X}} \left[ \mathcal {N}_{\mathbf {K^{*}}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] \le \mathbb {E}_{\textbf{X}} \left[ \mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] \end{aligned}$$
for any possible max polynomial order \(\textbf{K} = (K_1,\dots ,K_N)\).
This \(\textbf{K}^{*}\) ensures that the mean of \(\mathcal {N}_{\mathbf {K^*}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \) takes the smallest value. It does not mean that \(\mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \) attains its smallest value at \(\textbf{K}=\textbf{K}^*\) for every realisation of the random sample \(\{\textbf{X}_i\}_{i=1}^M\). The variance of \(\mathcal {N}_{\mathbf {K^*}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \) indicates the confidence with which a realisation of \(\mathcal {N}_{\mathbf {K^*}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \) falls in a close neighbourhood of the smallest value.
Given just the definition, it is unclear whether such a quantity can be obtained. It appears entirely feasible that taking ever larger max polynomial orders may continue to improve the density estimate (as measured by the \(\mathcal {N}_{\textbf{K}}\)-Norm). However, in the following theorem we show that, provided the estimated coefficients applied in the approximation have a non-zero variance (beyond some point), the optimal max polynomial order is finite, i.e. if \(\textbf{K}^{*} = (K_1^{*},\dots ,K_N^{*})\) then \(K_k^{*} < \infty \) for \(k = 1,\dots ,N\). In practice, the estimated coefficients applied in the approximation will have non-zero variance, as they are linear combinations of higher and higher order sample moments.
Theorem 5
Suppose there exists \(\epsilon _0 > 0 \) such that \( {\textit{Var}}_{\textbf{X}}(\hat{C}_{n_1,\dots ,n_N}) \ge \epsilon _0 \) for all \(n_1,\dots ,n_N \in \mathbb {N}_0\) where \((n_1,\dots ,n_N) \ne (0,\dots ,0)\). Then \( \mathbb {E}_{\textbf{X}} \left[ \mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] \) has a finite optimal max polynomial order.
The proof of Theorem 5 can be found in “Appendix C”.
In a practical situation, we may not know the true joint density of a sample, and hence it would be untenable to calculate the \(\mathcal {N}_{\textbf{K}}\)-Norm statistic directly. However, this value does not need to be known explicitly to determine the optimal max polynomial order. Consider the following definition of the shifted \(\mathcal {N}_{\textbf{K}}\)-Norm.
Definition 6
The Shifted \(\mathcal {N}_{\textbf{K}}\)-Norm.
The shifted \(\mathcal {N}_{\textbf{K}}\)-Norm, denoted \(\mathcal {N}_{\textbf{K}}^*\left( \{\textbf{X}_i\}_{i=1}^M \right) \), is defined as
$$\begin{aligned} \mathcal {N}_{\textbf{K}}^*\left( \{\textbf{X}_i\}_{i=1}^M \right) = \int _{\Omega ^{(N)}} \frac{\hat{f}^2_{\textbf{X};\textbf{K}}(\textbf{x})}{f_{\nu }(\textbf{x})} d\textbf{x} - 2\int _{\Omega ^{(N)}} \frac{\hat{f}_{\textbf{X};\textbf{K}}(\textbf{x}) f_{\textbf{X}}(\textbf{x})}{f_{\nu }(\textbf{x})} d\textbf{x}. \end{aligned}$$(13)
The relationship between \(\mathcal {N}_{\textbf{K}}^*\left( \{\textbf{X}_i\}_{i=1}^M \right) \) and \(\mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \) is summarised as follows:

(1)
\(\begin{aligned} \mathcal {N}_{\textbf{K}}^*\left( \{\textbf{X}_i\}_{i=1}^M \right) =\mathcal {N}_{\textbf{K}}(\{\textbf{X}_i\}_{i=1}^M) - \int _{\Omega ^{(N)}}\frac{f^2_{\textbf{X}}(\textbf{x})}{f_{\nu }(\textbf{x})} d\textbf{x} \end{aligned}\).

(2)
If \(\mathbb {E}_{\textbf{X}}(\mathcal {N}_{\textbf{K}}^*(\{\textbf{X}_i\}_{i=1}^M))\) takes its smallest value at \(\textbf{K}^*\), then \(\mathbb {E}_{\textbf{X}}(\mathcal {N}_{\textbf{K}}(\{\textbf{X}_i\}_{i=1}^M))\) also takes its smallest value at \(\textbf{K}^*\), and vice versa.

(3)
Both \(\mathcal {N}_{\textbf{K}}^*(\{\textbf{X}_i\}_{i=1}^M)\) and \(\mathcal {N}_{\textbf{K}}(\{\textbf{X}_i\}_{i=1}^M)\) have the same variance.
The second term of \(\mathcal {N}_\textbf{K}^*(\{\textbf{X}_i\}_{i=1}^M)\) in (13) can be estimated more easily in a computational approach, i.e. for any random function \(g(\textbf{X})\), \(\mathbb {E}_{\textbf{X}}[g(\textbf{X})]\approx (1/M) \sum _{i=1}^M g(\textbf{X}_i)\). Therefore, given the relationship between \(\mathbb {E}_{\textbf{X}}(\mathcal {N}_{{\textbf{K}}}^*(\{\textbf{X}_i\}_{i=1}^M))\) and \(\mathbb {E}_{\textbf{X}}(\mathcal {N}_{\textbf{K}} (\{\textbf{X}_i\}_{i=1}^M))\), the optimal max polynomial order \(\textbf{K}^*\) can be identified through \(\min _{\textbf{K}} \{\mathbb {E}_{\textbf{X}}(\mathcal {N}_{\textbf{K}}^*(\{\textbf{X}_i\}_{i=1}^M))\}\) rather than \(\min _{\textbf{K}}\{\mathbb {E}_{\textbf{X}}(\mathcal {N}_{\textbf{K}}(\{\textbf{X}_i\}_{i=1}^M))\}\).
We propose the following strategy for estimating the mean of the shifted \(\mathcal {N}_{\textbf{K}}\)-Norm given a random sample \(\{\textbf{X}_i\}_{i=1}^M\). This strategy is valid subject to the condition that the probability distribution of any subsample randomly drawn from the random sample \(\{\textbf{X}_i\}_{i=1}^M\) is the same as that of \(\{\textbf{X}_i\}_{i=1}^M\). This condition is not overly strict in general, provided the size of the subsample is not too small.

1.
Randomly partition observations of the random sample into two approximately equal-sized partitions \(\{\textbf{X}_i\}_{i\in \mathcal {I}}\) and \(\{\textbf{X}_i\}_{i\in \overline{\mathcal {I}}}\).

2.
Calculate the statistic:
$$\begin{aligned}&\hat{\mathcal {N}}_{\textbf{K}}^{*}\left( \{\textbf{X}_i\}_{i\in \mathcal {I}}; \{\textbf{X}_i\}_{i\in \overline{\mathcal {I}}}\right) \\&\quad = \sum _{n_1 = 0}^{K_1} \dots \sum _{n_N = 0}^{K_N} \hat{C}_{n_1,\dots ,n_N}(\{\textbf{X}_i\}_{i\in \mathcal {I}})^2 \\&\qquad - \frac{2}{ |\overline{\mathcal {I}}| } \sum _{j \in \overline{\mathcal {I}} } \frac{\hat{f}_{\textbf{X};\textbf{K}}({X}_{1,j},\dots ,{X}_{N,j};\{\textbf{X}_i\}_{i\in \mathcal {I}})}{f_{\nu } ({X}_{1,j},\dots ,{X}_{N,j})} \end{aligned}$$where \(|\overline{\mathcal {I}}| \) denotes the sample size of \(\{ \textbf{X}_i \}_{i \in \overline{\mathcal {I}}}\).

3.
Repeat steps 1 and 2 \(B\) times to obtain the set of repeated shifted \(\mathcal {N}_{\textbf{K}}\)-Norm statistics
$$\begin{aligned} \Big \{ \hat{\mathcal {N}}_{\textbf{K}}^{*}\left( \{\textbf{X}_i\}_{i\in \mathcal {I}_b}; \{\textbf{X}_i\}_{i\in \overline{\mathcal {I}_b}}\right) \Big \}_{b=1}^B. \end{aligned}$$ 
4.
Estimate \(\mathbb {E}_{\textbf{X}}[\mathcal {N}^*_{\textbf{K}}(\{\textbf{X}_i\}_{i=1}^M)]\) by
$$\begin{aligned} \hat{\mathcal {N}}_{\textbf{K};B}^{*}&\left( \{\textbf{X}_i\}_{i=1}^M\right) \nonumber \\&= (1 / B) \sum _{b=1}^B \hat{\mathcal {N}}_{\textbf{K}}^{*}\left( \{\textbf{X}_i\}_{i\in \mathcal {I}_b}; \{\textbf{X}_i\}_{i\in \overline{\mathcal {I}_b}}\right) . \end{aligned}$$(14)The variance of the shifted \(\mathcal {N}_{\textbf{K}}\)-Norm statistics can be estimated through the sample variance.
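The four steps above can be sketched end-to-end. The following is an illustrative one-dimensional version under stated assumptions (uniform reference density on [0, 1], shifted Legendre basis; the sample, function names, and parameter values are ours, not the paper's): coefficients are fitted on one half of each random split and the shifted-norm statistic is evaluated on the held-out half, then averaged over \(B\) splits.

```python
import math
import random

def shifted_legendre(n, x):
    # Orthonormal shifted Legendre polynomial on [0, 1] (uniform reference).
    if n == 0:
        return 1.0
    t = 2.0 * x - 1.0
    p_prev, p = 1.0, t
    for k in range(1, n):
        p_prev, p = p, ((2 * k + 1) * t * p - k * p_prev) / (k + 1)
    return math.sqrt(2.0 * n + 1.0) * p

def shifted_norm_stat(train, test, K):
    """Step 2: sum of squared fitted coefficients minus twice the
    held-out sample mean of f_hat / f_nu (here f_nu = 1)."""
    coeffs = [sum(shifted_legendre(n, xi) for xi in train) / len(train)
              for n in range(K + 1)]
    sum_sq = sum(c * c for c in coeffs)
    cross = sum(sum(c * shifted_legendre(n, xj)
                    for n, c in enumerate(coeffs))
                for xj in test) / len(test)
    return sum_sq - 2.0 * cross

def estimate_shifted_norm(sample, K, B=50, seed=0):
    """Steps 1, 3 and 4: average the statistic over B random half-splits."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(B):
        idx = list(range(len(sample)))
        rng.shuffle(idx)
        half = len(idx) // 2
        train = [sample[i] for i in idx[:half]]
        test = [sample[i] for i in idx[half:]]
        total += shifted_norm_stat(train, test, K)
    return total / B

random.seed(2)
sample = [random.betavariate(2.0, 3.0) for _ in range(400)]
scores = {K: estimate_shifted_norm(sample, K) for K in range(1, 9)}
K_star = min(scores, key=scores.get)  # estimated optimal max order
```

Minimising the averaged statistic over a grid of candidate \(K\) values then yields the estimated optimal max polynomial order, as in the search procedure described next.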
With the information of the mean of the shifted \(\mathcal {N}_{\textbf{K}}\)-Norm, we can determine an appropriate max polynomial order \(\textbf{K}=(K_1, \ldots , K_N) \) for moment-based density function estimation. Searching for the proper max polynomial order \(\textbf{K}\) can be a complex process, especially when \(N >3\). To reduce the complexity, we may in practice restrict the search to those \(\textbf{K}\) with \(K_1=\cdots = K_N\). However, this restriction may not be appropriate when there are considerable differences in scale and distribution between the marginal variables of the data. We use such a restricted search in the comparison study undertaken in the next section, as both marginal variables are marginally beta distributed on [0, 1]. In Sect. 5, we do not apply a restricted search, as there are large differences in distribution and scale across the marginal variables of the data.
4 Comparing density estimation methods
In the following section, we compare the performance of the moment-based density estimate with that of a range of other multivariate density estimation methods through the bivariate example below.
4.1 Bivariate beta distribution example
Consider the bivariate beta distribution discussed in Macomber and Myers (1983), which has density function
$$\begin{aligned} f_{X_1,X_2}(x_1,x_2) = \frac{\Gamma (\alpha _1+\alpha _2+\alpha _3)}{\Gamma (\alpha _1)\Gamma (\alpha _2)\Gamma (\alpha _3)}\, x_1^{\alpha _1 - 1} x_2^{\alpha _2 - 1} (1 - x_1 - x_2)^{\alpha _3 - 1} \end{aligned}$$(15)
for \(x_1,x_2 > 0\) and \(x_1 + x_2 \le 1\). The moments of this bivariate density function are given by Macomber and Myers (1983) as
$$\begin{aligned} \mathbb {E}\left[ X_1^{n_1} X_2^{n_2}\right] = \frac{\Gamma (\alpha _1+\alpha _2+\alpha _3)\, \Gamma (\alpha _1 + n_1)\, \Gamma (\alpha _2 + n_2)}{\Gamma (\alpha _1)\,\Gamma (\alpha _2)\, \Gamma (\alpha _1+\alpha _2+\alpha _3 + n_1 + n_2)}. \end{aligned}$$
We perform a simulation study of 100 simulations for four scenarios: \(\alpha _1 = 0.6\) and \(\alpha _2 = 0.8\); \(\alpha _1 = 1.6\) and \(\alpha _2 = 0.8\); \(\alpha _1 = 0.6\) and \(\alpha _2 = 1.8\); and \(\alpha _1 = 1.6\) and \(\alpha _2 = 1.8\), where in every scenario \(\alpha _3 = 1\). In all the scenarios, we consider the performance of the bivariate Gaussian kernel density estimate (Duong 2007), the bivariate Gaussian copula with a marginal beta distribution (Hofert et al. 2020; Yan 2007; Kojadinovic and Yan 2010; Hofert and Mächler 2011), and the density estimation trees estimate (Meyer 2017) against two moment-based density estimates. The first moment-based density estimate (uMBD) applies a uniform reference density for the marginal distributions, \(\nu _1,\nu _2 \sim U[0,1]\). The other moment-based density estimate (bMBD) applies a beta reference density for both marginal distributions, where the parameters of the beta distribution are taken to be the method of moments estimates of the sample data for each marginal.
We note that the bivariate Gaussian kernel density estimate (Duong 2007) is a nonparametric local estimation method and is calculated using the ks package in R and the associated kde function with optimal bandwidth selector (see Duong 2018). The bivariate Gaussian copula with a marginal beta distribution (Hofert et al. 2020; Yan 2007; Kojadinovic and Yan 2010; Hofert and Mächler 2011) is a parametric global density estimate and is calculated using the copula package in R (see Hofert et al. 2020), with the marginal beta distribution parameters taken to be the corresponding method of moments estimates. The density estimation trees estimate (Meyer 2017) is a machine learning estimate and is calculated using the detpack package in R, with the element refinement goodness-of-fit significance taken to be \(1\times 10^{-5}\) (see Meyer 2019).
For all density estimates, the integrated square error (12) with respect to the true bivariate beta density function (15) is determined by Monte Carlo integration using a uniform sample of size 100,000 on \([0,1] \times [0,1]\). The ISE statistic is averaged across all 100 simulations to give the mean integrated square error (MISE). In the case of the uMBD moment-based density estimate, the ISE statistic corresponds precisely to the \(\mathcal {N}_{\textbf{K}}\)-Norm statistic (see Definition 4), and hence the MISE is an estimate of the expected \(\mathcal {N}_{\textbf{K}}\)-Norm expressed in Theorem 4. In the case of the bMBD moment-based density estimate, the mean \(\mathcal {N}_{\textbf{K}}\)-Norm across the 100 simulations is also determined using Monte Carlo integration with a uniform sample of size 100,000 on \([0,1] \times [0,1]\).
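The Monte Carlo evaluation of the ISE described here can be sketched as follows. The densities below are illustrative stand-ins, not the study's estimates: because the unit square has unit area, the average squared error at uniformly drawn points estimates the integral.

```python
import random

def mc_ise(f_hat, f_true, n_points=100000, seed=0):
    """Monte Carlo ISE over [0,1] x [0,1]: average the squared error
    at uniform points (the domain has unit area, so the sample mean
    of the squared error estimates the integral)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_points):
        x1, x2 = rng.random(), rng.random()
        diff = f_hat(x1, x2) - f_true(x1, x2)
        total += diff * diff
    return total / n_points

# Sanity check with two known densities on the unit square:
# f(x1, x2) = 2 * x1 (X1 ~ Beta(2,1), X2 uniform) versus the uniform
# density; the exact ISE is \int\int (2 x1 - 1)^2 dx1 dx2 = 1/3.
ise = mc_ise(lambda x1, x2: 2.0 * x1, lambda x1, x2: 1.0)
```

With 100,000 points the Monte Carlo standard error here is well below 0.01, so the estimate lands close to the exact value 1/3.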
The max polynomial order applied in the expression of the moment-based density estimates, \(\textbf{K} = (K_1, K_2)\), is set to be consistent across both marginal distributions, i.e. \(K = K_1 = K_2\). We determine the optimal max polynomial order (see Definition 5) by selecting the max polynomial order that minimises the mean \(\mathcal {N}_{\textbf{K}}\)-Norm. For uMBD, this is consistent with minimising the MISE. For each comparison density estimate, we also determine the percentage of simulations in which the uMBD and bMBD achieve a smaller MISE value when the optimal max polynomial order is used.
The remainder term component of the mean \(\mathcal {N}_{\textbf{K}}\)-Norm, i.e. the \(\mathcal {T}_{\textbf{K}}\) term given in Theorem 4, is also estimated by evaluating the \(\mathcal {N}_{\textbf{K}}\)-Norm when theoretical moments are used to express the moment-based density estimate. This term reflects the bias component of the error in the moment-based density estimate.
We provide the simulation study results for all four scenarios in Table 2.
The simulation study shows that both the uMBD and bMBD estimates outperform the Gaussian kernel density estimate nearly universally in all scenarios where at least one \(\alpha _i < 1\) (\(i=1,2\)). However, the Gaussian kernel density estimate universally outperforms both the uMBD and bMBD estimates when \(\alpha _1 = 1.6\) and \(\alpha _2 = 1.8\). In this scenario, the true density \(f_{X_1,X_2}(x_1,x_2)\) achieves its maximum on the boundary \(x_1 + x_2 = 1\). This boundary condition is not specifically enforced by construction for any of the density estimates; however, owing to the local estimation applied by the Gaussian kernel density estimate, zero-density areas of the sample space can be better predicted. The polynomial curves applied in the uMBD and bMBD approaches require a high max polynomial order to adequately model the immediate drop of density values to zero. This issue is less significant in the other parameterisations, as density function values near the \(x_1 + x_2 = 1\) boundary are small. Since a higher max polynomial order would also entail greater sampling variance, the optimal max polynomial order remains relatively low, reducing the quality of the moment-based density estimates relative to the Gaussian kernel density estimate for the \(\alpha _1 = 1.6\) and \(\alpha _2 = 1.8\) parameterisation. The Gaussian copula with marginal beta distribution performs comparatively best in scenarios where \(\alpha _1 = 0.6\), outperforming the uMBD estimate in approximately 86–88% of samples. The copula estimate does not perform as well as the bMBD estimate in any scenario, achieving a smaller MISE in at most 1% of simulations per scenario. The density estimation trees estimate similarly does not perform as well as the bMBD estimate across all scenarios, and the uMBD estimate also outperforms the density estimation trees estimate in most simulations in all scenarios.
5 Generating synthetic microdata example
In the following section, we provide an example of synthetic microdata generated from a “confidential” data set. Before presenting this example, we outline the process of simulating synthetic data from the moment-based density estimate.
5.1 Simulating synthetic data from the density estimate
Given a random sample \(\{(Y_i,\textbf{X}_i)\}_{i=1}^M =\{(Y_{i},X_{1,i}, \dots , X_{N,i})\}_{i=1}^M\) of \((Y,\textbf{X}) = (Y,X_1,X_2,\dots ,X_N)\), and the corresponding moment-based density estimate \(\hat{f}_{Y,\textbf{X};\textbf{K}}((y,\textbf{x}); \{(Y_i,\textbf{X}_i)\}_{i=1}^M )\), we define the moment-based conditional cumulative probability function estimate of Y given \(\textbf{X}\) as follows,
$$\begin{aligned} \hat{F}_{Y {\mid }\textbf{X};\textbf{K}}(y {\mid }\textbf{x}; \{(Y_i,\textbf{X}_i)\}_{i=1}^M) = \frac{\int _{a_Y}^{y} \hat{f}_{Y,\textbf{X};\textbf{K}}((t,\textbf{x}); \{(Y_i,\textbf{X}_i)\}_{i=1}^M )\, dt}{\hat{f}_{\textbf{X};\mathbf {K_X}}(\textbf{x};\{\textbf{X}_i\}_{i=1}^M )}, \end{aligned}$$
where \(\hat{f}_{\textbf{X};\mathbf {K_X}}(\textbf{x};\{\textbf{X}_i\}_{i=1}^M )\) denotes the moment-based density estimate of \(\{\textbf{X}_i\}_{i=1}^M\) with max polynomial order \(\mathbf {K_X}\), and \(a_Y\) denotes the lower bound of Y, i.e. \(Y \in [a_Y,b_Y]\).
Given Property (2) in Sect. 2.1.1, we derive the following expression for the moment-based conditional cumulative probability function estimate of Y given \(\textbf{X}\), \(\hat{F}_{Y {\mid }\textbf{X};\textbf{K}}(y {\mid }\textbf{x}; \{(Y_i,\textbf{X}_i)\}_{i=1}^M)\).
where \(f_{\nu _Y}\) and \(f_{\nu _{\textbf{X}}}\) are the reference densities of Y and \(\textbf{X}\) respectively, \( \hat{C}_{n_Y,n_{X,1},\dots ,n_{X,N}} \) are the estimated coefficients of the moment-based density estimate \(\hat{f}_{Y,\textbf{X};\textbf{K}}((y,\textbf{x}); \{(Y_i,\textbf{X}_i)\}_{i=1}^M )\) with \(\textbf{K} = (K_Y,K_{X,1},\dots ,K_{X,N})\), \(\lambda ^{*}_{Y,n_Y} = {\left\{ \begin{array}{ll} \lambda _{Y,n_Y} &{} n_Y = 1,2,\dots \\ 1 &{} n_Y = 0 \end{array}\right. }\) where \(\lambda _{Y,n_Y}\) is the \(n_Y\)th eigenvalue related to the univariate polynomial \(P_{Y,n_Y}(y)\), \(F_{\nu _Y}(y)\) denotes the reference cumulative distribution function of Y, and \(\delta _{n_Y,0}\) denotes the Kronecker delta.
We make note of the following in regards to \(\hat{F}_{Y {\mid }\textbf{X};\textbf{K}}(y {\mid }\textbf{x}; \{(Y_i,\textbf{X}_i)\}_{i=1}^M)\):

1.
We have an analytical expression for the moment-based conditional cumulative probability function estimate of Y given \(\textbf{X}\), which does not require numerical integration.

2.
The moment-based conditional cumulative probability function estimate of Y given \(\textbf{X}\) is a univariate cumulative distribution function (in terms of Y) from which we can simulate an observation \(Y^{*}\) using inverse transform sampling. That is, given \(X_1^{*},\dots ,X_N^{*}\) from \(\hat{f}_{\textbf{X};\mathbf {K_X}}(\textbf{x};\{\textbf{X}_i\}_{i=1}^M )\) and a uniform random variable \(U\sim \text {Unif}[0,1]\), then \(Y^{*} = \hat{F}_{Y {\mid }\textbf{X};\textbf{K}}^{-1}(U {\mid }X_1^{*},\dots ,X_N^{*}; \{(Y_i,\textbf{X}_i)\}_{i=1}^M)\) and \(X_1^{*},\dots ,X_N^{*}\) have joint density \(\hat{f}_{Y,\textbf{X};\textbf{K}}((y,\textbf{x}); \{(Y_i,\textbf{X}_i)\}_{i=1}^M )\).

3.
Without loss of generality, we can express any column of our data set as \(\{Y_i\}_{i=1}^M\) and the others as \(\{\textbf{X}_i\}_{i=1}^M\).

4.
By applying a Gibbs sampling approach, we can simulate all columns of the data to form a complete synthetic data set.
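Point 2 above, simulating \(Y^{*}\) by inverse transform sampling, can be sketched generically. Because the estimated conditional CDF is available in closed form and is monotone in \(y\), a simple bisection inversion suffices; this stand-in uses a toy CDF rather than the paper's polynomial CDF, and the function names are ours.

```python
import random

def invert_cdf(F, u, lo, hi):
    """Solve F(y) = u by bisection, assuming F is non-decreasing on
    [lo, hi] with F(lo) <= u <= F(hi)."""
    for _ in range(100):  # interval shrinks by 2^-100: ample precision
        mid = 0.5 * (lo + hi)
        if F(mid) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Toy conditional CDF: F(y) = y^2 on [0, 1], so F^{-1}(u) = sqrt(u).
y = invert_cdf(lambda t: t * t, 0.25, 0.0, 1.0)  # ≈ 0.5

# Simulating synthetic draws Y* via U ~ Unif[0, 1]:
rng = random.Random(3)
draws = [invert_cdf(lambda t: t * t, rng.random(), 0.0, 1.0)
         for _ in range(5)]
```

In a Gibbs-style sweep, each column in turn plays the role of \(Y\) and is redrawn from its estimated conditional CDF given the current values of the remaining columns.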
In the following section, we obtain a 5-dimensional moment-based density estimate and simulate data for a “sensitive” variable of the data (a variable that poses a risk to confidentiality if disclosed) by applying inverse transform sampling to the moment-based conditional cumulative probability function estimate.
5.2 Synthetic data example
The data we consider \(\{(Y_i,\textbf{X}_i)\}_{i=1}^{9568}\) are 9568 observations of 5 variables collected from a combined cycle power plant over six years (2006–2011) when the power plant was set to work with full load (Tüfekci 2014; Kaya and Tufekci 2012). The dataset consists of hourly average ambient variables (\(\textbf{X} = (X_1,X_2,X_3,X_4)\)): temperature (\(X_1\)), ambient pressure (\(X_2\)), relative humidity (\(X_3\)), and exhaust vacuum (\(X_4\)), that were collected to predict the net hourly electrical energy output of the plant (Y).
Suppose that the values of the plant's net hourly electrical energy output are considered “sensitive” due to a risk of proprietary infringement or operational sabotage were the actual data released. We produce synthetic data that mimic the true values of the net hourly electrical energy output by obtaining a moment-based density estimate \(\hat{f}_{Y,\textbf{X};\textbf{K}}\) of the joint distribution of the data. For each variable, we select a uniform reference density whose bounds equal the variable's observed range widened by a buffer of 5% of its standard deviation (added to the maximum and subtracted from the minimum).
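The reference-bound construction described above amounts to the following small helper (a hedged sketch; the function name and the illustrative values are ours, not the power plant data):

```python
import statistics

def uniform_reference_bounds(values, buffer_frac=0.05):
    """Uniform reference support: the observed range widened on each
    side by buffer_frac times the sample standard deviation, so no
    observation sits exactly on the support boundary."""
    buffer = buffer_frac * statistics.stdev(values)
    return min(values) - buffer, max(values) + buffer

# Illustrative values only:
lo, hi = uniform_reference_bounds([420.3, 445.5, 463.2, 480.1, 495.8])
```

Widening the support slightly beyond the observed range keeps all sample points strictly interior to the reference density's support, which is required for the polynomial basis to be well defined at every observation.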
We estimate the optimal max polynomial order \(\hat{\textbf{K}^{*}} =(\hat{K}_1^{*},\dots ,\hat{K}_5^{*})\) by minimising \(\hat{\mathcal {N}}_{\textbf{K};B}^{*}\left( \{(Y_i,\textbf{X}_i)\}_{i=1}^{9568}\right) \) (14) with \(B=100\).
Note that we only considered values \(\hat{K}^{*}_k \le 35\), for \(k=1,\dots ,5\), as the statistic was diverging in all cases beyond this point.
Averaged across \(B=100\) randomly sampled partitions, we obtain estimates of the shifted \({\mathcal {N}}_\textbf{K}\)-Norm (\( \hat{\mathcal {N}}_{\textbf{K};100}^{*}\left( \{(Y_i,\textbf{X}_i)\}_{i=1}^{9568}\right) \)) for max polynomial orders from 1 to 35 for each variable. The estimated optimal max polynomial order is determined to be \(\hat{\textbf{K}^{*}} = (21,22,12,5,22)\).
We simulate a synthetic sample of net hourly electrical energy outputs (\(\{Y_i^{*}\}_{i=1}^{9568}\)) using inverse transform sampling on the moment-based conditional cumulative probability function estimate \(\hat{F}_{Y {\mid }\textbf{X};\hat{\textbf{K}^{*}}}^{-1}(y {\mid }\textbf{x}; \{(Y_i,\textbf{X}_i)\}_{i=1}^{9568})\) where \(\hat{\textbf{K}^{*}} = (21,22,12,5,22)\). We note that, given uniform reference densities are used, \(\hat{F}_{Y {{\mid }} \textbf{X};\hat{\textbf{K}^{*}}}(y {\mid }\textbf{x}; \{(Y_i,\textbf{X}_i)\}_{i=1}^{9568})\) is a polynomial with respect to y. We apply the base R polynomial root-finding function polyroot() (Jenkins and Traub 1972) to compute values from the inverse moment-based conditional cumulative probability function estimate.
Histograms comparing the distribution of the actual net hourly electrical energy output (\(\{Y_i\}_{i=1}^{9568}\)) and synthetic net hourly electrical energy output (\(\{Y_i^{*}\}_{i=1}^{9568}\)) are given in Fig. 1. Scatterplots showing the associations of both the actual and synthetic net hourly electrical energy outputs with each net hourly ambient variable are given in Fig. 2. Heatmaps showing the bivariate marginal density of both the actual and synthetic net hourly electrical energy outputs with each net hourly ambient variable are given in Fig. 3.
The histograms in Fig. 1 indicate that the synthetic net hourly electrical energy outputs approximate the shape of the actual net hourly electrical energy outputs well. However, the synthetic outputs appear slightly overpopulated in the distribution's tails compared to the actual outputs, leading to a slightly “flatter” distribution overall. This overpopulation of synthetic outputs in sparsely populated areas of the actual data is also apparent when comparing the bivariate associations of the net hourly electrical energy outputs with each net hourly ambient variable shown in Figs. 2 and 3. The general pattern of association between the net hourly electrical energy outputs and each net hourly ambient variable is approximated well by the synthetic data. However, sparse areas of the actual data appear overestimated by the moment-based density estimate.
To assess the statistical utility of the synthetic data for prediction modelling, we compare statistics obtained from a main-effects ordinary least squares regression of the actual and synthetic net hourly electrical energy outputs on the net hourly ambient variables. The results of the linear regression analysis are given in Table 3. The estimated coefficients of the ambient temperature data (\(-1.727\), 95% CI [\(-1.785, -1.669\)]) and the relative humidity data (\(-0.128\), 95% CI [\(-0.143,-0.112\)]) obtained from the synthetic data underestimate the magnitude of the corresponding estimated coefficients obtained from the actual data (\(-1.978\) and \(-0.158\) respectively). In contrast, the estimated coefficients of exhaust vacuum (\(-0.259\), 95% CI [\(-0.287,-0.231\)]) and ambient pressure (0.093, 95% CI [0.057, 0.128]) obtained from the synthetic data better represent the estimated coefficients obtained from the actual data, with both actual coefficient estimates (\(-0.234\) and 0.062 respectively) falling within the 95% confidence intervals of the corresponding coefficient estimates obtained from the synthetic data. All coefficient estimates obtained from the synthetic data have increased standard error estimates. The synthetic-based linear regression model has an increased residual mean square error estimate (8.804) compared to the actual-based linear regression model (4.558). A reduction in the proportion of variation explained by the model (\(R^2\)) is also seen in the synthetic-based model, with \(R^2 = 0.748\) for the synthetic-based model against \(R^2 = 0.929\) for the actual-based model.
Although the performance of the synthetic data appears mixed, unlike other model-based synthetic generation methods, synthetic data obtained from the moment-based density estimate is not model-specific and can be used to fit any model. For instance, suppose we were to fit a full factorial linear regression model of the actual and synthetic net hourly electrical energy outputs on the net hourly ambient variables. Without redefining our generation model or resampling, we obtain the results given in Table 4 for the same synthetic data used to obtain the results in Table 3. Again, all coefficient estimates obtained from the synthetic data have increased standard error estimates, although in all cases the estimated coefficients obtained from the actual data for the full factorial model are contained within the 95% confidence interval of the corresponding estimated coefficients obtained from the synthetic data.
Considering the disclosure risk posed by the synthetic net hourly electrical energy outputs, we note that only approximately 3% of observations in the actual data have synthetic observations within a 1% standard-deviation-based interval (Mateo-Sanz et al. 2004) of their values. Figure 2 shows that regions containing multivariate outliers in the actual data are well populated in the synthetic data. A data intruder would therefore find it difficult to determine which synthetic observations (if any) could be successfully attributed to an actual observation, and which should be treated as erroneous, based on the synthetic data alone. Given a public release of the synthetic data, we suggest that the disclosure-prone multivariate outliers present in the actual data are better shielded from attribution due to the high presence of synthetic observations in and around the corresponding regions of the synthetic data.
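The attribution check reported above (the share of actual observations with a synthetic observation inside a 1% standard-deviation-based interval) can be sketched for a single variable as follows. This is a simplified univariate illustration of the Mateo-Sanz et al. (2004) style of measure, not their exact procedure, and the toy data are ours:

```python
import statistics

def interval_match_rate(actual, synthetic, frac=0.01):
    """Fraction of actual values that have at least one synthetic
    value within frac * sd(actual) of them."""
    half_width = frac * statistics.stdev(actual)
    matched = sum(1 for a in actual
                  if any(abs(a - s) <= half_width for s in synthetic))
    return matched / len(actual)

# Toy data: only the first actual value has a close synthetic match.
rate = interval_match_rate([10.0, 20.0, 30.0, 40.0], [10.05, 25.0, 39.0])
```

A low match rate, combined with well-populated outlier regions in the synthetic data, is what makes confident attribution of any individual synthetic record to an actual record difficult.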
6 Conclusion
Our specific interest in this paper is the application of moment-based density estimation in the context of sharing confidential microdata while preserving the privacy of the respondents. The multivariate moments' problem and its application to estimating joint density functions are often considered highly impractical in modern-day analysis. Methods that do exist in this context often fail to account for the sampling problems faced in real-world applications due to the computational complexity and intractability of higher-order moments. One of the critical difficulties in obtaining moment-based density expressions is the breadth of variation seen in the densities requiring estimation. The support of these distributions must align with the orthogonal basis used in the density approximation (Kleiber and Stoyanov 2013). By requiring consistency in sample space, we inherently arrive at three different cases: Hausdorff (density functions defined on a compact space), Stieltjes (density functions defined on a half-bounded (lower or upper) space) (Stieltjes 1894; Stoyanov and Tolmatz 2005) and Hamburger (density functions defined on a completely unbounded space) (Shohat and Tamarkin 1943; Akhiezer and Kemmer 1965). We have focused mainly on the Hausdorff problem, joint densities on a compact space, and have only referred to our ability to apply such methods on the unbounded space provided moment determinacy.
Determinacy refers to the existence of a unique moment-based expansion of the density function. Kleiber and Stoyanov (2013) extensively review this concept in the context of the multivariate moments problem. Interestingly, the paper investigates the relationship between marginal determinacy and the determinacy of the multivariate distribution as a whole. It raises Petersen's result (Petersen 1982) that marginal determinacy implies multivariate determinacy, and conjectures that, for absolutely continuous functions, multivariate determinacy implies marginal determinacy.
Therefore, we can comfortably apply the estimation results presented in this paper to unbounded distributions if, at the very least on the marginal level, our choice of unbounded reference densities converges to a unique polynomial expression. Furthermore, due to the unique properties of the Gaussian and gamma distributions (and related distributions, e.g., Chi-Square and Weibull), it has been shown that reference densities of this variety have moment-determinacy (Sánchez-Ruiz and Dehesa 1998; Provost 2005). This approach therefore has greater application than may appear at first glance. When unsure of the appropriateness of a particular polynomial basis, we can instead focus on reviewing the convergence marginally, rather than on the joint density as a whole, to gain an idea of moment-determinacy.
In this paper, we proposed a generalised form of a multivariate polynomial-based density expansion, making the estimate applicable not just to one type of polynomial basis but to any hypergeometric-type polynomial basis (provided determinacy is assured). This expansion is generated from a generalised multivariate polynomial basis that, by construction, allows for the estimation of joint density functions based on aggregated sufficient statistics, namely the sample moments. We derived critical statistical properties of the moment-based density estimate and suggested a computational strategy by which we can estimate the optimal max polynomial order of the expansion. Through the contributions of this paper, we have demonstrated that applying multivariate polynomial expansion to joint density estimation is beneficial for analytical formulations and can be used as a practical density estimation alternative to existing comparable methods.
Furthermore, we can extend this result to estimate both cumulative and conditional cumulative distribution functions, allowing for the simulation of data from the moment-based density estimates without requiring grid-based sampling techniques. This property has considerable ramifications when producing synthetic representative samples of the original data and imputing missing information, especially when we can only ascertain empirical moments. The production of nonparametric synthetic continuous data based on aggregated statistics is an area of great interest in data privacy research.
Change history
27 February 2023
Missing Open Access funding information has been added in the Funding Note.
References
Akhiezer, N.I., Kemmer, N.: The Classical Moment Problem: And Some Related Questions in Analysis, vol. 5. Oliver & Boyd, Edinburgh (1965)
Alexits, G.: Convergence Problems of Orthogonal Series. International Series of Monographs in Pure and Applied Mathematics, vol. 20, pp. 63–170. Pergamon Press, Oxford (1961). https://books.google.com.au/books?id=VAJRAAAAMAAJ
Charlier, C.V.L.: Frequency curves of type A in heterograde statistics. Ark. Mat. Ast. Fysik 9, 1–17 (1914)
Cramer, H.: Mathematical Methods of Statistics, vol. 9, pp. 85–89. Princeton University Press, Princeton (1946)
Duong, T.: ks: Kernel density estimation and kernel discriminant analysis for multivariate data in R. J. Stat. Softw. 21(7), 1–16 (2007). https://doi.org/10.18637/jss.v021.i07
Duong, T.: ks: Kernel Smoothing. R package version 1.11.3. https://CRAN.R-project.org/package=ks (2018)
Edgeworth, F.Y.: The law of error. Proc. Cambridge Philos. Soc. 20, 36–65 (1905)
Elderton, W.P., Johnson, N.L.: Systems of Frequency Curves, 1st edn., pp. 110–121. Cambridge University Press, Cambridge (1969). https://doi.org/10.1017/CBO9780511569654
Gram, J.P.: On the development of real functions in rows by the method of least squares. J. Reine Angew. Math. 94, 41–73 (1883)
Griffiths, R.C., Spano, D.: Multivariate Jacobi and Laguerre polynomials, infinite-dimensional extensions, and their probabilistic connections with multivariate Hahn and Meixner polynomials. Bernoulli 17, 1095–1125 (2011). https://doi.org/10.3150/10-BEJ305
Hofert, M., Kojadinovic, I., Maechler, M., Yan, J.: Copula: Multivariate Dependence with Copulas. R package version 1.0-1. https://CRAN.R-project.org/package=copula (2020)
Hofert, M., Mächler, M.: Nested Archimedean copulas meet R: the nacopula package. J. Stat. Softw. 39(9), 1–20 (2011)
Izenman, A.J.: Review papers: recent developments in nonparametric density estimation. J. Am. Stat. Assoc. 86(413), 205–224 (1991). https://doi.org/10.1080/01621459.1991.10475021
Jenkins, M.A., Traub, J.F.: Algorithm 419: zeros of a complex polynomial [c2]. Commun. ACM 15(2), 97–99 (1972)
Kaya, H., Tufekci, P.: Local and global learning methods for predicting power of a combined gas & steam turbine, pp. 13–18 (2012)
Kendall, M.G., Stuart, A., Ord, J.K.: Kendall’s Advanced Theory of Statistics. Oxford University Press Inc, New York (1987)
Kleiber, C., Stoyanov, J.: Multivariate distributions and the moment problem. J. Multivar. Anal. 113, 7–18 (2013)
Kojadinovic, I., Yan, J.: Modeling multivariate distributions with continuous margins using the copula R package. J. Stat. Softw. 34(9), 1–20 (2010)
Lin, Y.X.: Density approximant based on noise multiplied data. In: Domingo-Ferrer, J. (ed.) Privacy in Statistical Databases, pp. 89–104. Springer, Cham (2014)
Macomber, J.H., Myers, B.L.: The bivariate beta distribution: comparison of Monte Carlo generators and evaluation of parameter estimates. ACM SIGSIM Simul. Dig. 14(1–4), 26–35 (1983)
Mateo-Sanz, J.M., Sebé, F., Domingo-Ferrer, J.: Outlier protection in continuous microdata masking. In: Domingo-Ferrer, J., Torra, V. (eds.) Privacy in Statistical Databases. PSD 2004. Lecture Notes in Computer Science, pp. 201–215. Springer, Berlin (2004)
Mehler, F.G.: Reihenentwicklungen nach Laplaceschen Functionen höherer Ordnung. J. Reine Angew. Math. 66, 161–176 (1866)
Meyer, D.W.: Density estimation with distribution element trees. Stat. Comput. 28(3), 609–632 (2017). https://doi.org/10.1007/s11222-017-9751-9
Meyer, D.: Detpack: Density Estimation and Random Number Generation with Distribution Element Trees. R package version 1.1.3. https://CRAN.R-project.org/package=detpack (2019)
Munkhammar, J., Mattsson, L., Rydén, J.: Polynomial probability distribution estimation using the method of moments. PLoS ONE 12(4), 1–14 (2017). https://doi.org/10.1371/journal.pone.0174573
Mustapha, H., Dimitrakopoulos, R.: Generalized Laguerre expansions of multivariate probability densities with moments. Comput. Math. Appl. 60(7), 2178–2189 (2010). https://doi.org/10.1016/j.camwa.2010.08.008
Petersen, L.: On the relation between the multidimensional moment problem and the onedimensional moment problem. Math. Scand. 51(2), 361–366 (1982)
Provost, S.B.: Moment based density approximants. Math. J. 9, 728–756 (2005)
Reid, N.: Saddlepoint methods and statistical inference. Statist. Sci. 3(2), 213–227 (1988). https://doi.org/10.1214/ss/1177012906
Sánchez-Ruiz, J., Dehesa, J.S.: Expansions in series of orthogonal hypergeometric polynomials. J. Comput. Appl. Math. 89(1), 155–170 (1998)
Shohat, J.A., Tamarkin, J.D.: The Problem of Moments, vol. 1, pp. 860–863. American Mathematical Society, New York (1943)
Solomon, H., Stephens, M.A.: Approximations to density functions using Pearson curves. J. Am. Stat. Assoc. 73(361), 153–160 (1978)
Stieltjes, T.J.: Recherches sur les fractions continues. Annales de la Faculté des sciences de Toulouse: Mathématiques 8(4), 1–122 (1894)
Stoyanov, J.: Moment properties of probability distributions used in stochastic financial models. In: Recent Advances in Financial Engineering 2014: Proceedings of the TMU Finance Workshop 2014, pp. 1–27, World Scientific (2016)
Stoyanov, J., Tolmatz, L.: Method for constructing Stieltjes classes for Mindeterminate probability distributions. Appl. Math. Comput. 165(3), 669–685 (2005)
Szegő, G.: Orthogonal Polynomials. American Mathematical Society Colloquium Publications, vol. 23, pp. 38–100. American Mathematical Society, Providence (1939). https://books.google.com.au/books?id=WRoPAAAAIAAJ
Szőkefalvi-Nagy, B.: Introduction to Real Functions and Orthogonal Expansions, pp. 310–337. Oxford University Press, Oxford (1965)
Tüfekci, P.: Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods. Int. J. Electr. Power Energy Syst. 60, 126–140 (2014). https://doi.org/10.1016/j.ijepes.2014.02.027
Wang, Z., Scott, D.W.: Nonparametric density estimation for high-dimensional data: algorithms and applications. Wiley Interdiscip. Rev. Comput. Stat. 11(4), 1461 (2019)
Withers, C.S., Nadarajah, S.: Expansions for the multivariate normal. J. Multivar. Anal. 101(5), 1311–1316 (2010). https://doi.org/10.1016/j.jmva.2010.01.001
Xu, Y.: On orthogonal polynomials in several variables. Special functions, q-series and related topics. Fields Inst. Res. Math. Sci. Commun. Ser. 14, 247–270 (1997)
Yan, J.: Enjoy the joy of copulas: with a package copula. J. Stat. Softw. 21(4), 1–21 (2007)
Acknowledgements
This research was partially completed with support from the Australian Government Research and Training Program.
Funding
Open Access funding enabled and organized by CAUL and its Member Institutions.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The original online version of this article was revised: the Open Access funding information, missing from the original publication, has been added in the Funding note.
Appendices
Appendix A: Proof of Theorem 2
Proof
Firstly, as the integrals on \(\Omega _k\) are all finite,
Furthermore, for each \(k=1,\dots ,N\), the set of orthonormal polynomials \(\{ P_{k,n_k}\}_{n_k \in \mathbb {N}_0} \) is complete in \(L_2(\Omega _k,\mu _{\nu _k})\); that is, for any function \(g(x_k)\) on \( \Omega _k \), if
then \( g(x_k) = 0\) for all \(x_k \in \Omega _k\) a.s. Therefore, for \(P_{n_1, \dots , n_N}( \textbf{x} ) = \prod _{k=1}^N P_{k,n_k}(x_k)\) and any function \(h\) on \({\Omega ^{(N)}} = \prod _{k=1}^N \Omega _k \), if, for all \(\textbf{x} \in {\Omega ^{(N)}}\),
then it follows that
for all \( n_1, \dots , n_N \in \mathbb {N}_0\). Hence,
Since \(\{P_{N,n_N}\}_{n_N \in \mathbb {N}_0}\) is complete, it follows that, for all \(x_N \in \Omega _N\) a.s.,
Applying the same reasoning, we see that for all \(x_i \in \Omega _i\) \((i=3,\dots ,N)\) a.s., we have
for all \( n_1, n_2 \in \mathbb {N}_0\), which implies for all \(x_i \in \Omega _i, i=2,\ldots ,N,\) a.s.,
This leads to
Therefore, for any function \(h\) on \({\Omega ^{(N)}} = \prod _{k=1}^N \Omega _k \), if
then \(h(\textbf{x}) = 0\;\text {for all}\; \textbf{x} \in {\Omega ^{(N)}}\) almost surely. Hence \(\{ P_{n_1,\dots , n_N}(\textbf{x}) \}_{n_1,\dots ,n_N \in \mathbb {N}_0} \) is a complete set of multivariate orthonormal polynomials. \(\square \)
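As a concrete illustration of the product construction underlying Theorem 2, the following sketch checks numerically that tensor products of univariate orthonormal polynomials are orthonormal under the corresponding product measure. The choice of probabilists' Hermite polynomials (orthonormal with respect to the standard normal density) and the quadrature order are assumptions made for this example only, not choices prescribed by the paper.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval
from math import factorial, sqrt, pi

# Gauss-Hermite_e quadrature integrates f(x) * exp(-x^2 / 2) exactly for
# polynomial f up to degree 2*20 - 1; rescale the weights so they integrate
# against the standard normal density instead.
nodes, weights = hermegauss(20)
weights = weights / sqrt(2 * pi)

def P(n, x):
    """Orthonormal probabilists' Hermite polynomial of degree n w.r.t. N(0,1)."""
    c = np.zeros(n + 1)
    c[n] = 1.0
    return hermeval(x, c) / sqrt(factorial(n))

def inner2d(m, n):
    """Inner product of the product polynomials P_{m1}P_{m2} and P_{n1}P_{n2}
    under the bivariate standard normal measure (the integral factorises)."""
    (m1, m2), (n1, n2) = m, n
    ix = float(np.sum(weights * P(m1, nodes) * P(n1, nodes)))
    iy = float(np.sum(weights * P(m2, nodes) * P(n2, nodes)))
    return ix * iy

diag = inner2d((2, 3), (2, 3))  # ≈ 1.0: orthonormality on the diagonal
off = inner2d((2, 3), (1, 3))   # ≈ 0.0: orthogonality off the diagonal
```

Because the product measure and the product polynomials both factorise, each multivariate inner product reduces to a product of univariate quadratures, which is exactly the structure the proof exploits.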
Appendix B: Proof of Theorem 3
Proof
By the Weierstrass–Stone approximation theorem (Szőkefalvi-Nagy 1965), there exists a unique polynomial approximation that converges uniformly to the continuous function \(f_{\textbf{X}}(\textbf{x}) / f_{\nu }(\textbf{x}) \). This result holds because \(\{P_{n_1, \dots , n_N}\}_{n_1,\dots ,n_N \in \mathbb {N}_0}\) is a complete orthogonal basis. Let
Given \(f_{\textbf{X}} / f_{\nu } \in L_2({\Omega ^{(N)}},\mu _{\nu })\), then, for fixed \(m_1, \dots , m_N \in \mathbb {N}_0\), by the Cauchy–Schwarz inequality,
Therefore, by the Fubini–Tonelli theorem, it follows that
Hence
Therefore, by the uniqueness of the polynomial approximation,
where the coefficient terms \(C_{n_1,\dots ,n_N}\) are given by
\(\square \)
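Theorem 3 represents the density through coefficients \(C_{n_1,\dots ,n_N}\) that are expectations of the orthonormal polynomials under \(f_{\textbf{X}}\), and hence estimable directly from sample moments. The sketch below illustrates this with assumed choices that are not from the paper: a standard normal reference density, probabilists' Hermite polynomials, and an independent bivariate standard normal sample, for which \(\hat{C}_{0,0} = 1\) exactly and all higher-order empirical coefficients should be near zero.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval
from math import factorial, sqrt

rng = np.random.default_rng(0)
# Sample from a (here known) bivariate independent standard normal density.
X = rng.standard_normal((20000, 2))

def P(n, x):
    """Orthonormal probabilists' Hermite polynomial of degree n w.r.t. N(0,1)."""
    c = np.zeros(n + 1)
    c[n] = 1.0
    return hermeval(x, c) / sqrt(factorial(n))

def C_hat(n1, n2):
    """Empirical coefficient: sample mean of P_{n1}(X_1) * P_{n2}(X_2),
    i.e. the moment-based estimate of C_{n1,n2}."""
    return float(np.mean(P(n1, X[:, 0]) * P(n2, X[:, 1])))

c00 = C_hat(0, 0)  # exactly 1, since P_0 is the constant polynomial 1
c21 = C_hat(2, 1)  # near 0 for this independent standard normal sample
```

Each \(\hat{C}_{n_1,n_2}\) is a linear combination of empirical moments of the sample, which is why the expansion can be parametrised when only moment information is available.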
Appendix C: Proof of Theorem 5
To prove Theorem 5, we need Lemma 1 below. This lemma establishes a vital property: the convergence of the “remainder term” described in Theorem 4.
Lemma 1
For a dimension \(j \in \{1,\dots ,N\}\), for any \(\epsilon > 0\) and \(m_1,\dots ,m_{j-1},\) \(m_{j+1},\dots ,m_N \in \mathbb {N}_0\), there exists \(K_j\) such that for all \(n_{j} > K_{j}\),
Proof
We provide the detailed proof of Lemma 1 only for the case \(N = 3\); the proof for general \(N\) follows the same process.
Firstly, we note that the finite-order multivariate moment-based density converges to the true density as all elements of the max polynomial order vector \(\textbf{K}\) approach infinity. This property follows as
Therefore,
We note therefore that for any \(\epsilon > 0\) there exists \(\textbf{K}_0 = (K_{0,1},\dots ,K_{0,N})\) such that
Consider the case \(N = 3\) and denote
Therefore, for any \(\epsilon _0 > 0\) we can fix \(\epsilon = \epsilon _0/2\), so that there exists a \(\textbf{K}_0 = (K_{0,1},K_{0,2},K_{0,3})\) such that \( \mathcal {T}_{\textbf{K}_0} < \epsilon _0 / 2\). Thus, for such \(\textbf{K}_0\),
As every term is positive, we note that this implies that
Furthermore, following the same process, if we were to continue to partition all the infinite sums, we would have
Now consider for any \(m_1, m_2\) the sum
We have four distinct cases of this sum:
Case 1 \(m_1 \le K_{0,1}\) and \(m_2 \le K_{0,2}\).
When \(m_1 \le K_{0,1}\) and \(m_2 \le K_{0,2}\), it follows that
Case 2 \(m_1 \le K_{0,1}\) and \(m_2 > K_{0,2}\).
When \(m_1 \le K_{0,1}\) and \(m_2 > K_{0,2}\), it follows that
Case 3 \(m_1 > K_{0,1}\) and \(m_2 \le K_{0,2}\).
When \(m_1 > K_{0,1}\) and \(m_2 \le K_{0,2}\), it follows that
Case 4 \(m_1 > K_{0,1}\) and \(m_2 > K_{0,2}\).
When \(m_1 > K_{0,1}\) and \(m_2 > K_{0,2}\), it follows that
Therefore, for any \(m_1, m_2\), we have
Similarly, repeating this process we have
\(\square \)
Proof of Theorem 5
For a fixed \(j \in \{ 1,\dots ,N \}\) and a vector \(\textbf{a} = (a_1,\dots ,a_N)\), define \(\textbf{a}_{(j)} = (a_1,\dots ,a_{j-1}, a_j +1, a_{j+1},\dots ,a_N)\).
We assume there exists \(\epsilon _0 > 0 \) such that \( {\textit{Var}}_{\textbf{X}}(\hat{C}_{n_1,\dots ,n_N}) \ge \epsilon _0 \) for all \(n_1,\dots ,n_N \in \mathbb {N}_0\) where \((n_1,\dots ,n_N) \ne (0,\dots ,0)\).
Consider this \(\epsilon _0\) and \(j\). By Lemma 1, for any \(m_1,\dots ,m_{j-1}\), \(m_{j+1},\dots ,m_N \in \mathbb {N}_0\) there exists \(K_j(\epsilon _0)>0\) such that for all \(m_{j} > K_{j}(\epsilon _0)\),
Therefore
Suppose there is an \(\textbf{m} =(m_1, \ldots , m_N)\) with \(m_{j} >K_{j}(\epsilon _0)\) such that
We note that
Hence, as \(m_j> K_j(\epsilon _0) >0 \), from (C1),
However as \({\textit{Var}}_{\textbf{X}}\left( \hat{C}_{n_1,\dots , n_N} \right) \ge \epsilon _0\) for all \(\textbf{n} = (n_1, \ldots , n_N) \not = (0, \ldots , 0)\),
which is a contradiction.
Therefore, we conclude that if there exists \(\epsilon _0 > 0\) such that \( {\textit{Var}}_{\textbf{X}}(\hat{C}_{n_1,\dots ,n_N}) \ge \epsilon _0 \) for all \(\textbf{n} = (n_1,\dots ,n_N) \ne (0,\dots ,0)\), then for any \(m_1,\dots ,m_{j-1},m_{j+1},\dots ,m_N\) there exists a \(K_j(\epsilon _0) >0\) such that for all \(m_j > K_j(\epsilon _0)\), with \(\textbf{m} = (m_1,\ldots , m_N)\),
As this applies to any dimension \(j \in \{ 1,\dots ,N \} \), there exists a vector \(\textbf{K}_0\), with elements \(K_{0,j} = K_{j}\) for each \(j = 1,\dots ,N\), such that for any max polynomial order \(\textbf{K} =(K_1, \ldots , K_N)\) not less than \(\textbf{K}_0\),
Hence, given that \(\mathbb {E} \left[ \mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] > 0\) for any \(\textbf{K} \in \mathbb {N}_0^N \), an optimal max polynomial order \(\textbf{K}^*\) (as defined in Definition 5), given by \(\textbf{K}^* = \arg \!\min _{\textbf{K} \in \mathbb {N}_0^N }\{ \mathbb {E} \left[ \mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] \} \), can be determined by considering only the set \(\mathcal {K} = \{ \textbf{K}=(K_1,\dots ,K_N) : K_i \in \mathbb {N}_0 \text { and } K_i \le K_{0,i} \text { for each } i = 1,\dots ,N \}\), which is finite since its cardinality is \(\prod _{i=1}^N (K_{0,i} + 1) < \infty \).
We note that for all \(\textbf{K} \in \mathbb {N}_0^N \backslash \mathcal {K}\),
Hence \(\mathbb {E} \left[ \mathcal {N}_{\textbf{K}_0}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] \) is a lower bound for the set \(\{\mathbb {E} \left[ \mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] : \textbf{K} \in \mathbb {N}_0^N \backslash \mathcal {K} \}\), and therefore, as \(\textbf{K}_0 \in \mathcal {K}\),
Since every finite set has a minimum, we conclude that if there exists \(\epsilon _0 > 0 \) such that \( {\textit{Var}}_{\textbf{X}}(\hat{C}_{n_1,\dots ,n_N}) \ge \epsilon _0 \) for all \(n_1,\dots ,n_N \in \mathbb {N}_0\) where \((n_1,\dots ,n_N) \ne (0,\dots ,0)\), then the expected \(\mathcal {N}_{\textbf{K}}\)-norm \( \mathbb {E}_{\textbf{X}} \left[ \mathcal {N}_{\textbf{K}}\left( \{\textbf{X}_i\}_{i=1}^M \right) \right] \) (Definition 4) has a finite optimal max polynomial order \(\textbf{K}^*\). \(\square \)
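Theorem 5 licenses restricting the search for the optimal max polynomial order to a finite grid bounded componentwise by \(\textbf{K}_0\). The exhaustive search this enables can be sketched as follows; the criterion function and bound below are hypothetical stand-ins for the estimated expected \(\mathcal {N}_{\textbf{K}}\)-norm of Definition 4 and the \(\textbf{K}_0\) from the theorem.

```python
from itertools import product

def optimal_max_order(expected_norm, K0):
    """Exhaustively search the finite grid {K : 0 <= K_i <= K0_i} for the
    max polynomial order vector minimising the supplied criterion."""
    grid = product(*(range(k + 1) for k in K0))
    return min(grid, key=expected_norm)

# Hypothetical stand-in criterion with a known minimiser at (2, 1); in
# practice this would be the estimated expected N_K-norm from sample data.
def crit(K):
    return (K[0] - 2) ** 2 + (K[1] - 1) ** 2 + 0.1 * sum(K)

best = optimal_max_order(crit, (4, 4))  # → (2, 1)
```

The grid has \(\prod _{i=1}^N (K_{0,i} + 1)\) elements, so the search terminates, which is precisely the computational consequence of the theorem.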
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wakefield, B., Lin, Y.-X., Sarathy, R. et al. Moment-based density estimation of confidential micro-data: a computational statistics approach. Stat Comput 33, 35 (2023). https://doi.org/10.1007/s11222-022-10203-1
Keywords
 Density estimation
 Confidential data
 Moment problems
 Multidimensional approximations
Mathematics Subject Classification
 62G07
 62P99
 44A60
 41A63