Mixture Density Network Estimation of Continuous Variable Maximum Likelihood Using Discrete Training Samples

Mixture Density Networks (MDNs) can be used to generate probability density functions of model parameters $\boldsymbol{\theta}$ given a set of observables $\mathbf{x}$. In some applications, training data are available only for discrete values of a continuous parameter $\boldsymbol{\theta}$. In such situations a number of performance-limiting issues arise which can result in biased estimates. We demonstrate the usage of MDNs for parameter estimation, discuss the origins of the biases, and propose a corrective method for each issue.


Introduction
A frequent goal in the analysis of particle physics data is the estimation of an underlying physical parameter from observed data, in situations where the parameter is not itself observable, but its value alters the distribution of observables. One typical approach is to use maximum likelihood estimation (MLE) to extract values of the underlying parameters and their statistical uncertainties from experimental distributions in the observed data. In order to do this, a statistical model $p(\mathbf{x}|\boldsymbol{\theta})$ of the observable(s), as a function of the underlying parameters, must be available. Such models are frequently available only from Monte Carlo simulation, not from analytic predictions. In typical usage, the value of a parameter is varied across a range of possible values; the derived models (determined from histograms or other kernel density estimators, or approximated with analytic functions) are then compared to the distributions in the observed data to estimate the parameter.
A number of methods to perform this type of inference have been discussed in the literature. See, for example, Refs. [1][2][3][4][5][6]. Some of these also use machine-learning approaches, and many support the use of multiple observables in order to improve statistical power.
If one has a complete statistical model $p(\mathbf{x}|\boldsymbol{\theta})$ for the observables available for any given value of the parameter, the MLE can be computed. Unfortunately, this is usually difficult to determine analytically, especially if there are multiple observables with correlations, detector effects, or other complications. An alternative procedure is to directly approximate the likelihood function of the parameter, $\mathcal{L}(\boldsymbol{\theta}|\mathbf{x})$.
Mixture Density Networks [7] solve a closely-related task. They are used to approximate a posterior density function $p(\boldsymbol{\theta}|\mathbf{x})$ of parameters $\boldsymbol{\theta}$ from input features $\mathbf{x}$ as a sum of basis probability density functions (PDFs). More specifically, the neural network predicts the coefficients/parameters of the posterior density. With Bayes' theorem, we will relate the posterior density, which is output by the network, to the desired parameter likelihood function. Notably, because of the flexibility of network structure, MDNs permit the straightforward use of multidimensional observables, as well as approximating the posterior density function with any desired set of basis PDFs.
When training MDNs, one typically assumes that all input parameter values are equally likely to be sampled in the training dataset, and that the parameter is continuous. In essence, this is equivalent to specifying a flat prior for the application of Bayes' theorem that translates the posterior density that the MDN learns into the likelihood function.
A related issue arises when a restricted range of parameter values is used in the training: the trained network is reluctant to predict values near the boundaries of the range due to the relative lack of training events in that vicinity. This occurs even when training with a continuous parameter distribution and affects tasks other than likelihood estimation, such as simple regression. Since this issue will appear in any practical application of an MDN to estimate a physical parameter, we will discuss it.
The aim of the paper is to demonstrate the construction of MDNs for estimating continuous parameters from datasets populated only with discrete parameter values, and to discuss pitfalls that can occur in the training. This paper is structured as follows. The basic concept of Mixture Density Networks is introduced and the application to likelihood estimation is discussed. Potential biases in the network training are explained, and procedures to mitigate them are proposed, in the context of specific examples. The performance of trained MDNs is demonstrated. Finally, limitations of the technique and avenues for future improvement are discussed.

Mixture Density Networks as Likelihood Approximators
A Mixture Density Network is a neural network whose output is a set of parameters of a function, defined to be a properly normalized PDF. The outputs of the MDN can be, for example, the means, widths, and relative contributions of a (fixed) number of Gaussian distributions. But, generally speaking, the goal of MDN usage is to obtain an estimate of a posterior density function $p(\boldsymbol{\theta}|\mathbf{x})$ of a data set that includes parameters $\boldsymbol{\theta}$ and observations $\mathbf{x}$.
The MDN represents the target posterior density with a weighted sum of generic basis PDFs $B_i$,
$$\tilde{p}(\boldsymbol{\theta}|\mathbf{x}; Z) = \sum_i \alpha_i(\mathbf{x})\, B_i\!\left(\boldsymbol{\theta}; \mathbf{z}_i(\mathbf{x})\right), \tag{1}$$
where $\alpha_i(\mathbf{x})$ and $\mathbf{z}_i(\mathbf{x})$ are the $\mathbf{x}$-dependent coefficients and parameters of the basis functions, which are predicted by the network, and $Z = \{\alpha_i\} \cup \{\mathbf{z}_i\}$ is shorthand to represent all of the coefficients of the learned model. Typically, the conditions $\alpha_i \in [0, 1]$ and $\sum_i \alpha_i = 1$ are imposed, for example through the softmax function. In principle, the basis functions can be any set of basis PDFs. A useful choice for many applications is a mixture of Gaussian functions. (Since we are estimating a posterior density function of a continuous parameter, we might expect a minimum in the negative logarithm of this function. The minimum would be quadratic to leading order, making Gaussians a natural choice for PDF basis functions.) For that choice, with $N$ Gaussian components, the neural network's output is a set of $3N - 1$ independent coefficients: $N$ means, $N$ widths, and $N - 1$ independent mixture weights.

To train the network, in each epoch, the output function of the network $\tilde{p}(\theta|\mathbf{x}; Z)$ is evaluated for each point $(\theta_j, \mathbf{x}_j)$ in the training data set. The cost function,
$$C = -\sum_j \ln \tilde{p}(\theta_j|\mathbf{x}_j; Z), \tag{2}$$
is the negative logarithm of the product of these values. The error is backpropagated through the network in the standard way, and the cost is minimized to determine the optimal coefficients $\hat{Z}$.

From the physics standpoint of parameter estimation with discretized data, we are not actually interested in using the network to model the true posterior density, as one might typically do in MDN applications. (With discrete training samples, the true posterior of the training data set involves delta functions at the various values of $\theta$ where each template lies.) Instead, we convert the $\tilde{p}(\theta|\mathbf{x}; \hat{Z})$ created by the MDN into an estimate of the likelihood function $\mathcal{L}(\theta|\mathbf{x})$.
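As an illustration of the structure just described, the following is a minimal sketch of a Gaussian-mixture MDN in PyTorch. It assumes a single observable; the class name, layer sizes, and activation are illustrative choices, not the paper's implementation. The log-width output node keeps the widths positive.

```python
import math
import torch
import torch.nn as nn

class GaussianMDN(nn.Module):
    """Minimal MDN sketch: maps one observable x to the parameters of an
    N-component Gaussian mixture over the parameter theta."""
    def __init__(self, n_components=2, n_hidden=8):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(1, n_hidden), nn.Tanh())
        self.alpha = nn.Linear(n_hidden, n_components)      # mixture weights (softmaxed)
        self.mu = nn.Linear(n_hidden, n_components)         # component means
        self.log_sigma = nn.Linear(n_hidden, n_components)  # log widths, keeps sigma > 0

    def forward(self, x):
        h = self.body(x)
        alpha = torch.softmax(self.alpha(h), dim=-1)  # enforces sum_i alpha_i = 1
        return alpha, self.mu(h), torch.exp(self.log_sigma(h))

def mdn_nll(alpha, mu, sigma, theta):
    """Cost of Eq. (2): negative log of the mixture density, averaged over
    the training points (theta_j, x_j)."""
    comp = torch.exp(-0.5 * ((theta - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    return -torch.log((alpha * comp).sum(dim=-1)).mean()
```

Minimizing `mdn_nll` over a training set drives the mixture toward the posterior density of the data.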
Using Bayes' theorem,
$$\mathcal{L}(\theta|\mathbf{x}) = p(\mathbf{x}|\theta) = \frac{p(\theta|\mathbf{x})\, p(\mathbf{x})}{p(\theta)}. \tag{4}$$
Notably, in the mindset of estimating some parameters $\theta$, as long as we ensure a flat prior $p(\theta)$, the likelihood that we seek and the posterior density which is output by the trained MDN differ only by an irrelevant multiplicative factor, $p(\mathbf{x})$, which is determined by the entire training set and does not depend on $\theta$.
In order for the MDN to effectively interpolate, it is important that there not be too much freedom in the MDN output. For example, suppose there were an equal number of training templates and Gaussian components in $\tilde{p}(\theta|\mathbf{x})$. The network could then essentially collapse the Gaussian functions to delta functions at each template value and reduce the cost function without limit. As long as the observed values $\mathbf{x}$ can reasonably be produced by multiple values of $\theta$, and the number of basis functions in the MDN is kept reasonably low, the MDN will naturally be forced to interpolate between parameter points, as desired for estimating $\mathcal{L}(\theta|\mathbf{x})$.

Density of Parameter Points
The application of Bayes' theorem in Eq. (4) involves $p(\theta)$, and simplifies if $p(\theta)$ can be assumed to be flat. To ensure this condition, the locations of the templates in parameter space should be chosen to ensure an equal density of training points across the entire range.
This first requires that the templates must be equally distributed in the parameter space. Otherwise, the density of training points would be non-constant, implying a non-flat $p(\theta)$. This would bias the network. Secondly, this necessitates that the parameter range extend slightly outside of the range of the templates in parameter space. For example, suppose we have one parameter, the parameter range is selected as $\theta \in [0, 1]$, and 10 templates are to be used. To ensure equal and unbiased coverage of the parameter range, they should be placed at $\theta = 0.05, 0.15, \ldots, 0.95$, as shown in Fig. 1. If the extra gaps (from 0 to 0.05 and from 0.95 to 1) are not included at the edges (for example, if eleven templates were used, at $\theta = 0, 0.1, \ldots, 1$), then the density of training data is higher at the extremal values of $\theta$, again creating a non-flat $p(\theta)$ and a bias in the training.
Essentially, the templates' parameter values should lie at the centers of equal-width histogram bins that extend from the lowest to highest values of $\theta$. Note that, in Fig. 1, each bin has an equal amount of training data. This condition can be generalized to multidimensional parameter spaces.
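The bin-center placement rule above can be written as a small helper; the function name is illustrative:

```python
import numpy as np

def template_centers(theta_min, theta_max, n_templates):
    """Place template parameter values at the centers of equal-width bins
    spanning [theta_min, theta_max], so the implied prior p(theta) is flat."""
    edges = np.linspace(theta_min, theta_max, n_templates + 1)
    return 0.5 * (edges[:-1] + edges[1:])
```

For the example in the text, `template_centers(0.0, 1.0, 10)` yields $0.05, 0.15, \ldots, 0.95$.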

PDF Normalization
Another issue arises with respect to the normalization of the MDN output. The MDN's output posterior $\tilde{p}(\theta|\mathbf{x}; Z)$ might not be constructed with any knowledge of the range of parameters in the training set. For example, Gaussian basis functions have infinite support and therefore always predict a non-zero posterior density outside the range of the training data. For MDN training to work properly, the output posterior density must be normalized across the selected range of $\theta$. If this is not done, parameter values near the edges of the range will be penalized, because the posterior predicts parameter values outside of the range, which are never encountered in the training data.
For a one-dimensional parameter $\theta \in [\theta_{\min}, \theta_{\max}]$, one must require that
$$\int_{\theta_{\min}}^{\theta_{\max}} \tilde{p}(\theta|\mathbf{x}; Z)\, d\theta = 1.$$
This constraint can be achieved by dividing the MDN's predicted posterior density by the integral of the posterior density over the parameter range. Since, during training, the posterior density is never evaluated outside of this range, this results in the proper normalization.
Practically speaking, it is easier to apply this renormalization procedure to each basis function $B_i$ than to the sum. This has the benefit that the condition $\sum_i \alpha_i = 1$ remains valid. In the one-dimensional parameter case, if the cumulative distribution function (CDF) $F_i$ of the basis function is available, the renormalized basis function is simply $B_i(\theta) / \left[F_i(\theta_{\max}) - F_i(\theta_{\min})\right]$. This effect is not specific to training with discrete parameter choices, and will generally occur in regions where observed data could be compatible with parameters outside the training range.
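For Gaussian basis functions, the CDF-based renormalization can be sketched with only the standard library (the function names are illustrative):

```python
import math

def norm_cdf(x, mu, sigma):
    """CDF of a Gaussian, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def truncated_gauss_pdf(theta, mu, sigma, theta_min, theta_max):
    """Gaussian basis function renormalized to integrate to 1 on
    [theta_min, theta_max]: divide by the CDF difference over the range."""
    norm = norm_cdf(theta_max, mu, sigma) - norm_cdf(theta_min, mu, sigma)
    pdf = math.exp(-0.5 * ((theta - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
    return pdf / norm
```

Even when the Gaussian's mean sits near a range boundary, the truncated density integrates to one over the parameter range, so edge values of $\theta$ are no longer penalized by probability leaking outside the range.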

Edge Bias from Training on Templates
We now discuss a bias which arises from the discreteness of the input parameter values. As discussed in the previous sections, in order to achieve a flat $p(\theta)$, we need to consider the input range of parameters to be broader than just the range where training data are located. However, the cost function is only evaluated on the training data, and so the optimizer can "cheat" by overpredicting values of $\theta$ that are compatible with a broad range of observations, while underpredicting extremal values of $\theta$ that are not represented in the training data. The symptom of this is that the probability density $\tilde{p}(\mathbf{x}|\theta)$ implied by the MDN output posterior $\tilde{p}(\theta|\mathbf{x})$, when integrated over data $\mathbf{x}$, does not equal one for all values of $\theta$ (as one would expect for a properly normalized density function). Rather, it is smaller than one at extremal values of $\theta$ and greater than one in the interior of the range. The size of this effect depends on how much the templates overlap. Fig. 2 shows examples of distributions which will demonstrate negligible and extreme bias.

Demonstration With Functional Fit
It should be emphasized that the observed edge bias is not unique to the Mixture Density Network method. Rather, it is simply a result of minimizing the cost function Eq. (2) composed of the posterior density $p(\theta|x)$ for a finite number of templates. To illustrate this, we will show the existence and correction of the bias in a simple functional fit. Consider a statistical model which, given some parameter value $\theta \in [0, 1]$, produces a Gaussian distribution of a univariate observable $x$: this could correspond to a "true" value and an "observed" value which is subject to resolution effects. For this example, we choose the distribution
$$p(x|\theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x-\theta)^2}{2\sigma^2}\right) \tag{5}$$
with $\sigma = 4$. We choose 10 equally-spaced values of $\theta$ between 0 and 1 at which to sample the model to generate training points. Since the width of each template is much broader than the total range of $\theta$, the 10 templates are not very distinguishable from one another. A few of the templates are shown on the right-hand side of Fig. 2. Next, we attempt to reconstruct the joint probability distribution $p(\theta, x)$ by performing an unbinned maximum likelihood fit of a three-parameter function to the generated data, where $A$ is a normalization factor. The best-fit values of the function parameters are not the ones used to generate the data.
To understand the source of the problem, it is helpful to consider the joint probability density, which is visualized in Fig. 3. We know the "true" density we want to reproduce: the product of the distribution $p(x|\theta)$ in Eq. (5) and a flat prior $p(\theta)$, for which the integrals $\int \hat{p}(\theta, x)\, dx$ are constant for all $\theta$. While keeping the total probability in the joint PDF constant, the fit has recognized that a better value of the likelihood can be achieved by removing probability from the regions near $\theta = 0$ and $\theta = 1$ (where there are no data points to enter the likelihood computation) and placing that probability near $\theta = 0.5$, a parameter value that is consistent with almost all values of $x$. In other words, the fitted function $\hat{p}$ is not consistent with the presumption of a flat prior $p(\theta)$.
Having seen this, we can now propose a solution: we require the fit to minimize the cost in Eq. (2) while also satisfying the constraint that $\int \hat{p}(\theta, x)\, dx$ be equal for all values of $\theta$. This constraint prevents the optimizer from improving the cost function by implicitly modifying $p(\theta)$. In practice, the constraint is applied at a few values of $\theta$, for example $\theta = 0, 0.5, 1$, as this was seen to be enough to correct the bias; in this case, the smoothness of the functional forms forced the constraint to apply everywhere. Applying this constraint, we redo the fit. The best-fit parameters obtained (the bottom row of Table 1) are now consistent with the true values used to generate the data. Accordingly, the joint PDF of the constrained fit (in Fig. 3) shows no bias towards $\theta = 0.5$. Any remaining disagreement is attributable to statistical fluctuations arising from the finite data set.

MDN With One Observable
Now we will demonstrate how this correction is applied with a Mixture Density Network. We construct a neural network using PyTorch [8] with a single input $x$, and outputs which are the parameters of a function $\tilde{p}(\theta|x)$ which estimates the likelihood $\mathcal{L}(\theta|x)$. (See §2 for details.) There is a single hidden layer with an adjustable number of nodes. Because of the intentional simplicity of these data, the number of nodes in this hidden layer was generally kept between 2 and 5, and we model the posterior density with just a single Gaussian function. Therefore, the two output nodes of the network are just the mean and standard deviation of this function.
A large number of $x$ values were generated at each of 10 equally-spaced discrete values of $\theta$ from Gaussian distributions with $\mu = \theta$ and $\sigma = 1$. The network is trained by minimizing the cost function described in §2. At each epoch, the cost is backpropagated through the network, and the Adam minimizer [9] is used to adjust the network parameters. We find that this produces biased results. (To avoid ambiguity: it is common practice in MDN applications that the output node corresponding to the Gaussian's $\sigma$ is actually $\log(\sigma)$, and the exponential of this node's value is used to calculate the loss. This effectively forces $\sigma > 0$.)
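A minimal sketch of this training setup, assuming a single-Gaussian MDN whose second output node is $\log(\sigma)$; the layer sizes, learning rate, and epoch count are illustrative choices, not those of the paper:

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

# Discrete training templates: 10 equally spaced theta values, x ~ N(theta, 1).
thetas = torch.linspace(0.05, 0.95, 10)
theta_train = thetas.repeat_interleave(1000).unsqueeze(1)
x_train = theta_train + torch.randn_like(theta_train)

# Single-Gaussian MDN: two outputs, the mean and log(sigma) (exp keeps sigma > 0).
net = nn.Sequential(nn.Linear(1, 4), nn.Tanh(), nn.Linear(4, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for epoch in range(200):
    out = net(x_train)
    mu, sigma = out[:, :1], torch.exp(out[:, 1:])
    # Negative log of the Gaussian posterior at the training theta: Eq. (2)
    nll = (0.5 * ((theta_train - mu) / sigma) ** 2
           + torch.log(sigma) + 0.5 * math.log(2 * math.pi)).mean()
    opt.zero_grad()
    nll.backward()
    opt.step()
```

Trained this way, without the corrections discussed below, the network exhibits the edge bias described in the text.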

1D Network Correction and Validation
The cost function of the network is then modified by adding an extra term,
$$C' = C + \lambda S, \tag{7}$$
where
$$S = \mathrm{std}\left\{ \int \tilde{p}(\theta_k, x)\, dx \right\} \tag{8}$$
for some set of parameter values $\theta_k$. With respect to Fig. 3, this extra term is the standard deviation of integrals of the joint probability $\tilde{p}(\theta, x)$ along vertical slices in $\theta$. This term drives these integrals to be equal, thus ensuring that the implied prior $p(\theta)$ remains constant for any value of $\theta$ and discouraging a bias towards intermediate values of $\theta$. The joint probability $\tilde{p}(\theta, x) = \tilde{p}(\theta|x)\, p(x)$ is estimated from the product of the MDN output and the probability distribution $p(x)$ over the training sample, which is estimated using histograms.
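A sketch of the correction term $S$ of Eq. (8), assuming $p(x)$ is estimated with a histogram of the training sample; the `posterior(theta, x)` callable stands in for the trained MDN's output density, and its signature is illustrative:

```python
import numpy as np

def correction_term(posterior, x_train, theta_grid, n_bins=50):
    """Estimate S of Eq. (8): the standard deviation, over theta, of
    integral_x p~(theta, x) dx, where p~(theta, x) = p~(theta|x) p(x)."""
    counts, edges = np.histogram(x_train, bins=n_bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    width = edges[1] - edges[0]
    integrals = []
    for theta in theta_grid:
        joint = posterior(theta, centers) * counts  # p~(theta, x) on the x grid
        integrals.append(np.sum(joint) * width)     # integral over x at fixed theta
    return np.std(integrals)
```

If the network's implied prior is flat, all the slice integrals agree and $S$ vanishes; any bias towards intermediate $\theta$ makes $S$ positive and is penalized through Eq. (7).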
The new hyperparameter $\lambda$ is tuned to balance the effectiveness of the additional term against the introduction of numerical instability into the minimization. In practice, its value is initially set to zero during training, allowing the network to first loosely learn the posterior. Then, in a second step of the training, its value is increased so that the values of the two terms in Eq. (7) are of comparable size. With this procedure, the outputs of the network (the coefficients of the Gaussian posterior density function) correctly match expectations, as seen in Fig. 5.
Next, we demonstrate the ability of the network to reconstruct the underlying parameter given a set of observations. Each measurement of the quantity $x$ will be denoted $x_i$. For each $x_i$, the trained network outputs an estimate of the posterior function $\tilde{p}(\theta|x_i)$. To calculate the combined cost of an entire set of measurements $\{x_i\}$, we take the negative logarithm of the product of the individual posteriors,
$$C(\theta) = -\sum_i \ln \tilde{p}(\theta|x_i).$$
The maximum likelihood estimator is defined as
$$\theta_{\mathrm{ML}} = \operatorname*{arg\,min}_\theta\, C(\theta).$$
A likelihood-ratio test can be used to determine the statistical uncertainty of the estimator $\theta_{\mathrm{ML}}$. An example of the functions generated for this trained network is given in Fig. 6.
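The combination of per-measurement posteriors into $C(\theta)$, a grid-scan estimate of $\theta_{\mathrm{ML}}$, and a Wilks-style interval where $C$ rises by 0.5 can be sketched as follows; the `posterior` callable is an assumed stand-in for the trained MDN:

```python
import numpy as np

def neg_log_likelihood(posterior, xs, theta_grid):
    """C(theta) = -sum_i ln p~(theta | x_i), evaluated on a grid of theta."""
    return np.array([-np.sum(np.log(posterior(t, xs))) for t in theta_grid])

def mle_with_interval(posterior, xs, theta_grid):
    """Maximum-likelihood point and the interval where C rises by 0.5
    above its minimum (Wilks' theorem, one parameter)."""
    C = neg_log_likelihood(posterior, xs, theta_grid)
    i = np.argmin(C)
    inside = theta_grid[C <= C[i] + 0.5]
    return theta_grid[i], inside.min(), inside.max()
```

For a Gaussian posterior centered on each $x_i$, the minimum of $C(\theta)$ falls at the sample mean, as expected for this toy case.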
To test the accuracy and precision of the network, we choose 20 equally-spaced test values $\theta_t$ in the range $[0, 1]$. For each $\theta_t$, we generate a set of 10,000 measurements $\{x_i\}$. The pseudodata are passed through the network for each $x_i$, the posterior density functions are calculated, and a $\theta_{\mathrm{ML}}$ is found. The uncertainty in $\theta_{\mathrm{ML}}$ is estimated using Wilks' theorem [10] (searching for values of $\theta$ that increase $C$ by 0.5). If the network is accurate and unbiased, we should find statistical agreement between $\theta_t$ and $\theta_{\mathrm{ML}}$. Indeed, we found this to be the case.

Demonstration With 2D Network
Next, we consider a more advanced usage of MDNs. We consider a system with two inputs, $\mathbf{x} = (x_1, x_2)$. The value of $x_1$ is sensitive to an underlying parameter $\theta$, but only for small values of $x_2$; when $x_2$ is large, there is no sensitivity. This is chosen to demonstrate how this neural network technique can be an advantageous approach to parameter estimation: the MDN will automatically learn when observables offer meaningful information on parameters and when they do not.

2D Toy Model
The model used to generate 2D pseudodata comprises two components. The first is events with a small $x_2$ value. These events have an $x_1$ value which depends on the unknown parameter $\theta$. In the toy model, this component is modeled by a Gaussian in $x_1$. The mean of this Gaussian varies with $\theta$. In $x_2$, it is modeled by a Gaussian with a (lower) mean of 1.
In the other component, the value of $x_2$ is large, and the $x_1$ value is not related to the unknown parameter. This component is modeled with a Gamma distribution in $x_1$, and a Gaussian with a (higher) mean of 3 in $x_2$. The relative fraction of the two components is the final model parameter.
Written out, the 2D PDF is the weighted sum of these two components, each the product of its $x_1$ and $x_2$ factors, with the relative fraction as the weight. This PDF is shown for the two extremal values of $\theta$ in Fig. 8.
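An illustrative generator for this two-component toy model is sketched below. The paper does not specify the numerical widths, the Gamma shape/scale, or the exact dependence of the mean on $\theta$, so all numerical values here (other than the component means of 1 and 3 in $x_2$) are assumptions for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_2d_toy(theta, n, frac=0.5):
    """Draw n samples (x1, x2) from the two-component toy model.
    Component 1 (theta-sensitive): x1 ~ Gaussian with theta-dependent mean,
                                   x2 ~ Gaussian with (lower) mean 1.
    Component 2 (theta-blind):     x1 ~ Gamma distribution,
                                   x2 ~ Gaussian with (higher) mean 3.
    Widths, Gamma parameters, and frac are illustrative assumptions."""
    n1 = rng.binomial(n, frac)
    x1_a = rng.normal(theta, 0.3, n1)      # mean varies with theta (assumed slope 1)
    x2_a = rng.normal(1.0, 0.4, n1)
    x1_b = rng.gamma(2.0, 0.5, n - n1)     # assumed shape and scale
    x2_b = rng.normal(3.0, 0.4, n - n1)
    return np.concatenate([np.stack([x1_a, x2_a], axis=1),
                           np.stack([x1_b, x2_b], axis=1)])
```

Samples generated this way reproduce the qualitative structure of Fig. 8: $x_1$ carries information about $\theta$ only in the low-$x_2$ component.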

2D Network Structure, Correction, and Validation
The network is built in the following manner. First, we note that there are two inputs; accordingly, there are two input nodes. Next, one should note from the contours of $p(x_1, x_2)$ in Fig. 8 that, for a sample $\mathbf{x}$ with large $x_2$, the value of $x_1$ has no predictive power for $\theta$: the red and blue contours overlap entirely on the upper half of the plot. However, for small values of $x_2$, the $x_1$ variable can distinguish different values of $\theta$; the red and blue contours are separated on the lower half of the plot. Two hidden layers are used.

The output of the network is the set of coefficients of a mixture of Gaussian distributions. The number of Gaussians in the mixture is a hyperparameter which may be tuned to best match the particular structure of the data. In this example, the data are bimodal, so it is reasonable to expect that two Gaussians suffice. It was confirmed by trial and error that a mixture of two Gaussians was adequate to model these data, since it produced the correct linear response in the validation of the network. Therefore, for this network, there are six output nodes (with one normalization constraint) for a total of five independent values. The correction term presented in Eq. (7) is applied as well. To do this, $p(\mathbf{x})$ is estimated using a 2D histogram. It is then multiplied by the MDN output posterior $\tilde{p}(\theta|\mathbf{x})$ to obtain the joint probability $\tilde{p}(\theta, \mathbf{x})$. Finally, the integrals in Eq. (8) are calculated. The training data consist of 100,000 samples generated according to $p(x_1, x_2|\theta)$ at each of 10 equally-spaced discrete values of $\theta$.

To validate the network's performance, we generate test data sets $\{\mathbf{x}_i\}$ for various test parameter values $\theta_t$. As in the one-observable case of §5.2, each point $\mathbf{x}_i$ is passed through the network and a function $\tilde{p}(\theta|\mathbf{x}_i)$ is calculated. We then find the maximum likelihood estimator of $\theta$. An accurate network should, within statistical uncertainty, reproduce the line $\theta_{\mathrm{ML}} = \theta_t$. Indeed, Fig. 10 shows that the network is accurate and unbiased, and that it provides a reasonable uncertainty estimate.

Method Limitations
In the previous section, we demonstrated the necessary corrections for networks taking one or two values for each observation. In principle, this method could work with an arbitrarily high number of input observables. However, the current correction technique has a curse-of-dimensionality problem, in that computing the correction term $S$ defined in Eq. (8) requires an estimate of $p(\mathbf{x})$ over the full training data set. When determined using histograms with $n$ bins per axis, $d$ observables will require an $n^d$-bin histogram to be reasonably well populated. Other techniques, such as generic kernel density estimation [11] or $k$-nearest neighbors, run into the same issues eventually, although at differing rates, and the best option should be explored in each specific situation. The dimensionality issue is alleviated somewhat by the fact that only one function ($p(\mathbf{x})$ over the entire training dataset) needs to be estimated, unlike methods which need to generate templates separately at each generated parameter point.
Another limitation concerns the spacing of training points. While we have demonstrated that MDNs can be trained with discrete templates of data and still interpolate properly to the full continuous parameter space, it has been necessary to use templates from parameters which are uniformly distributed, to enforce that $p(\theta)$ is flat. If the parameter values are not equally spaced, this would correspond to a non-flat $p(\theta)$. In principle, weights could be used during training to correct for this effect, but reconstructing $p(\theta)$ from the distribution of $\theta$ values is itself a density estimation problem, and it is not clear whether there is an optimal estimation method for the required $p(\theta)$ if only a limited number of $\theta$ values are available. This is an area for future investigation.

Conclusions
Mixture Density Networks can output posterior density functions which robustly approximate likelihood functions and enable the estimation of parameters from data, even if the training data are available only at discrete parameter values. This method permits one to proceed directly from (possibly multiple) observables to a likelihood function without having to perform the intermediate step of creating a statistical model for the observables, as would be required by many parameter estimation techniques. The MDN technique can be applied even with training data that provide a discrete set of parameter points, provided that the points are spaced evenly and certain corrections and constraints are applied during the training of the network.
An introductory tutorial has been implemented in a Jupyter Notebook and made public [12].