
1 Introduction

There are many instances where an object is described by a large number of data whose values are determined by a small number of parameters. Often, it is only these parameters which are of interest.

Such a problem is very general, and has been attacked in the case of parameter estimation in large-scale structure and the microwave background (e.g. [1]). Previous work has concentrated largely on the estimation of a single parameter; the main advance of this paper is that it sets out a method for the estimation of multiple parameters. The method provides one projection per parameter, with the consequent possibility of a massive data compression factor. Furthermore, if the noise in the data is independent of the parameters, then the method is entirely lossless, i.e. the compressed dataset contains as much information about the parameters as the full dataset, in the sense that the Fisher information matrix is the same for the compressed dataset as for the entire original dataset. An equivalent statement is that, at the peak, the mean likelihood surface is locally identical whether the full or compressed data are used.

2 MOPED

A data-compression method was developed by Heavens et al. [2]. We review it here.

We describe the data by a vector x i , i = 1, …, N (e.g. a set of fluxes at different wavelengths). These measurements include a signal part, which we denote by μ, and noise, n:

$$\mathbf{x} = \mu + \mathbf{n}$$
(29.1)

Assuming the noise has zero mean, 〈x〉 = μ, the signal will depend on a set of parameters {θα}, which we wish to determine. For galaxy spectra, the parameters may be, for example, age, magnitude of source, metallicity and some parameters describing the star formation history. Thus, μ is a noise-free spectrum of a galaxy with certain age, metallicity etc.

The noise properties are described by the noise covariance matrix, C, with components C ij  = 〈n i n j 〉. If the noise is gaussian, the statistical properties of the data are determined entirely by μ and C. In principle, the noise can also depend on the parameters. For example, in galaxy spectra, one component of the noise will come from photon counting statistics, and the contribution of this to the noise will depend on the mean number of photons expected from the source.

The aim is to derive the parameters from the data. If we assume uniform priors for the parameters, then the a posteriori probability for the parameters is the likelihood, which for gaussian noise is

$$\begin{array}{rcl} \mathcal{L}({\theta }_{\alpha })& =&{ 1 \over {(2\pi )}^{N/2}\sqrt{\det (\bf{C} )}} \\ & & \times \exp \left [-{ 1 \over 2} \sum\limits_{i,j}({x}_{i} - {\mu }_{i}){\bf{C}}_{ij}^{-1}({x}_{ j} - {\mu }_{j})\right ].\end{array}$$
(29.2)
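
As an illustration, a minimal numerical sketch of evaluating (29.2) (in log form) is given below; `mu_model` is a hypothetical user-supplied function returning the noise-free model μ(θ) and is not part of the method itself.

```python
# A minimal sketch of the full-data Gaussian log-likelihood (29.2).
# `mu_model` is an assumed function returning mu(theta) as an N-vector;
# C is the N x N noise covariance matrix.
import numpy as np

def full_lnlike(x, theta, mu_model, C):
    r = x - mu_model(theta)                 # residual x - mu(theta)
    Cinv_r = np.linalg.solve(C, r)          # C^{-1} (x - mu)
    _, logdet = np.linalg.slogdet(C)        # log det C
    N = x.size
    return -0.5 * (N * np.log(2.0 * np.pi) + logdet + r @ Cinv_r)
```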

One approach is simply to find the (highest) peak in the likelihood, by exploring the full parameter space and using all N pixels. The position of the peak gives estimates of the parameters which are asymptotically (low noise) the best unbiased estimators. This is therefore the best we can do. The maximum-likelihood procedure can, however, be time-consuming if N is large and the parameter space is large. The aim of this paper is to see whether we can reduce the N numbers to a smaller number, without increasing the uncertainties on the derived parameters θα. To be specific, we try to find a number N′ < N of linear combinations of the spectral data x which encompass as much as possible of the information about the physical parameters. We find that this can be done losslessly in some circumstances; the spectra can be reduced to a handful of numbers without loss of information. The speed-up in parameter estimation is a factor of ∼ 100.

In general, reducing the dataset in this way will lead to larger error bars in the parameters. To assess how well the compression is doing, consider the behaviour of the (logarithm of the) likelihood function near the peak. Performing a Taylor expansion and truncating at the second-order terms,

$$\ln \mathcal{L} =\ln {\mathcal{L}}_{\mathrm{peak}} +{ 1 \over 2} { {\partial }^{2}\ln \mathcal{L} \over \partial {\theta }_{\alpha }\partial {\theta }_{\beta }} \Delta {\theta }_{\alpha }\Delta {\theta }_{\beta }.$$
(29.3)

Truncating here assumes that the likelihood surface itself is adequately approximated by a gaussian everywhere, not just at the maximum-likelihood point. The actual likelihood surface will vary when different data are used; on average, though, the width is set by the (inverse of the) Fisher information matrix:

$${ \bf{F}}_{\alpha \beta } \equiv -\left \langle { {\partial }^{2}\ln \mathcal{L} \over \partial {\theta }_{\alpha }\partial {\theta }_{\beta }} \right \rangle$$
(29.4)

where the average is over an ensemble with the same parameters but different noise.

For a single parameter, the Fisher matrix reduces to a scalar F, and the error on the parameter can be no smaller than \({F}^{-1/2}\). If the data depend on more than one parameter, and all the parameters have to be estimated from the data, then the error is larger. The error on one parameter α (marginalised over the others) is at least \({\left [{({\bf{F}}^{-1})}_{\alpha \alpha }\right ]}^{1/2}\). There is a little more discussion of the Fisher matrix in [1], hereafter TTH. The Fisher matrix depends on the signal and noise terms in the following way (TTH, equation 15):

$$\begin{array}{rcl}{ \bf{F}}_{\alpha \beta } ={ 1 \over 2} \mathrm{Tr}\left [{\bf{C}}^{-1}{\bf{C}}_{,\alpha }{\bf{C}}^{-1}{\bf{C}}_{,\beta } +{ \bf{C}}^{-1}({\mu }_{,\alpha }{\mu }_{,\beta }^{t} + {\mu }_{,\beta }{\mu }_{,\alpha }^{t})\right ].& &\end{array}$$
(29.5)

where the comma indicates derivative with respect to the parameter. If we use the full dataset x, then this Fisher matrix represents the best that can possibly be done via likelihood methods with the data.
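
As a concrete illustration (not part of the original derivation), the sketch below evaluates (29.5) numerically; `dmu[:, a]` is assumed to hold μ ,a and `dC[a]` the matrix C ,a at the fiducial model.

```python
# A rough sketch of the Fisher matrix (29.5); names are assumptions for
# illustration.  Pass dC=None when the noise covariance is parameter-independent.
import numpy as np

def fisher_matrix(C, dmu, dC=None):
    M = dmu.shape[1]
    Cinv = np.linalg.inv(C)
    F = np.empty((M, M))
    for a in range(M):
        for b in range(M):
            # mean term: 1/2 Tr[C^{-1}(mu_{,a} mu_{,b}^t + mu_{,b} mu_{,a}^t)]
            #            = mu_{,a}^t C^{-1} mu_{,b}
            F[a, b] = dmu[:, a] @ Cinv @ dmu[:, b]
            if dC is not None:
                # noise term: 1/2 Tr[C^{-1} C_{,a} C^{-1} C_{,b}]
                F[a, b] += 0.5 * np.trace(Cinv @ dC[a] @ Cinv @ dC[b])
    return F
```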

In practice, some of the data may tell us very little about the parameters, either through being very noisy, or through having no sensitivity to the parameters. So in principle we may be able to throw some data away without losing very much information about the parameters. Rather than throwing individual data away, we can do better by forming linear combinations of the data, and then throwing away the combinations which tell us least. To proceed, we first consider a single linear combination of the data:

$$y \equiv {\bf{b}}^{t}\mathbf{x}$$
(29.6)

for some weighting vector b (t indicates transpose). We will try to find a weighting which captures as much information as possible about a particular parameter, θα. If we assume we know all the other parameters, this amounts to maximising F αα. The dataset (now consisting of a single number) has a Fisher matrix, which is given in TTH (equation 25) by:

$${ \bf{F}}_{\alpha \beta } ={ 1 \over 2} \left ({ {\bf{b}}^{t}{\bf{C}}_{,\alpha }\bf{b} \over {\bf{b}}^{t}\bf{C}\bf{b}} \right )\left ({ {\bf{b}}^{t}{\bf{C}}_{,\beta }\bf{b} \over {\bf{b}}^{t}\bf{C}\bf{b}} \right ) +{ ({\bf{b}}^{t}{\mu }_{,\alpha })({\bf{b}}^{t}{\mu }_{,\beta }) \over ({\bf{b}}^{t}\bf{C}\bf{b})}.$$
(29.7)

Note that the denominators are simply numbers. It is clear from this expression that if we multiply b by a constant, we get the same F. This makes sense: multiplying the data by a constant factor does not change the information content. We can therefore fix the normalisation of b at our convenience. To simplify the denominators, we therefore maximise F αα subject to the constraint

$${ \bf{b}}^{t}\bf{C}\bf{b} = 1.$$
(29.8)

The most general problem has both the mean μ and the covariance matrix C depending on the parameters of the spectrum, and the resulting maximisation leads to an eigenvalue problem which is nonlinear in b. We are unable to solve this, so we consider a case for which an analytic solution can be found. TTH showed how to solve for the case of estimation of a single parameter in two special cases: (1) when μ is known, and (2) when C is known (i.e. doesn’t depend on the parameters). We will concentrate on the latter case, but generalise to the problem of estimating many parameters at once. For a single parameter, TTH showed that the entire dataset could be reduced to a single number, with no loss of information about the parameter. We show below that, if we have M parameters to estimate, then we can reduce the dataset to M numbers. These M numbers contain just as much information as the original dataset; i.e. the data compression is lossless.

We consider the parameters in turn. With C independent of the parameters, F simplifies, and maximising F 11 subject to the constraint requires

$${ \partial \over \partial {b}_{i}} \left ({b}_{j}{\mu }_{,1\,j}{b}_{k}{\mu }_{,1\,k} - \lambda {b}_{j}{C}_{jk}{b}_{k}\right ) = 0$$
(29.9)

where λ is a Lagrange multiplier, and we assume the summation convention (j, k ∈ [1, N]). This leads to

$${\mu }_{,1}({\bf{b}}^{t}{\mu }_{,1}) = \lambda \bf{C}\bf{b}$$
(29.10)

with solution, properly normalised

$${ \bf{b}}_{1} ={ { \bf{C}}^{-1}{\mu }_{,1} \over \sqrt{{\mu }_{,1 }^{t }{\bf{C} }^{-1 } {\mu }_{,1}}}$$
(29.11)

and our compressed datum is the single number y 1 = b 1 t x. This solution makes sense—ignoring the unimportant denominator, the method weights high those data which are parameter-sensitive, and low those data which are noisy.
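
A minimal sketch of (29.11), under the same fixed-C assumption, is given below; `dmu1` is assumed to hold μ ,1 evaluated at the fiducial model.

```python
# Sketch of the single-parameter weighting vector (29.11) and the compressed
# datum y_1 = b_1^t x; variable names are illustrative assumptions.
import numpy as np

def b_single(C, dmu1):
    Cinv_dmu = np.linalg.solve(C, dmu1)          # C^{-1} mu_{,1}
    return Cinv_dmu / np.sqrt(dmu1 @ Cinv_dmu)   # normalised so b^t C b = 1

# usage: y1 = b_single(C, dmu1) @ x    (x is the N-pixel data vector)
```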

To see whether the compression is lossless, we compare the Fisher matrix element before and after the compression. Substitution of b 1 into (29.7) gives

$$\begin{array}{rcl}{ \mathbf{F}}_{11} = {\mu }_{,1}^{t}{\bf{C}}^{-1}{\mu }_{,1}& &\end{array}$$
(29.12)

which is identical to the Fisher matrix element using the full data (29.5) if C is independent of θ1. Hence, as claimed by TTH, the compression from the entire dataset to the single number y 1 loses no information about θ1. For example, if μ ∝ θ, then y 1 =  ∑ i x i  ∕  ∑ i μ i and is simply an estimate of the parameter itself.

It is important to note that y 1 contains as much information about θ1 as the full dataset only if all the other parameters are known, and provided that the covariance matrix and the derivative of the mean in (29.11) are those at the maximum-likelihood point. We turn to the first of these restrictions in the next section, and discuss the second one here.

In practice, one does not know beforehand what the true solution is, so one has to make an initial guess for the parameters. This guess we refer to as the fiducial model. We compute the covariance matrix C and the gradient of the mean (μ ,α) for this fiducial model, to construct b 1. The Fisher matrix for the compressed datum is (29.12), but with the fiducial values inserted. In general this is not the same as the Fisher matrix at the true solution. In practice one can iterate: choose a fiducial model; use it to estimate the parameters; and then repeat, using the estimated parameters as the new fiducial model.

2.1 Estimation of Many Parameters

The problem of estimating a single parameter from a set of data is unusual in practice. Normally one has several parameters to estimate simultaneously, and this introduces substantial complications into the analysis. How can we generalise the single-parameter estimate above to the case of many parameters? We proceed by finding a second number y 2 ≡ b 2 t x by the following requirements:

  • y 2 is uncorrelated with y 1. This demands that b 2 t Cb 1 = 0.

  • y 2 captures as much information as possible about the second parameter θ2.

This requires two Lagrange multipliers (we normalise b 2 by demanding that b 2 t Cb 2 = 1 as before). Maximising and applying the constraints gives the solution

$${ \bf{b}}_{2} ={ { \bf{C}}^{-1}{\mu }_{,2} - ({\mu }_{,2}^{t}{\bf{b}}_{1}){\bf{b}}_{1} \over \sqrt{{\mu }_{,2}^{t}{\bf{C}}^{-1}{\mu }_{,2} - {({\mu }_{,2}^{t}{\bf{b}}_{1})}^{2}}}.$$
(29.13)

This is readily generalised to any number M of parameters. There are then M orthogonal vectors b m , m = 1, …, M, each y m capturing as much information as possible about parameter θ m that is not already contained in y q , q < m. The constrained maximisation gives

$${ \bf{b}}_{m} ={ { \bf{C}}^{-1}{\mu }_{,m} -{\sum \nolimits }_{q=1}^{m-1}({\mu }_{,m}^{t}{\bf{b}}_{q}){\bf{b}}_{q} \over \sqrt{{\mu }_{,m}^{t}{\bf{C}}^{-1}{\mu }_{,m} -{ \sum \nolimits }_{q=1}^{m-1}{({\mu }_{,m}^{t}{\bf{b}}_{q})}^{2}}}.$$
(29.14)

This procedure is analogous to Gram-Schmidt orthogonalisation with a curved metric, with C playing the role of the metric tensor. Note that the procedure gives precisely M eigenvectors and hence M numbers, so the dataset has been compressed from the original N data down to the number of parameters M.
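
The construction (29.14) can be sketched in a few lines of code; this is an illustration under the fixed-C assumption, with `dmu[:, m]` assumed to hold μ ,m at the fiducial model.

```python
# Sketch of the Gram-Schmidt-like construction (29.14) of the M weighting
# vectors; illustrative only, assuming C is parameter-independent.
import numpy as np

def moped_vectors(C, dmu):
    N, M = dmu.shape
    B = np.empty((M, N))
    for m in range(M):
        num = np.linalg.solve(C, dmu[:, m])      # C^{-1} mu_{,m}
        den = dmu[:, m] @ num                    # mu_{,m}^t C^{-1} mu_{,m}
        for q in range(m):
            proj = dmu[:, m] @ B[q]              # mu_{,m}^t b_q
            num -= proj * B[q]
            den -= proj ** 2
        B[m] = num / np.sqrt(den)                # b_m^t C b_m = 1 by construction
    return B

# compressed data: y = moped_vectors(C, dmu) @ x   (M numbers from N pixels)
```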

Since, by construction, the numbers y m are uncorrelated, the likelihood of the parameters is obtained by multiplication of the likelihoods obtained from each statistic y m . The y m have mean 〈y m 〉 = b m tμ and unit variance, so the likelihood from the compressed data is simply

$$\begin{array}{rcl} \ln \mathcal{L}({\theta }_{\alpha }) = \mathrm{constant} -\sum\limits_{m=1}^{M}{ {({y}_{m} -\langle {y}_{m}\rangle )}^{2} \over 2} & &\end{array}$$
(29.15)

and the Fisher matrix of the combined numbers is just the sum of the individual Fisher matrices. Note once again the role of the fiducial model in setting the weightings b m : the orthonormality of the new numbers only holds if the fiducial model is correct. Multiplication of the likelihoods is thus only approximately correct, but iteration could be used if desired.
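
A minimal sketch of evaluating (29.15), reusing the `moped_vectors` and `mu_model` assumptions from the sketches above, where `B` is the matrix of weighting vectors built at the fiducial model:

```python
# Sketch of the compressed log-likelihood (29.15), up to an additive constant.
import numpy as np

def compressed_lnlike(x, theta, mu_model, B):
    y = B @ x                        # compressed data y_m = b_m^t x
    y_mean = B @ mu_model(theta)     # <y_m> = b_m^t mu(theta)
    return -0.5 * np.sum((y - y_mean) ** 2)
```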

Under the assumption that the covariance matrix is independent of the parameters, reduction of the original data to the M numbers y m results in no loss of information about the M parameters at all. In fact the set y m produces, on average, a likelihood surface which is locally identical to that from the entire dataset—no information about the parameters is lost in the compression process. With the restriction that the information is defined locally by the Fisher matrix, the set {y m } is a set of sufficient statistics for the parameters {θα}. A proof of this for an arbitrary number of parameters is given in the appendix.

3 The General Case

In general, the covariance matrix does depend on the parameters, and this is the case for galaxy spectra, where at least one component of the noise is parameter-dependent. This is the photon counting noise, for which C ii  = μ i . TTH argued that it is better to treat this case by using the n eigenvectors which arise from assuming the mean is known, rather than the single number (for one parameter) which arises if we assume that the covariance matrix is known, as above. We find that, on the contrary, the small number of eigenvectors b m allow a much greater degree of compression than the known-mean eigenvectors (which in this case are simply individual pixels, ordered by | μ, α ∕ μ | ). For data signal-to-noise of around 2, the latter allow a data compression by about a factor of 2 before the errors on the parameters increase substantially, whereas the method here allows drastic compression from thousands of numbers to a handful. To show what can be achieved, we use a set of simulated galaxy spectra to constrain a few parameters characterising the galaxy star formation history.

In the case when the covariance matrix is independent of the parameters, it does not matter which parameter we choose to form y 1, y 2, etc, as the likelihood surface from the compressed numbers is, on average, locally identical to that from the full dataset. However, in the general case, the procedure does lose information, and the amount of information lost could depend on the order of assignment of parameters to m. If the parameter estimates are correlated, the error in both parameters is dominated by the length of the likelihood contours along the ‘ridge’. It makes sense then to diagonalise the matrix of second derivatives of \(\ln \mathcal{L}\) at the fiducial model, and use these as the parameters (temporarily). The parameter eigenvalues would order the importance of the parameter combinations to the likelihood. The procedure would be to take the smallest eigenvalue (with eigenvector lying along the ridge), and make the likelihood surface as narrow as possible in that direction. One then repeats along the parameter eigenvectors in increasing order of eigenvalue.

Specifically, diagonalise F αβ in (29.5) to form a diagonal matrix Λ = S t FS. The orthogonal parameter combinations are ψ = S tθ, where S has the normalised eigenvectors of F as its columns. The weighting vectors b m are then computed from (29.14) by replacing \(\mu_{,\alpha_p}\) with \(S_{pr}\,\mu_{,\alpha_r}\).
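
A sketch of this rotation, reusing the `moped_vectors` sketch above (an illustration, not the authors' code):

```python
# Sketch of the parameter rotation: diagonalise the fiducial Fisher matrix
# and build the weighting vectors along the eigen-directions, starting with
# the smallest eigenvalue (the direction along the likelihood 'ridge').
import numpy as np

def rotated_vectors(C, dmu, F):
    eigvals, S = np.linalg.eigh(F)     # eigenvalues in increasing order
    dmu_rot = dmu @ S                  # derivatives w.r.t. psi = S^t theta
    return moped_vectors(C, dmu_rot)   # b_m ordered by increasing eigenvalue
```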

4 Extension to MOPED Using an Ensemble of Fiducial Models

Unlike the case of galaxy spectra [2], in cases where the signal is sparsely distributed within the full dataset (e.g. the light transit of an exoplanet), the fiducial model will weight some data highly, and very erroneously so if the fiducial model is far from the true model. This is because the derivatives of the fiducial model with respect to the parameters are large near the walls of the box-like shape of the model.

In this section we present an alternative approach to finding the best-fitting transit model to a light-curve. Although the method is illustrated for the case of exoplanet searches [3], it is fully general and can be applied to other cases such as gravitational wave detection [4]. The method is based on using an ensemble of randomly chosen fiducial models. For an arbitrary fiducial model the likelihood function will have several maxima, one of which is guaranteed to be the correct solution: this is the case where the values of the free parameters (\(\bf{q}\)) are close to the true ones, so that \(\mu (\bf{q})\) is similar to x. For a different arbitrary fiducial model there are also several maxima, but again only one, the true one, is guaranteed to be among them. Therefore, by using several fiducial models one can eliminate the spurious maxima and keep the one that is common to all the fiducial models, which is the true one. We combine the MOPED likelihoods for different fiducial models by simply averaging them.

The new measure Y is defined:

$$Y (\bf{q}) \equiv \frac{1} {{N}_{f}} \sum\limits_{\{\bf{q}_{f}\}}\mathcal{L}(\bf{q};\bf{q}_{f})\,\,\,,$$
(29.16)

where \(\bf{q}\) and \(\bf{q}_{f}\) are the parameter vectors {T, η, θ, τ} and their fiducial values {T f , η f , θ f , τ f } and N f is the number of fiducial models. The summation is over an ensemble of fiducial models \(\{\bf{q}_{f}\}\). \(\mathcal{L}(\bf{q};\bf{q}_{f})\) is the MOPED likelihood, i.e.

$$\mathcal{L}(\bf{q};\bf{q}_{f}) = \sum\limits_{m}{\left [{b}_{m}(\bf{q}_{f}) \cdot x-{b}_{m}(\bf{q}_{f}) \cdot \mu (\bf{q})\right ]}^{2}$$
(29.17)
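
A rough sketch of (29.16)-(29.17) is given below; `fiducials` is a list of randomly drawn fiducial parameter vectors, and `dmu_at` is a hypothetical helper returning the model derivatives at a given fiducial model (the `moped_vectors` and `mu_model` names are carried over from the earlier sketches).

```python
# Sketch of the ensemble statistic Y(q): average the MOPED sums of squares
# (29.17) over an ensemble of fiducial models (29.16).  Illustrative only.
import numpy as np

def Y_statistic(q, x, fiducials, C, dmu_at, mu_model):
    total = 0.0
    for qf in fiducials:
        B = moped_vectors(C, dmu_at(qf))                  # b_m at this fiducial model
        total += np.sum((B @ x - B @ mu_model(q)) ** 2)   # (29.17)
    return total / len(fiducials)                         # average over the ensemble
```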

Figure 29.1 shows Y as a function of period T for fiducial-model ensembles of different sizes, for a synthetic light-curve with \(S/N = 3\) and 2,000 observations. The top panel shows the value of Y using an ensemble of three fiducial models. As can be seen from the figure, there are more than a few minima. Using an ensemble of ten fiducial models (shown in the next panel) reduces the number of minima. In the last panel we used an ensemble of 20 fiducial models and there is only one obvious minimum, the true one.

Fig. 29.1

Y as a function of period T for a set of fiducial models for a synthetic light-curve with \(S/N = 3\), 2,000 observations and T = 1.3 days. The top panel shows the value of Y using three randomly selected fiducial models, the middle panel 10 and the bottom 20. As the number of fiducial models used increases, the number of minima decreases. At N f  = 20 there is only one obvious minimum, at T = 1.3 days

Figure 29.2 shows the value of Y as a function of each free parameter for a synthetic light-curve. We set three of the parameters to the “correct” values (those used to construct the light-curve) and leave the fourth free in each panel. Note that the shape of Y as a function of η, θ and τ is smooth; however, the dependence on T is erratic, suggesting that efficient minimization techniques are not applicable.

Fig. 29.2

Top panel left: likelihood as a function of period T. Top panel right: likelihood as a function of transit duration η. Bottom panel left: likelihood as a function of θ. Bottom panel right: likelihood as a function of τ. For all parameters the correct value is recovered. Note that for T the topology of the likelihood surface is fairly complicated, with many local minima, making efficient minimization techniques inapplicable

4.1 Confidence and Error Analysis

To determine with confidence that the minimum found is not spurious, the likelihood of the candidate solution must be compared to the value and distribution of Y derived from a set of light-curves with no transit signal. One can simulate a set of null light-curves and build a distribution by calculating the value of Y for each point in parameter space for each simulated “null” light-curve; this is a computationally expensive task. Alternatively, this null distribution can be derived analytically.

Since x ∼ N(〈x〉, σ x ) and all other variables are deterministic, it can be shown that \(Y (\bf{q})\) follows a non-central \({\chi }^{2}\) distribution, \(Y (\bf{q}) \sim {\chi }^{2}(r,\lambda )\), where r is the number of degrees of freedom and λ is the non-centrality of the distribution. The non-central \({\chi }^{2}\) distribution has mean and variance according to:

$$\begin{array}{rcl} \mu & =& r + \lambda \,\,\,,\end{array}$$
(29.18)
$$\begin{array}{rcl}{ \sigma }^{2}& =& 2(r + 2\lambda )\,\,\,,\end{array}$$
(29.19)

where r = 4 and λ is given by

$$\lambda = \frac{{\mathrm{E}}^{2}\left [\chi \right ]} {\mathrm{var}\left [\chi \right ]}\,\,.$$
(29.20)

The square of the expectation value is,

$${ \mathrm{E}}^{2}\left [\chi \right ] = \sum\limits_{m}{\left [\langle x\rangle \,\,{B}_{m}(\bf{q}_{f}) - {D}_{m}(\bf{q};\bf{q}_{f})\right ]}^{2}$$
(29.21)

where we define

$${B}_{m}(\bf{q}_{f}) \equiv \sum\limits_{t}{b}_{m}^{t}(\bf{q}_{ f}),\ \ \mathrm{and}\ \ \ {D}_{m}(\bf{q};\bf{q}_{f}) \equiv {b}_{m}(\bf{q}_{f})\, \cdot \mu (\bf{q})$$
(29.22)

and the variance is given by

$$\begin{array}{rcl} \mathrm{var}\left [\chi \right ]& =& \mathrm{var}\left [\sum\limits_{m}{b}_{m}(\bf{q}_{f}) \cdot x-\sum\limits_{m}{b}_{m}(\bf{q}_{f}) \cdot \mu (\bf{q})\right ] \\ & =& \sum\limits_{m}{\left \vert {b}_{m}(\bf{q}_{f})\right \vert }^{2}\mathrm{var}\left [{x}^{t}\right ] = {\sigma }_{x}^{2}\sum\limits_{m}{\beta }_{m}(\bf{q}_{f}) \end{array}$$
(29.23)

where we define \({\beta }_{m}(\bf{q}_{f})\) to be:

$${\beta }_{m}(\bf{q}_{f}) \equiv {b}_{m}(\bf{q}_{f}) \cdot {b}_{m}(\bf{q}_{f})\,\,\,.$$
(29.24)

Combining the above equations we get

$$\lambda = \frac{\sum\limits_{m}{\left [\langle x\rangle \,\,{B}_{m}(\bf{q}_{f}) - {D}_{m}(\bf{q};\bf{q}_{f})\right ]}^{2}} {{\sigma }_{x}^{2}\,\sum\limits_{m}{\beta }_{m}(\bf{q}_{f})}$$
(29.25)

To compute confidence levels for a particular Y we integrate a non-central \({\chi }^{2}\) distribution, with non-centrality given by (29.25), from \(Y (\bf{q})\) to infinity. This is done numerically, but it is a very quick operation. Furthermore, it only needs to be performed a few times per light-curve.
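
A minimal sketch of this step, assuming SciPy is available, is given below; `B_vecs` holds the b m at a fiducial model, `x_mean` is the constant null mean 〈x〉, `mu_q` is the model at the candidate parameters and `sigma_x` the per-point noise, all illustrative assumptions rather than the authors' implementation.

```python
# Sketch of the analytic confidence estimate: build the non-centrality of
# (29.25) from the quantities defined above and integrate the non-central
# chi-squared (r = 4 degrees of freedom) from the observed Y upward.
import numpy as np
from scipy.stats import ncx2

def null_tail_probability(Y_obs, B_vecs, x_mean, mu_q, sigma_x, r=4):
    Bm = B_vecs.sum(axis=1)                 # B_m = sum_t b_m^t          (29.22)
    Dm = B_vecs @ mu_q                      # D_m = b_m . mu(q)          (29.22)
    beta = np.sum(B_vecs ** 2, axis=1)      # beta_m = b_m . b_m         (29.24)
    lam = np.sum((x_mean * Bm - Dm) ** 2) / (sigma_x ** 2 * beta.sum())  # (29.25)
    return ncx2.sf(Y_obs, df=r, nc=lam)     # P(Y > Y_obs) under the null
```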

Figure 29.3 shows the values of Y (T) for the null case (i.e. a light-curve without a transit), both simulated (crosses) and theoretically calculated using the equations above (the solid line is the expected value and the dotted line is the 80% confidence level) (Fig. 29.4). It is clear that the simulated values agree well with the theoretical ones. Note that because the confidence level can be calculated analytically, we do not have to simulate null light-curves and recalculate Y for each light-curve, thus gaining computational speed.

Fig. 29.3

Values of Y (T) for the null case (i.e. a light-curve without a transit) both simulated (crosses) and analytically calculated (see Sect. 29.4.1) (solid line is the expected value and dotted line is the 67% confidence level). It is clear that the simulated values agree well with the theoretical ones

Fig. 29.4

The value of Y is shown as a function of period for a synthetic light-curve with a transit at 1.25 days. The different panels show different values of S ∕ N. Note that there is a well defined minimum at the right period. The dotted line shows the 80% confidence level. Note that at this level there is only a single minimum at the right period, even for S ∕ N as low as 5