PML versus minimum χ²: the comeback

Arellano (J Econ 42:247–265, 1989a) showed that valid equality restrictions on covariance matrices could result in efficiency losses for Gaussian PMLEs in simultaneous equations models. We revisit his two-equation example using finite normal mixture PMLEs instead, which are also consistent for mean and variance parameters regardless of the true distribution of the shocks. Because such mixtures provide good approximations to many distributions, we relate the asymptotic variance of our estimators to the relevant semiparametric efficiency bound. Our Monte Carlo results indicate that they systematically dominate MD and that the version that imposes the valid covariance restriction is more efficient than the unrestricted one.


Introduction
Maximum likelihood and minimum chi-square methods have been competing for the estimator throne for a long time. At the turn of the nineteenth century, Legendre (1805) and Gauss (1809) put forward least squares estimation as a Gaussian-based alternative to Laplace's (1774) least absolute deviation method, which relied on his eponymous distribution. Almost a century later, Pearson proposed not only the method of moments (see Pearson 1894), but also the chi-square criterion in the context of matching theoretical and empirical frequencies (see Pearson 1900). In turn, the development of maximum likelihood estimation (MLE) by Fisher (1922, 1925) was one of the most important achievements in twentieth-century statistics. Under standard regularity conditions, MLE asymptotically achieves the Cramér-Rao lower bound (see Cramér 1946; Rao 1945), which makes it at least as good as any minimum χ² estimator. In addition, it achieves second-order efficiency after a bias correction (see Rao 1961). Moreover, the imposition of valid equality restrictions on the parameters systematically leads to efficiency gains (see Rothenberg 1973).
However, not everybody was convinced (see Neyman and Scott (1948) on the incidental parameter problem, as well as the inconsistent MLE examples in Basu (1955), Kraft and Le Cam (1956), and Bahadur (1958)), and minimum χ² methods remained popular. In fact, Berkson (1980) argued that ML was often just a special case of minimum χ², and not necessarily the best one. Soon afterwards, White (1982), building on earlier work by Huber (1967), and Gouriéroux et al. (1984) studied the properties of pseudo-MLEs, characterising their consistency and general inefficiency. Arellano (1989a) put another nail in the ML coffin by showing that valid equality restrictions could result in efficiency losses for Gaussian PMLEs. Arguably, the wooden stake to the heart was driven by Newey and Steigerwald (1997), who described the inconsistency of non-Gaussian PMLE procedures under distributional misspecification. Since then, graduate students with non-Bayesian teachers learn the normal distribution only, and Gaussian PMLE is just an example of Hansen's (1982) Generalised Method of Moments (GMM). In this paper, though, we argue that non-Gaussian PMLE, like a B-movie vampire, deserves a second life (or death).
We do so by revisiting the two-equation textbook example in Arellano (1989a),1 except that instead of basing PMLE on the Gaussian distribution, as he did, we use discrete mixtures of normals. The reason is twofold. First, Fiorentini and Sentana (2023) show that, under standard regularity conditions, such estimators are consistent for the conditional mean and variance parameters regardless of the true distribution of the shocks to the model and the number of mixture components, thereby nesting the results for Gaussian PMLE in Gouriéroux et al. (1984) while simultaneously avoiding the concerns raised by Newey and Steigerwald (1997). Second, finite normal mixtures with a sufficiently large number of components can provide good approximations to many distributions (see Nguyen et al. 2020), so it is reasonable to conjecture that PMLEs based on them may get close to achieving the semiparametric (SP) efficiency bound, and therefore exploit the potential adaptivity of some of the parameters when it exists, at least asymptotically.2
The rest of the paper is organised as follows. Section 2 introduces the example in Arellano (1989a) and summarises his main results. Then, in Sect. 3 we extend those results to the entire parameter vector, derive the relevant semiparametric efficiency bounds, and use them to benchmark the different estimators, including the PMLEs based on finite Gaussian mixtures. Next, Sect. 4 contains the results of our extensive Monte Carlo experiments, while Sect. 5 concludes. Proofs and auxiliary results are relegated to the appendices.

The example
Consider the following textbook example:
y1 = γ + α y2 + β z1 + u1,   (1)
y2 = μ0 + μ1 z1 + μ2 z2 + u2,   (2)
with E(u1|z1, z2) = E(u2|z1, z2) = 0 and a constant conditional covariance matrix of (u1, u2) with elements σ1², σ12 and σ2². As is well known, the unrestricted Gaussian PMLE of α and β coincides with the IV estimator that uses a constant, z1 and z2 as instruments in the first equation. In turn, the restricted Gaussian PMLE that imposes σ12 = 0 coincides with the OLS estimator of the first equation.
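To fix ideas, the mechanics of the two estimators can be sketched with a small simulation based on this two-equation design. All parameter values and the t(8) shocks below are illustrative choices for this sketch, not the designs used later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical parameter values, chosen for illustration only
gamma, alpha, beta = 0.2, 0.4, 0.4
mu0, mu1, mu2 = 0.3, 0.6, 0.6

z1 = rng.normal(1.0, 1.0, n)
z2 = rng.normal(1.0, 1.0, n)
u1 = rng.standard_t(8, n)   # independent non-Gaussian shocks, so sigma_12 = 0
u2 = rng.standard_t(8, n)

y2 = mu0 + mu1 * z1 + mu2 * z2 + u2        # second (reduced-form) equation
y1 = gamma + alpha * y2 + beta * z1 + u1   # first (structural) equation

X = np.column_stack([np.ones(n), y2, z1])  # regressors of the first equation
Z = np.column_stack([np.ones(n), z1, z2])  # instruments: constant, z1 and z2

ols = np.linalg.lstsq(X, y1, rcond=None)[0]  # restricted Gaussian PMLE
iv = np.linalg.solve(Z.T @ X, Z.T @ y1)      # unrestricted Gaussian PMLE
```

With σ12 = 0 both estimators are consistent for (γ, α, β); the discussion that follows is about their relative asymptotic precision.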
When the joint conditional distribution of u1 and u2 is Gaussian, OLS is at least as efficient as IV, which justifies the Durbin-Wu-Hausman test.3 But Arellano's (1989a) seemingly counterintuitive result says that when the true conditional distribution is not Gaussian, IV may be more efficient than OLS for α and β even though σ12 = 0. Specifically, he showed that IV will beat OLS if and only if
μ22 > 1 + ρ_{y2z2·z1}^{-2},   (3)
where μ22 is the co-kurtosis coefficient between the two structural shocks and ρ_{y2z2·z1} is the correlation coefficient between y2 and z2 after partialling out the effect of z1. Intuitively, μ22 affects the correct sandwich version of the asymptotic covariance matrix of the OLS estimators of the slope parameters. Appendix A contains detailed expressions for the asymptotic variances of the OLS and IV estimators of α and β. We have used those expressions to create Fig. 1, which displays in (ρ_{y2z2·z1}, μ22) space (minus one plus) the ratio of the asymptotic variances of the OLS and IV estimators of α for positive values of ρ_{y2z2·z1}.4 We do so for the special case in which the R² of Eq. (2) coincides with ρ²_{y2z2·z1}, which allows this parameter to vary freely from 0 to 1.5 As expected, OLS is more/less efficient than IV to the left/right of the boundary line (3).
Notes to Fig. 1: when the R² of Eq. (2) coincides with ρ²_{y2z2·z1}, the relative efficiency of the OLS/IV estimators of α is given by [(1 − ρ²_{y2z2·z1})μ22 + ρ²_{y2z2·z1}]ρ²_{y2z2·z1}. The solid line denotes the boundary line μ22 = 1 + ρ_{y2z2·z1}^{-2}, while the dotted line denotes the locus of (ρ_{y2z2·z1}, μ22) combinations for which the IV estimator of α reaches its maximum asymptotic efficiency relative to the corresponding OLS estimator, which is given by ρ²_{y2z2·z1} = ½ μ22/(μ22 − 1).
This figure also shows the locus of (ρ_{y2z2·z1}, μ22) combinations for which the IV estimator of α reaches its maximum asymptotic efficiency relative to the corresponding OLS estimator in this set-up, which is given by the curve ρ²_{y2z2·z1} = ½ μ22/(μ22 − 1).
Further increases in ρ_{y2z2·z1} for a given μ22 result in decreases in relative efficiency, with OLS and IV becoming indistinguishable as ρ_{y2z2·z1} → 1, in which case z2 becomes a perfect instrument for y2.
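These comparative statics are easy to verify numerically. The sketch below encodes the relative efficiency expression from the notes to Fig. 1 (as reconstructed here) together with the boundary and maximum-efficiency loci; it is a check of the algebra, not of the underlying asymptotic derivations.

```python
import numpy as np

def rel_eff_ols_iv(rho2, mu22):
    # AVar(OLS)/AVar(IV) for alpha when the R^2 of Eq. (2) equals rho2;
    # values above 1 mean IV is asymptotically more efficient than OLS
    return ((1.0 - rho2) * mu22 + rho2) * rho2

# On the boundary mu22 = 1 + 1/rho2, OLS and IV are equally efficient
rho2 = 0.75                       # rho = sqrt(3)/2
mu22_boundary = 1.0 + 1.0 / rho2  # = 7/3
ratio = rel_eff_ols_iv(rho2, mu22_boundary)  # equals 1 up to rounding error

# For a given mu22, the IV advantage peaks at rho2 = mu22 / (2 * (mu22 - 1))
mu22 = 4.0
rho2_star = mu22 / (2.0 * (mu22 - 1.0))
```

Note that the peak location replicates the dotted line of Fig. 1, and the unit ratio on the boundary replicates the solid line.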
Notes to Fig. 2: when the R² of Eq. (2) coincides with ρ²_{y2z2·z1}, the panels display the relative efficiency of the MD/OLS and MD/IV estimators of α; the solid line denotes the boundary line (3).
In this context, Arellano's (1989a) proposed solution is to replace Gaussian PMLE by Minimum Distance (MD) estimators, a special case of minimum chi-square methods popularised in econometrics by Malinvaud (1970). The rationale is as follows. Let θ = (γ, α, β, μ0, μ1, μ2, σ1², σ2²) denote the vector of structural parameters. Given that the reduced form of model (1)-(2),
y1 = (γ + αμ0) + (β + αμ1)z1 + αμ2 z2 + (u1 + αu2),
y2 = μ0 + μ1 z1 + μ2 z2 + u2,
is exactly identified, the unrestricted MD estimator coincides with IV, which is Indirect Least Squares. Then, Arellano (1989a) shows that imposing the restriction σ12 = 0 leads to an overidentified optimal MD procedure (weakly) more efficient than both IV and OLS for α and β. This optimal MD estimator requires an asymptotic covariance matrix of the reduced-form parameter estimators which recognises that the third- and fourth-order multivariate cumulants of u1 and u2 are not usually 0 when they are jointly non-normally distributed. Appendix A also contains detailed expressions for the asymptotic variances of the optimal MD estimators of α and β. We have used those expressions to create Fig. 2, which depicts in (ρ_{y2z2·z1}, μ22) space (minus one plus) the ratio of the asymptotic variance of the restricted optimal MD estimator of α to the asymptotic variance of either the OLS estimator (to the left of (3)) or the IV one (to its right) in the same set-up as Fig. 1. As can be seen, the efficiency gains are relatively small over the displayed range, and they vanish when either the partial correlation goes to 0 or 1 or the co-kurtosis term goes to 0.6 The predictable reaction of a fervent ML believer to Figs.
1 and 2 would be to argue that condition (3) requires the combination of a very good instrument (a high ρ²_{y2z2·z1}) with a substantial amount of non-normality (a large μ22), in which case the Gaussian assumption would be very inappropriate. For example, a joint Student t distribution for u1 and u2 cannot satisfy this condition when the number of degrees of freedom is six or more, and the requirement becomes increasingly difficult to satisfy for poorer instruments.
A naïve ML solution would be to assume that u1 and u2 follow a bivariate Student t distribution to estimate the model parameters, which should dominate MD. In this respect, we have used the expressions in Appendix A to create Fig. 3a, b, which display in (ρ_{y2z2·z1}, μ22) space (minus one plus) the ratio of the asymptotic variances of the t-based MLEs of α and β that impose σ12 = 0 to the asymptotic variances of the corresponding restricted optimal MD estimators. As can be seen, these figures confirm that ML does indeed dominate MD in this case.
The problem with this naïve approach is that if the assumed joint distribution is incorrect, the resulting PMLEs may be inconsistent, as forcefully argued by Newey and Steigerwald (1997).
However, this does not mean that all parameters will be inconsistently estimated. Specifically, Proposition 3 in Fiorentini and Sentana (2019) implies that the unrestricted t-based PMLEs of α and β are always consistent irrespective of the true distribution. Similarly, their Proposition 1 implies that the restricted t-based PMLEs of α and β will remain consistent when the conditional distribution of σ1⁻¹u1 and σ2⁻¹u2 is elliptical even though it does not coincide with the distribution assumed for estimation purposes. Besides, it may be possible to obtain two-step consistent estimators in closed form along the lines of Fiorentini and Sentana (2019).
More importantly, Fiorentini and Sentana (2023) show that all parameters will always be consistently estimated if one assumes for estimation purposes that u1 and u2 follow a finite mixture of bivariate normals, regardless of the true distribution of those innovations and the number of components of the mixture, as long as the shape parameters are estimated simultaneously with the mean and variance parameters.7 Thus, the consistency of the Gaussian PMLE is just a special case.
The ability of finite Gaussian mixtures to approximate many other distributions mentioned in the introduction suggests that we may be able to relate these finite mixture PMLEs to SP estimators which simply exploit the independence of the shocks and the conditioning variables without making any parametric assumptions. For that reason, in the next section we take SP estimators as our benchmark to study: 1. the efficiency of the OLS, IV, MD and correct ML estimators relative to the SP ones; 2. the relative efficiency of restricted and unrestricted versions of these SP estimators; and 3. the efficiency of the finite mixture-based PMLEs relative to the SP estimators in the context of model (2).

Minimum distance revisited
Although the main focus of the analysis in Arellano (1989a) was α and β, it is of some interest to study the asymptotic efficiency of the optimal MD estimators of the remaining structural model parameters relative to their OLS and IV counterparts. Given that the number of different bivariate cumulants of orders 3 and 4 is 4 and 5, respectively, we focus on the special case in which the joint distribution of the (standardised) structural shocks conditional on the instruments is spherical, or s(0, I2, η) for short, where η is the possibly infinite vector of shape parameters. More formally, we state this as Assumption 1. To simplify the expressions further, we are going to follow Appendix B in Fiorentini and Sentana (2019) and re-parametrise the unrestricted covariance matrix of the structural residuals as in (8), where ψ12 is the coefficient in the least squares projection of u2 on u1, and σ² and ω are, respectively, the geometric mean of the variances and the natural log of the ratio of the standard deviations of these shocks, under the maintained assumption that they are uncorrelated.8 Let θ† = (γ, α, β, μ0, μ1, μ2, ω, σ²) denote the vector of structural parameters implied by (8) under the restriction ψ12 = 0.
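Although the display (8) itself did not survive extraction, the verbal description pins down one parametrisation consistent with it; the following is therefore a reconstructed sketch rather than necessarily the paper's exact expression:

```latex
\Sigma(\psi_{12},\omega,\sigma^{2}) \;=\; \sigma^{2}
\begin{pmatrix}
e^{\omega} & \psi_{12}\,e^{\omega} \\
\psi_{12}\,e^{\omega} & e^{-\omega}+\psi_{12}^{2}\,e^{\omega}
\end{pmatrix},
\qquad
\det \Sigma(\psi_{12},\omega,\sigma^{2}) \;=\; \sigma^{4}.
```

Under this parametrisation, ψ12 is exactly the coefficient of the least squares projection of u2 on u1; when ψ12 = 0, σ² is the geometric mean of the two variances and ω is the log of the ratio of the standard deviations, matching the description above; and the determinant is σ⁴ for any (ψ12, ω), the property invoked later when discussing Table 1.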
Using the expressions for the Jacobian linking θ† and θ in (A17), we can then show under standard regularity conditions that:
Proposition 1 Let τ1, τ2, σ²_{z1}, σ²_{z2} and σ_{z1z2} denote the means, variances and covariance of z1 and z2. If Assumption 1 holds, then:
(a) The difference between the asymptotic covariance matrices of the OLS and MD estimators of θ†, θ†_LS and θ†_MD, respectively, is positive semidefinite of rank 1 at most, with a basis for its image given by (9) and a basis for its kernel by (10), (11) and (12).
(b) The difference between the asymptotic covariance matrices of the IV and MD estimators of θ†, θ†_IV and θ†_MD, respectively, is positive semidefinite of rank 1 at most, with the same basis for image and kernel.
(c) The difference between the asymptotic covariance matrices of the OLS and IV estimators of θ†, θ†_LS and θ†_IV, respectively, is positive/negative semidefinite of rank 1 depending on condition (3), with exactly the same basis for image and kernel.
This proposition considerably sharpens the results in Arellano (1989a) for the special case of spherically symmetric disturbances by showing that the asymptotic efficiency gains concentrate in a single linear combination of the parameters of the first equation γ, α and β given by (9). In contrast, any other linear combination of the parameters orthogonal to this one does not generate any efficiency gains. Specifically, the parameters of the second equation and the residual variances are estimated just as efficiently by the three procedures.
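To make the MD logic concrete, here is a minimal numerical sketch of the restricted estimator: map the structural parameters (with σ12 = 0) into the reduced-form coefficients and the reduced-form error covariance, and minimise a quadratic form in the discrepancy. The mapping below is derived from the two-equation model of Sect. 2, and the identity weighting matrix is for illustration only; the optimal MD estimator would instead weight by the inverse of the asymptotic covariance of the reduced-form statistics, including third- and fourth-order cumulant terms.

```python
import numpy as np
from scipy.optimize import minimize

def pi_of_theta(t):
    # theta = (gamma, alpha, beta, mu0, mu1, mu2, s1sq, s2sq), with sigma_12 = 0
    g, a, b, m0, m1, m2, s1, s2 = t
    return np.array([
        g + a * m0, b + a * m1, a * m2,  # reduced-form coefficients for y1
        m0, m1, m2,                      # reduced-form coefficients for y2
        s1 + a**2 * s2, a * s2, s2,      # vech of the reduced-form error covariance
    ])

def md_estimator(pi_hat, W, theta0):
    # minimum distance: minimise (pi_hat - pi(theta))' W (pi_hat - pi(theta))
    crit = lambda t: (pi_hat - pi_of_theta(t)) @ W @ (pi_hat - pi_of_theta(t))
    return minimize(crit, theta0, method="BFGS").x

# Noise-free check that the 9 reduced-form statistics (over)identify the 8
# structural parameters once sigma_12 = 0 is imposed; values are illustrative
theta_true = np.array([0.2, 0.4, 0.4, 0.3, 0.6, 0.6, 0.5, 0.5])
theta_md = md_estimator(pi_of_theta(theta_true), np.eye(9), theta_true + 0.05)
```

In practice pi_hat would come from least squares on the reduced form, and the restriction σ12 = 0 is what creates the single overidentifying restriction exploited by the optimal weighting matrix.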

Semiparametric estimation and efficiency bounds
The optimal instruments theory of Chamberlain (1987) implies that Arellano's (1989a) MD estimator achieves the SP efficiency bound which exploits the correct specification of the conditional mean and variance functions for y1 and y2 in the reduced-form model (2) when the joint third- and fourth-order cumulants of u1 and u2 conditional on z1 and z2 are constant. However, if this last maintained assumption is true, then one can in principle obtain an even more efficient MD estimator of the model parameters after augmenting it with equations for the third- and fourth-order cumulants of the reduced-form residuals under the assumption that the joint cumulants of u1 and u2 conditional on z1 and z2 are constant up to the eighth order.
In fact, the results in Bickel et al. (1993) allow us to obtain the SP efficiency bound which exploits that the joint distribution of u1 and u2 is independent of z1 and z2. Moreover, we can also consider a restricted version of this SP bound under the maintained assumption that (7) holds, as in Hodgson and Vorkink (2003), which will be bigger in the usual positive semidefinite sense. Henceforth, we shall refer to this bound and its associated estimator by the abbreviation SS, reserving SP for the one which does not impose sphericity.
An interesting question in this context is the possibility that some but not all of the parameters of model (2) can be partially adaptively estimated, in the sense that their SP estimators are as asymptotically efficient as the infeasible ML estimators which exploit the information of the true distribution of the shocks, including the values of their shape parameters. The following proposition provides a precise answer to this question under sphericity for the restricted estimators that impose σ12 = 0:
Proposition 2
(a) The difference between the asymptotic covariance matrices of the restricted SS and infeasible ML estimators of θ†, θ†_SS and θ†_ML(η), respectively, is positive semidefinite of rank 1 at most, with a basis for its image given by (0_{1×7}, 1), and a basis for its kernel by (I_7, 0_{7×1}).
(b) The difference between the asymptotic covariance matrices of the restricted SP and infeasible ML estimators of θ†, θ†_SP and θ†_ML(η), respectively, is positive semidefinite of rank 5 at most, with a basis for its image given by (1, 0_{1×7}), (0 …) and (0_{2×6}, I_2), and a basis for its kernel by (…).
(c) The difference between the asymptotic covariance matrices of the MD and SP estimators of θ†, θ†_MD and θ†_SP, respectively, is positive semidefinite of rank 4 at most, with a basis for its image given by (0_{2×1}, I_2, 0_{2×5}) and (0_{2×4}, I_2, 0_{2×2}), and a basis for its kernel by (0 …).
The first part of the proposition implies that all the structural model parameters except the overall residual scale σ² can be (partially) adaptively estimated by the SS estimator, as expected from Proposition 12 in Fiorentini and Sentana (2021).
More interestingly, the second part of the proposition implies that, in addition to μ1 and μ2, the coefficient of the linear projection of y1 onto a constant and z1, which is given by β + αμ1 + αμ2σ_{z1z2}/σ²_{z1}, will be adaptively estimated by the restricted SP estimator. In this respect, a very important by-product of this proposition is that the model parameters that can be partially adaptively estimated often continue to be consistently estimated under distributional misspecification of the innovations, as shown by Fiorentini and Sentana (2019, 2021) in the context of multivariate location-scale models such as (1)-(2). We will revisit this issue in the Monte Carlo section. Finally, the last part of the proposition says that the variances of the structural-form residuals, as well as the intercepts in the reduced-form regressions of y1 and y2 on a constant and the demeaned values of z1 and z2, which are given by γ + τ1(β + αμ1) + τ2(αμ2) and μ0 + τ1μ1 + τ2μ2, respectively, are asymptotically equally efficiently estimated by the MD and SP estimators. More importantly, it also says that the efficiency gains are concentrated in the four slope coefficients of the two structural equations.
It would be tedious but otherwise straightforward to extend Propositions 1 and 2 to the case in which the distribution of the shocks conditional on z1 and z2 is not spherical, as a function of the four third-order and five fourth-order cumulants of u1 and u2. In fact, there is one important instance in which those higher-order cumulants would be unnecessary for the comparisons. Specifically, we can use Proposition 13.2 in Fiorentini and Sentana (2021) to prove that, subject to regularity, both the parameters of the unrestricted covariance matrix of the reduced-form residuals and the intercepts in the reduced-form regressions of y1 and y2 on a constant and the demeaned values of z1 and z2 will be as efficiently estimated by the IV estimator as by the unrestricted SP estimator, while the slopes will always be adaptively estimated, just as in the second part of Proposition 2. The reason is twofold. First, the information matrix, the feasible parametric efficiency bound, the SP bound, and the usual Gaussian sandwich formula become block-diagonal between those reduced-form parameters and the four structural slope coefficients α, β, μ1 and μ2. In turn, this block diagonality leads to a saddle-point characterisation of the asymptotic efficiency of the SP estimator of θ, with the slope coefficients being adaptive and the others only reaching the efficiency of the Gaussian PMLE.

Efficiency gains from the equality constraint
It is also of interest to analyse the effects of imposing the covariance restriction σ12 = 0 on the different estimators we have considered:
(a) The difference between the asymptotic covariance matrices of the unrestricted and restricted infeasible ML estimators of θ† is positive semidefinite of rank 1 at most, with a basis for its image given by (9), and a basis for its kernel by (10), (11) and (12).
(b) The difference between the asymptotic covariance matrices of the unrestricted and restricted SS estimators of θ† is positive semidefinite of rank 1 at most, with the same bases for image and kernel.
(c) The difference between the asymptotic covariance matrices of the unrestricted and restricted SP estimators of θ† is positive semidefinite of rank 1 at most, again with the same bases for image and kernel.
Therefore, when one uses "efficient" estimators, the imposition of the valid equality constraint σ12 = 0 always leads to (weak) efficiency gains for exactly the same linear combination of the parameters of the first structural equation for which optimal MD leads to an efficiency gain relative to both OLS and IV. In fact, it is straightforward to generalise (a) so that it applies to the feasible parametric ML estimators of θ† which simultaneously estimate the finite vector of shape parameters η, as well as to the ML estimators of those shape parameters themselves. This contrasts with the seemingly counterintuitive result in Arellano (1989a), which simply reflects the fact that OLS does not use the optimal MD weighting matrix in the non-normal case.

Finite mixtures as sieves
Finally, we study the extent to which PMLEs based on finite mixtures of normals with an increasing number of components could constitute a proper sieve-type SP procedure, as we argued in the introduction.
We do so first when the standardised shocks to model (2) conditional on z1 and z2 follow a bivariate Student t with zero means, unit standard deviations, no correlation and 5 degrees of freedom, while the model parameters are estimated with finite scale mixture-based log-likelihood functions with K = 2, 3 and 4 components. For comparison purposes, we consider four different benchmarks that impose the restriction σ12 = 0: (i) the MLE based on the correctly specified log-likelihood function that fixes the number of degrees of freedom to 5, (ii) the SS estimator, (iii) the OLS estimator, and (iv) the optimal MD estimator.
We compute the expected value of the Hessian and outer product of the score of the scale mixture-based PMLEs by means of large sample averages of the analytical expressions in Fiorentini and Sentana (2021) evaluated at the true values of the mean and variance parameters in θ and the pseudo-true values of the shape parameters, which we numerically obtain from samples of millions of simulated observations.
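A univariate analogue of this computation might look as follows, with a two-component scale mixture standing in for the bivariate mixtures of the paper and purely hypothetical starting values: the pseudo-true shape parameters maximise the misspecified expected log-likelihood, approximated by a large simulated sample, and the outer product of the numerically differentiated scores gives one block of the sandwich formula.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm, t

rng = np.random.default_rng(0)
x = t.rvs(df=5, size=100_000, random_state=rng)  # "true" shocks: Student t(5)

def logpdf(x, p):
    # misspecified model: two-component scale mixture of normals
    lam, v1, v2 = p
    f = lam * norm.pdf(x, scale=np.sqrt(v1)) \
        + (1.0 - lam) * norm.pdf(x, scale=np.sqrt(v2))
    return np.log(f)

# Pseudo-true shape parameters: maximise the average misspecified log-likelihood
res = minimize(lambda p: -logpdf(x, p).mean(), x0=[0.8, 0.8, 3.0],
               method="Nelder-Mead",
               bounds=[(0.01, 0.99), (0.05, 20.0), (0.05, 20.0)])
p_star = res.x

# Outer product of the score at the pseudo-true values via central differences
eps = 1e-5
scores = np.empty((3, x.size))
for i in range(3):
    d = np.zeros(3); d[i] = eps
    scores[i] = (logpdf(x, p_star + d) - logpdf(x, p_star - d)) / (2.0 * eps)
B_hat = scores @ scores.T / x.size   # one ingredient of the sandwich formula
```

The Hessian block would be obtained analogously by differencing the score once more; the paper instead evaluates analytical expressions, so this numerical route is only a stand-in.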
The results, which we report in Table 1, show that the scale mixture-based PMLEs of all the model parameters except the overall residual scale σ² quickly approach the asymptotic efficiency of the infeasible MLE, despite the fact that no finite scale mixture of normals can approximate the unbounded higher-order moments, tail behaviour or nonlinear tail dependence of a multivariate Student t. In fact, although panel (a) in Fig. 3 of Gallant and Tauchen (1999) clearly illustrates that a more complex misspecified model does not necessarily lead to more efficient estimators, because one is not simply adding new elements to the score but also changing the pseudo-true values of the shape parameters at which one evaluates the original components of the score, we find that the efficiency improvements occur monotonically.9 As a result, it seems that the covariance matrix of the errors in the least squares projection of the scores of the true model onto the scores of the mixture-based log-likelihood becomes smaller and smaller as K increases (see Proposition 7 in Calzolari et al. 2004).
In contrast, the asymptotic variances of the scale mixture-based PMLEs of σ² coincide with the asymptotic variances of the OLS estimators irrespective of the number of components, which reflects (i) the block diagonality of the different asymptotic covariance matrices in Proposition 12.2 of Fiorentini and Sentana (2021), because the determinant of (8) is precisely σ⁴, and (ii) the fact that the ML estimators of the mean in a scale mixture of K gammas are numerically the same regardless of K, as explained in Fiorentini and Sentana (2023).
Notes to Table 1: …, and σ_{z1z2} = 0. OLS denotes the usual ordinary least squares estimator, PML-SMN K denotes pseudo-ML based on a bivariate scale mixture of K normals, SS denotes the spherically symmetric SP estimator, ML denotes the MLE which exploits the information of the true distribution of the shocks, including the degrees of freedom, and MD denotes the optimal minimum distance estimator.
We then conduct a similar exercise when u1 and u2 conditional on z1 and z2 follow a bivariate asymmetric Student t with zero means, unit standard deviations, no correlation, negative tail dependence and the same μ22 as in the symmetric case. We estimate the unrestricted model parameters using general finite mixture-based log-likelihood functions with K = 2, 3 and 4 components, and consider as benchmarks the following three unrestricted estimators: infeasible MLE, SP, and IV. In this case, we compute the expected value of the Hessian and outer product of the score of the mixture-based PMLEs using large sample averages of the theoretical expressions in Amengual et al. (2023) evaluated at the true values of the mean and variance parameters and the pseudo-true values of the shape parameters obtained from very large samples of simulated observations.
The results we report in Table 2 show that the mixture-based PMLEs of the slope parameters approach the asymptotic efficiency of the infeasible MLE, despite the fact that no finite mixture of normals can approximate the unbounded higher-order moments, tail behaviour or nonlinear tail dependence of a multivariate asymmetric Student t. Again, we find that the efficiency improvements occur monotonically. In contrast, the asymptotic variances of the mixture-based PMLEs of the intercepts and covariance matrix of the reduced form in mean-deviation form coincide with the asymptotic variances of the corresponding IV estimators irrespective of the number of components, which reflects the fact that the ML estimators of the mean vector and covariance matrix in mixtures of K normals are numerically the same for any K ≥ 1 (see also the discussion at the end of Sect. 3.2).
Notes to Table 2: …, and σ_{z1z2} = 0. IV denotes the usual instrumental variables estimator, PML-MN K denotes pseudo-ML based on a bivariate mixture of K normals, SP denotes the semiparametric estimator, and ML denotes the MLE which exploits the information of the true distribution of the shocks, including the degrees of freedom. Moreover, E(y1) and E(y2) are shorthand for γ + τ1(β + αμ1) + τ2(αμ2) and μ0 + τ1μ1 + τ2μ2, respectively.

Monte Carlo analysis
In previous sections, we have derived several asymptotic results regarding the relative efficiency of the LS, IV and MD estimators, as well as the finite mixture-based PMLEs, the SS estimators, and the feasible and infeasible MLEs. In this section, in contrast, we make use of an extensive Monte Carlo simulation exercise to assess their small-sample behaviour.
… ≈ 0.87, which corresponds to the dotted line in Fig. 1; and c. μ22 = 7/3 and ρ_{y2z2·z1} = (μ22 − 1)^{-1/2} = √3/2 ≈ 0.87, which is another case of equal efficiency of IV and OLS, but with lower co-kurtosis.10 As for the distribution of the structural shocks, we consider four non-Gaussian possibilities for (u1, u2). For illustrative purposes, we display the joint densities and contours of standardised versions of these distributions in comparison with the bivariate spherical Gaussian distribution in Figs. 4 and 5 for the spherically symmetric and general cases, respectively.
In all simulated samples, the exogenous variables z = (z1, z2) are generated according to a bivariate Student t distribution with 8 degrees of freedom, mean vector τ = (1, 1) and an identity covariance matrix.11 Next, for each choice of the partial correlation ρ_{y2z2·z1} mentioned above, we choose … Without loss of generality, these restrictions implicitly determine the variance of the error term of the second equation as σ2² = 1 − R2². We also impose the same balancing restriction on the slopes of the first equation by choosing … Then, we fix R1² to 0.5, which implies σ1² = 1/2, an arbitrary choice that simply scales the asymptotic variances of all the different estimators of α and β by the same amount (1 − R1²).12 Finally, we choose the values of the intercepts γ and μ0 so that E(y1) = E(y2) = 1 (see Appendix C for further details).
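As a concrete illustration of part of this design, the instruments and the balanced slopes of the second equation can be generated as follows. The rescaling of the Student t draws is one assumption about how the identity covariance is achieved, and the value of μ1 = μ2 simply reproduces the 0.65 reported in the figure notes for the σ2² = 1/7 case.

```python
import numpy as np

rng = np.random.default_rng(0)
n, nu = 200_000, 8

# Bivariate Student t with nu = 8 degrees of freedom, mean vector (1, 1) and an
# identity covariance matrix: mix normals with a sqrt(nu / chi2_nu) scale and
# shrink by sqrt((nu - 2) / nu) so that each component has unit variance
w = np.sqrt(nu / rng.chisquare(nu, size=(n, 1)))
z = 1.0 + np.sqrt((nu - 2) / nu) * w * rng.standard_normal((n, 2))

# Balanced slopes of the second equation when sigma_{z1 z2} = 0: with mu1 = mu2,
# R2^2 = mu1^2 + mu2^2, so sigma_2^2 = 1 - R2^2 = 1/7 implies
mu = np.sqrt((1.0 - 1.0 / 7.0) / 2.0)   # approximately 0.65
```

The same mixing-variable construction extends directly to the shock distributions, which only differ in how the mixing variable and the normal components are specified.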
We display the finite sample results by means of the box plots in Figs. 6, 7, 8, 9, 10 and 11, which concentrate on α and β, the two parameters of interest.Figures 6, 7 and 8 show the Monte Carlo results for 250 observations for cases a., b. and c., respectively, while Figs. 9, 10 and 11 contain the results for 1000 observations in the same order.
Our findings indicate that OLS is better in finite samples than what the asymptotic theory suggests because the sample co-kurtosis coefficient is downward biased for μ22. In fact, the asymptotic efficiency advantage of the IV estimator of α relative to LS can only be observed in panels b and d of Fig. 10, when the sample length is large and the distribution of the shocks is either a scale or a general finite mixture of normals, which is when there seems to be a lower small-sample bias for μ22. They also confirm that optimal MD dominates both OLS and IV in finite samples, but the need to estimate third- and fourth-order multivariate cumulants to compute the optimal weighting matrix handicaps it somewhat (see Altonji and Segal (1996) for analogous results in the context of optimal GMM estimators when the shocks are fat-tailed). Our results also indicate that non-Gaussian PML based on a restrictive parametric distribution like the Student t or a discrete scale mixture of normals works well when the true distribution is spherical, but it generates inconsistencies otherwise when we impose the constraint σ12 = 0. Notice, though, that the unrestricted estimators are always consistent for the slope parameters, while the restricted estimators seem to be consistent for β + μ1α despite being inconsistent for both α and β, which is in line with our theoretical discussion following Proposition 2.
Notes to the figures: in all DGPs, we set σ1² = 1 so that R1² = 1/2; with ρ_{y2z2·z1} = √3/2 (maximum relative efficiency of IV versus LS, or a tie, depending on μ22), σ2² = 1/7 and, therefore, γ = 0.22, α = β = 0.39, μ0 = 0.31 and μ1 = μ2 = 0.65.
(In all DGPs, we set σ1² = 1 so that R1² = 1/2; in order to have a tie between IV and LS, we set ρy2z2·z1 = 1/√2, so that σ2² = 1/3 and, therefore, γ = 0.20, α = β = 0.40, μ0 = 0.15 and μ1 = μ2 = 0.58.) In turn, the performance of the two-step SS estimators is very similar to the performance of the corresponding parametric estimators, although their finite sample variances are larger than what the asymptotic theory predicts. Specifically, the consistency pattern of the restricted and unrestricted SS estimators is almost identical.
More importantly, we find that non-Gaussian PMLEs based on a flexible distribution like a general finite mixture of normals work well in practice regardless of the true distribution, systematically dominating MD. In addition, the version that imposes the valid covariance restriction σ12 = 0 is always more efficient than the unrestricted one.
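To make the OLS/IV efficiency comparison concrete, the following minimal simulation sketches a two-equation system in the spirit of the experiments above. It is not the paper's exact design: the instruments are independent standard normals, the shocks are independent Student t(5) draws rescaled to unit variance (so that σ12 = 0 holds), and only the slope values α = β = 0.39 and μ1 = μ2 = 0.65 are borrowed from the figure notes; the structural equations themselves are our own stylised assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 500, 300
alpha, beta = 0.39, 0.39           # slope values taken from the figure notes
mu0, mu1, mu2 = 0.31, 0.65, 0.65   # reduced-form coefficients for y2 (same source)

est_ols, est_iv = [], []
for _ in range(reps):
    z1 = rng.standard_normal(n)
    z2 = rng.standard_normal(n)
    # independent Student t(5) shocks rescaled to unit variance, so sigma12 = 0
    u = rng.standard_t(5, size=(n, 2)) * np.sqrt(3 / 5)
    y2 = mu0 + mu1 * z1 + mu2 * z2 + u[:, 1]
    y1 = beta * z1 + alpha * y2 + u[:, 0]
    X = np.column_stack([np.ones(n), z1, y2])
    # OLS of y1 on (1, z1, y2): consistent here precisely because sigma12 = 0
    est_ols.append(np.linalg.lstsq(X, y1, rcond=None)[0][2])
    # just-identified IV/2SLS with instruments (1, z1, z2)
    Z = np.column_stack([np.ones(n), z1, z2])
    est_iv.append(np.linalg.solve(Z.T @ X, Z.T @ y1)[2])
```

Because σ12 = 0, both estimators are consistent for α, and the Monte Carlo dispersion of the IV estimates exceeds that of OLS, in line with the asymptotic ranking discussed above.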

Directions for further research
As we mentioned at the end of Sect. 3.2, it would be useful to generalise our theoretical results by dropping the assumption of spherical symmetry. Similarly, and although we have seen that our proposed finite mixture-based PMLEs get close to achieving the SP efficiency bound both under sphericity and in general, an obvious extension of our Monte Carlo experiments would be to consider standard two-step SP estimators that, starting from a consistent estimator such as OLS, carry out one BHHH iteration using the efficient SP score estimated nonparametrically without imposing spherical symmetry. The curse of dimensionality in estimating multivariate densities, though, might further reduce the theoretical advantages of this method in finite samples.
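For readers unfamiliar with the mechanics, a single BHHH iteration from a consistent first-step estimator only requires the per-observation score contributions. The sketch below is purely illustrative: it uses a textbook Gaussian location model with a closed-form score rather than the nonparametrically estimated efficient SP score described above, and all names are our own.

```python
import numpy as np

def bhhh_step(theta0, scores):
    """One BHHH update: theta1 = theta0 + (S'S)^{-1} S'1, where S stacks
    the n x k per-observation score contributions evaluated at theta0."""
    S = scores(theta0)
    return theta0 + np.linalg.solve(S.T @ S, S.sum(axis=0))

# Toy illustration: Gaussian location model with known unit variance,
# score_i = x_i - m, starting from the sample median (a consistent but
# inefficient first-step estimator); one step lands near the MLE (the mean).
rng = np.random.default_rng(1)
x = rng.standard_normal(2000) + 0.5
theta1 = bhhh_step(np.array([np.median(x)]),
                   lambda th: (x - th[0]).reshape(-1, 1))
```

The point of the two-step construction is exactly this: the update inherits the asymptotic efficiency of the score used, so plugging in a nonparametric estimate of the efficient SP score would deliver the SP bound.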
Another worthwhile exercise would be to extend the analysis in this paper to the general simultaneous equations model with arbitrary numbers of endogenous variables and instruments considered by Arellano (1989a). Aside from involving more complex analytical expressions than in the bivariate example we have considered, the main practical complication would be that the number of free parameters of a standardised multivariate mixture increases with the square of the cross-sectional dimension, as we explain in Appendix D.
Last, but not least, deriving a formal result that shows that finite Gaussian-mixture-based PMLEs may provide a proper sieve ML estimator when the number of components increases at a suitable rate constitutes a particularly interesting avenue for further research.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

A.1 Instrumental variables (IV)
In this context, the unrestricted Gaussian PMLE of α and β coincides with the IV estimator that uses a constant, z1 and z2 as instruments in the first equation. To consider both equations at once, let ϑ = (θ′, σ12)′ and where and is the inverse of the (lower) Cholesky decomposition of . We can then exploit Proposition C2 in Supplementary Appendix C of Fiorentini and Sentana (2021) to obtain where, if we use = 0 to denote normality and Kmn for the commutation matrix of orders m and n (see, e.g., Magnus and Neudecker (2019)).
Given that the assumption of constant conditional higher-order cumulants applies to the structural model, though, we need to relate the higher-order moments of the reduced form residuals to those of the structural ones. Defining where L2 and D2 are the elimination and duplication matrices of order 2, respectively (see Magnus and Neudecker (2019)), we will have that and where For our purposes, it is convenient to rewrite these expressions as where R1² and R2² are the population coefficients of determination of Eqs. (1) and (2), respectively, and ρy2z2·z1 the correlation coefficient between y2 and z2 after partialling out the effect of z1.

A.3 Optimum minimum distance (MD)
Let c = vec(C) and ω = vech(Ω) denote the parameters of the unrestricted reduced form model. From Eqs. (5)-(6), we will have that Let φLS = (c10, c11, c12, c20, c21, c22, ω11, ω12, ω22)′ denote their unrestricted Gaussian PML estimators, which coincide with equation-by-equation OLS. To obtain the asymptotic distributions of these estimators, we need the first derivatives of the conditional mean vector and covariance matrix with respect to the unrestricted reduced form parameters, which are given by In this notation, the contribution to the Gaussian log-likelihood scores for c and ω corresponding to observation i will be given by and Consequently, the outer product of the scores will be and Similarly, we can easily adapt the expressions in Amengual et al. (2022) to write the contribution of observation i to the Hessian matrix hc,ωi(c, ω). Thus, we have all the ingredients to compute AVar(√n φLS) using the standard sandwich formula in White (1982) and Gouriéroux et al. (1984).
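As a minimal illustration of the sandwich formula just mentioned, the snippet below computes A⁻¹BA⁻¹, with A the (negative) average Hessian and B the average outer product of the scores, for the slopes of a linear regression estimated by Gaussian PML under heteroskedasticity. It is a generic textbook sketch, not the paper's bivariate system; the function name and DGP are our own.

```python
import numpy as np

def sandwich_avar(X, resid):
    """White (1982) sandwich A^{-1} B A^{-1} for the slopes of a linear
    regression: A = X'X/n (negative average Hessian of the Gaussian
    pseudo-log-likelihood, up to the variance scale) and
    B = sum_i e_i^2 x_i x_i' / n (average outer product of the scores)."""
    n = X.shape[0]
    A = X.T @ X / n
    Xe = X * resid[:, None]
    B = Xe.T @ Xe / n
    Ainv = np.linalg.inv(A)
    return Ainv @ B @ Ainv   # asymptotic variance of sqrt(n)(beta_hat - beta)

rng = np.random.default_rng(2)
n = 5000
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
e = rng.standard_normal(n) * (1 + 0.5 * np.abs(X[:, 1]))   # heteroskedastic
y = X @ np.array([1.0, 2.0]) + e
beta = np.linalg.lstsq(X, y, rcond=None)[0]
avar = sandwich_avar(X, y - X @ beta)
```

When the Gaussian pseudo-likelihood is misspecified, as here, A⁻¹BA⁻¹ differs from the naive inverse-Hessian formula, which is the whole point of the sandwich.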
On this basis, we can show that the asymptotic variance of Malinvaud's (1970) optimum MD estimator will be given by where.
Specifically, we obtain that which, rewritten in terms of the population coefficients of determination, becomes

A.4 Maximum likelihood with spherical innovations
Invoking Proposition C1 in Supplementary Appendix C of Fiorentini and Sentana (2021), we can obtain the asymptotic variance of the ML estimator that imposes with , and Similarly, we can compute the asymptotic variance of the unrestricted ML estimator, which also estimates σ12, as with As a consequence,
SERIEs (2023) 14:253-300

Analogous calculations using Z
Once again, we can write these expressions as for the restricted estimator, and as for the unrestricted one.

A.5 Spherically symmetric semiparametric estimator (SS)
From Proposition C3 in Supplementary Appendix C of Fiorentini and Sentana (2021), the spherically symmetric SP efficiency bound is given by and Analogously, we can obtain AVar( ) by exploiting the expressions for the derivatives of the unrestricted model that we obtained when we discussed the IV estimators.

A.7 Semiparametric estimator (SP)
We can make use of Proposition D3 in Supplementary Appendix D of Fiorentini and Sentana (2021), which indicates that the SP efficiency bound for j = R, U will be given by In turn, for unconstrained estimators that also estimate σ 12 , so that ϑ † = (ϑ , ψ 12 ) , we would have

Proof of Proposition 1
Computing in Mathematica the spectral decomposition of AVar(√n θLS) − AVar(√n θMD) using the expressions (A5) and (A6), we find that it has only one eigenvalue different from zero, namely, which is non-negative, with as associated eigenvector. Analogously, after computing the spectral decomposition of AVar(√n θIV) − AVar(√n θMD) using the expressions (A2) and (A6), we find that it has only one eigenvalue different from zero, namely, which is non-negative, with (B19) its associated eigenvector once again. Finally, doing the same for AVar( ) by combining (A2) and (A5), we find that it has only one eigenvalue different from zero, namely, which can be positive or negative depending on μ22, and with the same eigenvector.
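The rank-one structure exploited in the proof is easy to visualise numerically: if two asymptotic covariance matrices differ by a rank-one term, their difference has a single nonzero eigenvalue whose sign settles the efficiency ranking. The matrices below are hypothetical stand-ins, not the paper's (A2), (A5) or (A6).

```python
import numpy as np

# Hypothetical 3x3 example mimicking the structure found in the proof:
# the less efficient estimator's AVar equals the more efficient one's
# plus a rank-one positive semidefinite term.
v = np.array([1.0, -0.5, 0.25])            # illustrative eigenvector
V_md = np.array([[2.0, 0.3, 0.1],
                 [0.3, 1.5, 0.2],
                 [0.1, 0.2, 1.0]])         # stand-in for AVar(MD)
V_ls = V_md + 0.8 * np.outer(v, v)         # rank-one efficiency loss
eigvals, eigvecs = np.linalg.eigh(V_ls - V_md)
```

Here eigh returns exactly one eigenvalue that is numerically different from zero, equal to 0.8‖v‖², with v (normalised) as its eigenvector.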

C.1 Standardised variables
We start by assuming that: where the correlation matrix is positive definite. In this notation, the coefficients of the least squares projection of y1 onto y2 and z1 are the corresponding projection errors and the residual variance In turn, the coefficients of the least squares projection of y2 onto z1 and z2 are The coefficients μ1 and μ2 are sometimes called the standardised regression coefficients, as they explain the ceteris paribus change in y°2 (measured in standard deviation units) resulting from a unit standard deviation change in z°1 or z°2. Thus, once we standardise the three variables involved, the crucial ingredients of the first equation are the coefficient of determination R2², the correlation between the regressors ρz1z2 and the partial correlations between y2 and each of the regressors, which are given by In fact, there are only three underlying parameters that determine these four quantities: ρy2z1, ρy2z2 and ρz1z2, because Thus, we can either select ρy2z1, ρy2z2 and ρz1z2, or we can select R2², ρ²y2z1·z2 and ρ²y2z2·z1.
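The mapping between the two parametrisations can be checked mechanically. The snippet below computes the population R² and the partial correlation from an illustrative triple (ρy2z1, ρy2z2, ρz1z2) using the standard textbook formulas; the numerical values are ours, not the paper's design points.

```python
import numpy as np

# Illustrative correlations (not the paper's design values)
r_y2z1, r_y2z2, r_z1z2 = 0.5, 0.6, 0.3

# Population R^2 of the regression of y2 on (z1, z2): rho' Szz^{-1} rho
rho = np.array([r_y2z1, r_y2z2])
Szz = np.array([[1.0, r_z1z2],
                [r_z1z2, 1.0]])
R2 = rho @ np.linalg.solve(Szz, rho)

# Partial correlation of y2 and z2 after partialling out z1
r_y2z2_z1 = (r_y2z2 - r_y2z1 * r_z1z2) / np.sqrt(
    (1 - r_y2z1**2) * (1 - r_z1z2**2))
```

The quadratic form ρ′Σzz⁻¹ρ reduces to the familiar closed form (ρ²y2z1 + ρ²y2z2 − 2ρy2z1 ρy2z2 ρz1z2)/(1 − ρ²z1z2), which shows concretely how the three underlying correlations pin down all four quantities in the text.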

D On multivariate discrete mixture of normals
Consider the following mixture of two multivariate normals: u_t ∼ N(ν1, Σ1) with probability λ and N(ν2, Σ2) with probability 1 − λ. (D21) Let s_t denote a Bernoulli variable which takes the value 1 with probability λ and 0 with probability 1 − λ. As is well known, the unconditional mean vector and covariance matrix of the observed variables are E(u_t) = λν1 + (1 − λ)ν2 and V(u_t) = λΣ1 + (1 − λ)Σ2 + λ(1 − λ)δδ′, where δ = ν1 − ν2.
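These unconditional moments are straightforward to verify by simulation. The component means, covariance matrices and mixing probability below are arbitrary illustrative choices.

```python
import numpy as np

lam = 0.3
nu1, nu2 = np.array([1.0, 0.0]), np.array([-0.5, 0.5])
S1 = np.array([[1.0, 0.2], [0.2, 0.5]])
S2 = np.array([[0.8, -0.1], [-0.1, 1.2]])
delta = nu1 - nu2

# theoretical unconditional moments of the mixture
mean = lam * nu1 + (1 - lam) * nu2
cov = lam * S1 + (1 - lam) * S2 + lam * (1 - lam) * np.outer(delta, delta)

# simulation check: draw the Bernoulli indicator, then the component
rng = np.random.default_rng(3)
n = 200_000
s = rng.random(n) < lam
draws = np.where(s[:, None],
                 rng.multivariate_normal(nu1, S1, n),
                 rng.multivariate_normal(nu2, S2, n))
```

The λ(1 − λ)δδ′ term is the between-component contribution: it vanishes when ν1 = ν2, which is why scale mixtures (δ = 0) inherit the covariance λΣ1 + (1 − λ)Σ2 directly.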
Therefore, this random vector, which we will denote as u*_t, will be standardised if and only if For example, in the bivariate case, if we let ΨL Ψ′L denote the Cholesky decomposition of Ψ, we can write u_t = π + ΨL u*_t, where π = (π1, π2)′ and ΨL = [ψ11 0; ψ21 ψ22].
Additionally, let δ = (δ1, δ2)′ and ℵL = [κ11 0; κ21 κ22], so that the vector of shape parameters of u*_t becomes (δ1, δ2, κ11, κ21, κ22, λ)′. Let us initially assume that ν1 = ν2 = 0, so that δ = 0. Let Σ1L Σ′1L and Σ2L Σ′2L denote the Cholesky decompositions of the covariance matrices of the two components. Then, we can write Thus, it is not difficult to see that choosing , we will have Similar calculations can be applied for general n, the only difference being that the number of free parameters of the standardised mixture increases with the square of the cross-sectional dimension.

Fig. 1
Fig. 1 Relative efficiency OLS/IV for α. Notes: When the R² of Eq. (2) coincides with ρ²y2z2·z1, the relative efficiency of the OLS/IV estimators of α is given by

1. Student t distribution with ν = 5 or ν = 5.5 degrees of freedom, corresponding to μ22 = 3 and μ22 = 7/3, respectively;
2. Scale mixture of two normals in which the higher variance component has probability λ = 0.05 and the ratio of the variances is either κ = 0.094 or κ = 0.122, corresponding to μ22 = 3 and μ22 = 7/3, respectively;
3. Asymmetric Student t distribution with negative tail dependence b = (−1, −1)′ but degrees of freedom ν = 9.65 or ν = 10.38, respectively;
4. Location-scale mixture of two normals in which the higher variance component has probability λ = 0.05, μ22 is as in 1., and the marginal skewness of u1 and u2 is as in 3., which is achieved with (see Appendix D for further details on this parametrisation).
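As a sanity check on item 2., under one plausible reading of the (λ, κ) parametrisation (κ as the ratio of the two component variances of a common discrete scale factor s, normalised so that E(s) = 1), the quoted values indeed deliver μ22 = E(u1²u2²) = E(s²) close to 3 and 7/3. This reading is our assumption, not necessarily the paper's exact one, which Appendix D describes.

```python
def mu22_scale_mixture(lam, kappa):
    """E[u1^2 u2^2] for a standardised bivariate discrete scale mixture of
    normals under a hypothetical reading of (lam, kappa): the common scale
    s equals s_hi with probability lam and s_lo = kappa * s_hi otherwise,
    normalised so that E[s] = 1.  Then E[u1^2 u2^2] = E[s^2] because the
    two components share the same scale and are otherwise independent."""
    s_hi = 1.0 / (lam + (1 - lam) * kappa)
    s_lo = kappa * s_hi
    return lam * s_hi**2 + (1 - lam) * s_lo**2
```

With λ = 0.05, κ = 0.094 gives a value of about 3.01, and κ = 0.122 about 2.33, matching the targets μ22 = 3 and 7/3 up to rounding of κ.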

Table 1
Asymptotic variances of alternative estimators. Notes: DGP for structural innovations: bivariate Student t with 0 means, unit standard deviations, no correlation and 5 degrees of freedom. Parameter values: γ

Table 2
Asymptotic variances of alternative estimators. Notes: DGP for structural innovations: bivariate asymmetric Student t with 0 means, unit standard deviations, no correlation and shape parameters ν = 9.65 and b