QML estimation with non-summable weight matrices

This paper revisits the theory of asymptotic behaviour of the well-known Gaussian Quasi-Maximum Likelihood estimator of parameters in mixed regressive, high-order autoregressive spatial models. We generalise the approach previously published in the econometric literature by weakening the assumptions imposed on the spatial weight matrix. This allows consideration of interaction patterns with a potentially larger degree of spatial dependence. Moreover, we broaden the class of admissible distributions of model residuals. As an example application of our new asymptotic analysis we also consider the large sample behaviour of a general group effects design.


Introduction
It is a broadly employed assumption in a wide range of theoretical studies on spatial econometrics that the spatial weight matrix is absolutely row and column summable. This restriction is mostly a result of the Central Limit Theorem (CLT) used in the derivation of the result on asymptotic behaviour. Historically, it can be traced to the works of Kelejian and Prucha, e.g. Kelejian and Prucha (2001), who were 1 3 first to formulate their assumptions as explicit requirements regarding the spatial weight matrix. Their CLT, which turned out to be a milestone in the development of asymptotic theories for spatial econometric models, relies on the absolute summability of the weight matrix involved. In our study we attempt to reconsider this approach, and we focus specifically on the Quasi Maximum Likelihood (QML) estimator for the spatial autoregressive model. By revisiting the classical argument of Lee (2004) and, importantly, introducing a generalised CLT for linear-quadratic forms, we are able to provide a theory for consistency and asymptotic normality of QML estimates for high-order spatial autoregressive models under relaxed conditions. In particular, our approach allows for spatial weight matrices that, even if row-standardised, may not be absolutely column summable.
The standard approach, with the absolute summability requirement on the weight matrix, undoubtedly has the appeal of a simple, self-contained theory. Although there might be a perception that the constraint is necessary for showing the desired asymptotic behaviour of various estimation schemes, it has already been recognised that this is not the case, see, e.g. Gupta and Robinson (2018). Indeed, the boundedness of row and column sums can be replaced with the less restrictive requirement of boundedness of spectral norms. 1 Unfortunately, the standard analysis excludes some of the spatial interaction patterns in which the number of spatial units influenced by any given unit grows to infinity. In particular, under infill asymptotics, if spatial units are assumed to interact with other units within a given distance, then the number of nonzero elements in a row or column of the spatial weight matrix grows with the sample size, leaving the potential for either row or column non-summability. Similarly, under increasing domain asymptotics certain spatial weight matrices based on inverted distance also lead to non-summable interaction patterns. 2 Theoretical considerations can also reveal another limitation of the standard analysis. For example, consider the cane of an initial model specification which is subjected to a transformation (e.g. linear filtering or demeaning) to obtain its final estimated form. Under the standard analysis, it is necessary to ensure that the applied transformation preserves the summability of rows and columns of the spatial weight matrix. As a result, the requirements of the standard analysis narrow the class of possible transformations of the model. A somewhat similar problem occurs if the so-called linear structure representation for model innovations is assumed. 3 Then, following the standard approach, the derivation of the asymptotic distribution requires additional restrictions on the class of possible linear relations involved.
This paper provides a positive solution to a problem left by Gupta and Robinson (2018), who have made the first attempt to replace the requirement 1 3 QML estimation with non-summable weight matrices of uniform summability of the spatial weight matrix with boundedness in the spectral norm. In that earlier work, extending the scope of spatial weight matrices beyond the standard asymptotic analysis of Lee (2004) was found to be useful, and results analogous to our Theorem 1 on consistency were independently obtained. However, their derivation of the asymptotic distribution of the estimates still relies on the assumption of absolute summability. Whether a relevant asymptotic distribution theorem under relaxed conditions is possible has been left as an open question.
Therefore, the aim of this paper is to present a refinement to the asymptotic analysis of the Gaussian Quasi-Maximum Likelihood (QML) estimator for high-order, spatial autoregressive models, considering the assumptions imposed on the spatial weight matrix. We further present an example of a possible application of our theorems. To this end, we develop a general group effects, high-order Spatial Autoregresive (SAR) model. Our approach to eliminating the effects components from the spatial process generalises that of Lee and Yu (2010a, b) and Lee et al. (2010), as well as Olejnik and Olejnik (2017), to the high-order autoregressive case. Finally, we present statements on consistency and asymptotic normality of the resulting QML estimator, which would not be possible to obtain with the standard argument.
The paper is organized as follows. Section 2 describes the motivation for addressing the topic. Section 3 presents our statements on the consistency and asymptotic normality of the Gaussian QML estimator. Finally, Sect. 4 develops an estimator for a high-order, spatial autoregressive, general group effects model, together with an analysis of its large sample behaviour. Appendices contain some details of the proofs, as well as a set of Monte Carlo simulations that empirically demonstrate the validity of the theory under the relaxed conditions.

Motivation for the refined asymptotic analysis
This section presents some basic examples of the application of our asymptotic analysis. First we present a class of spatial interaction schemes which cannot be handled by the standard asymptotic analysis of Lee (2004). Then, we describe a class of theoretical arguments for which the new analysis demonstrates a clear advantage over the standard approach. We also discuss the assumptions made in relation to the error term. We conclude the section with a brief discussion on whether our results may be considered optimal.
First, however, let us introduce the notation used in this paper when referring to norms. Unless stated otherwise, vectors, i.e. elements of ℝ m , are column vectors. For an arbitrary row or column vector x the symbol ‖x‖ denotes the usual Euclidean vector norm, which will also be called the l 2 norm. The quantity ∑ m i=1 �x i � is referred to as the l 1 norm. The same symbol, when used for matrices, denotes the induced spectral norm. That is, for a matrix A, the quantity ‖A‖ is the largest singular value of A. For square matrices A we will also use the Frobenius norm ‖A‖ F , the maximum absolute column sum norm ‖A‖ 1 , and the absolute row sum norm ‖A‖ ∞ .

Elementary examples of non-summable interaction patterns
One essential feature of an asymptotic theory for spatial econometric models is the set of assumptions imposed on the spatial weight matrix. These assumptions limit the amount of spatial interactions to a manageable degree, such that the statements on the large sample behaviour of the estimation scheme under question remain valid. It is a widely adopted standard in contemporary spatial econometrics to require that the spatial weight matrix is row and column summable. More precisely, with the conventional notation that W n is the spatial weight matrix for the sample size n ∈ ℕ , it is required that the quantities ‖W n ‖ 1 and ‖W n ‖ ∞ are both uniformly bounded in n. These conditions turn out to be unnecessarily restrictive in limiting the scope of spatial interactions that can be incorporated in an econometric model.
Let us start with a theoretical remark. Notice that the row and column summability of W n is equivalent to the rows and columns constituting a bounded set in l 1 . However, a reader familiar with the ubiquitous nature of the theory of squaresummable functions and sequences in much of the mathematical econometrics and geostatistics literature 4 might expect that the l 2 norm would play a major role in the asymptotic theory-at least for some simple cases. In fact we find that the connection between the sharp boundedness condition for an asymptotic theory and square-summability is more nuanced. Instead of examining the properties of rows and columns of the spatial weight matrices, it is necessary to consider the sequence of the respective spectral norms ‖W n ‖ . The requirement then turns out to be the boundedness of the sequence. 5 This will be discussed in subsequent sections.
The following motivating examples are connected to the class of Inverse Distance Weighting (IDW) interaction schemes, that find a wide range of uses in spatial econometrics, and other quantitative methods of geography. Let us assume that the strength of spatial interaction, represented by the spatial weights, is of the form w ij = 1 dist(i,j) , where > 0 is a parameter and dist(i, j) is a measure of the distance between units i and j. A question then arises: what are the values of for which an individual row or column of the matrix W = [w ij ] satisfies the requirement of boundedness imposed by the standard analysis? That is, for which values of is it absolutely summable?
The answer to this question will depend on the nature of the spatial domain and the type of asymptotic scheme employed. We focus our attention here on the increasing domain asymptotics, since that is the more natural asymptotic scheme in this context. Let us now consider a one-dimensional spatial domain in which the spatial units are more or less evenly spaced, or at least the distance between each pair of consecutive units does not exceed a constant distance D > 0 . Then, for an 1 3 QML estimation with non-summable weight matrices arbitrary unit j the column sums (and, by symmetry, row sums as well) of such a matrix W n = [w ij ] i,j≤n satisfy the bound The right-hand side in Eq. (1) converges if and only if > 1 . Let us note that the condition on the j-th row/column square summability would, in turn, be satisfied if the series 2 2 . Non-summable distance-related matrices were also investigated by Lee (2002) in the context of the Ordinary Least Squares (OLS) estimator. In that work, the complementary condition, that is ≤ 1 2 , was derived as necessary and sufficient for consistency and asymptotic normality (at the rate √ n ) of the OLS estimator if the matrix of inverse distances is row-normalised prior to being used in the model. With ≤ 1 2 the amount of spatial interaction (in each row) is intractable for Maximum Likelihood (ML) estimation. However, dividing elements of the weight matrix by the increasing row sums leads to max i,j≤n w ij → 0 and 1 n tr(G n ) → 0 , with G n = W n (I n − 0 W n ) −1 and 0 being the autoregressive parameter. 6 According to Lee (2002), the dominant component of the bias for the OLS estimator is proportional to 1 n tr(G n ) . That is to say, the row-normalisation reduces the spatial dependence to the extent that the OLS estimator becomes consistent. It should be noted that the results of Lee (2002) still require the spatial weight matrix (after row-normalisation) to be summable in both rows and columns. Let us make clear that in our argument we do not assume that the procedure of row-normalisation is applied to the spatial weight matrix, for the reasons presented in Sect. 2.2.
Although the condition of summability > 1 derived from Eq. (1) is not too restrictive, it is also fair to say that such a linear domain is rarely encountered in practice. If we now extend this example to higher dimensions, the restriction on also changes. To see this, let us assume that the spatial domain is, a two-dimensional plane, as it is in majority of economic applications. 7 Now, the crucial observation is the following. Let us imagine a circle of radius around a fixed spatial unit. The number of spatial units which are intersected by the circle is roughly proportional 8 to its circumference, 2 . Similarly, for any given unit j on the plane, and any radius , the number (j, ) of units i for which − 1 < dist(i, j) ≤ is roughly proportional to . Thus, we have 6 Since I n + 0 G n = (I n − 0 W n ) −1 , the value of 0 n tr(G n ) may indicate the component of the average direct effect due to the feedback loop in spatial interactions, c.f. Le Sage and Pace (2009). 7 To maintain mathematical rigour, we may also have to assume that the distribution of the spatial units is neither pathologically dense nor sparse anywhere in the spatial domain. This is actually the case in all realistic settings. In particular, we might expect that the distances between nearby units are within a fixed range D 1 ≤ dist(i, j) ≤ D 2 or, alternatively, the units are regions with areas in a given range A 1 ≤ area(i) ≤ A 2 . For example, tessellations obtained by generating a Voronoi diagram of a randomly distributed set of points constitute an adequate framework. 8 The actual constants would possibly involve D 1 , D 2 , A 1 , A 2 , see footnote 7.
Again, the series on the right hand side converges only if > 2 . Thus, in most cases for IDW spatial models, unless the exponent is well greater than two, the standard asymptotic analysis fails as the spatial weight matrix is simply not summable. This fact is apparently not widely known as it is still common to see inverse distance decay parameters, either set a priori or estimated, lie in [0, 2], see Anselin (2002). In particular, this also applies to the case of = 2 , which constitutes a popular econometric analogue of the Newtonian gravity model. 9 In this planar case, an IDW row or column would be square summable if only > 1 . Going further, it might be argued that in the less common, but still relevant 10 , three-dimensional domain the summability restriction on alpha becomes > 3 , whereas row and column squaresummability requires > 3 2 . 11 Limitations of the summability requirement of the standard asymptotic analysis are not only encountered in the case of increasing domain asymptotics. In fact, the situation is even worse, for infill asymptotics, as by definition, new neighbours are allowed to emerge within any radius about a given unit. Then, in an extreme case, even the simple common border spatial weight matrix may not be summable. Establishing asymptotics of a model based on such an interaction scheme is therefore highly problematic. Summability issues can also be found in interaction models of a non-geographical nature. As an example, consider a social networking model, where the relation of "friendship" in an online social networking service represents contiguity, and the distance is measured in terms of the contiguity path between given pair of individuals. Then, the average distance n between members an n element group is expected to grow slowly with n. This results in a behaviour similar to infill asymptotics. In other words, the quantity (j, ) is expected to grow very rapidly (cf. Eq. (2)), similarly as it would be the case in a high dimensional space.

Applicability of the new asymptotic analysis
This section explains the circumstances under which our new asymptotic analysis applies to a non-summable interactions scheme. As we noted in the previous section, a sharp condition on boundedness requires a nuanced approach. Although it can be shown that a matrix is bounded in spectral norm only if its rows and columns are square summable, those two classes of matrices are not necessarily coincident. Unfortunately, there is no straightforward test which could be applied to matrix entries to decide whether its spectral norm is bounded. Nevertheless, we 9 It is also common for statistical software to provide the functionality of generating and using such spatial weight matrices without any warning. 10 One might consider autoregressive models of natural phenomena in environmental sciences, where an additional dimension may be present, e.g. depth, altitude etc. 11 In general, in an m dimensional domain (j, ) is proportional to m−1 , thus the two condition are > m and > m 2 , respectively.

3
QML estimation with non-summable weight matrices formulate some general suggestions on the use of non-summable matrices in model specifications.
For a square-summable matrix to be bounded in spectral norm it is sufficient that one of the two additional conditions is met. The first condition is when the number of rows and columns which are not absolutely summable is finite. This case is relevant to models that distinguish a set of units, called economic "centres of gravity", for which the amount of spatial interaction is possibly non-summable. Similarly, a social networking model might distinguish a set of "leader/influencer" individuals with non-summable impact on other members of a group. In those cases the resulting matrices are bounded in spectral norm.
If the number of non-summable rows and columns is infinite or, in particular, all of them are non-summable, then the matrix still can be bounded in spectral norm. It is the case if those non-summable rows and columns are in a sense asymptotically not strongly correlated or, in other words, the corresponding weightings are not too similar. Unfortunately, this formulation is not at all precise, and thus for such interaction schemes, we suggest applying a rescaling factor as described below. Importantly, this operation preserves the structure of the spatial interdependence expressed by relative magnitudes between all weights in the matrix, e.g. Elhorst (2001) or Vega and Elhorst (2015).
Although rescaling is similar to the familiar procedure of row-normalisation, there are important differences that need to be highlighted. We note that rownormalisation is typically applied for the following three reasons. Firstly, almost by definition, it normalises the amount of spatial interaction received by each of the spatial units. This is believed to facilitate the interpretation of the autoregressive parameter. Secondly, together with non-negativity and a zero diagonal, the Greshgorin theorem 12 implies that the space for the autoregressive parameter can contain any compact 13 subset of (−1, 1) . Lastly, the procedure trivially assures summability of rows, which is a part of the boundedness assumption of the standard analysis.
Although, normalisation of rows is indeed beneficial for interpretation in a variety of contexts, e.g. common border or k-nearest neighbours schemes, it may also be harmful in certain interaction patterns. In particular, as emphasised in (Vega and Elhorst, 2015, pp. 355) after (Anselin, 1988, pp. 23-24), and Kelejian and Prucha (2010), "row-normalising a weights matrix based on inverse distance causes its economic interpretation in terms of distance decay to no longer be valid". Secondly, as the maximal modulus of an eigenvalue does not exceed spectral norm, a matrix rescaled by ‖W n ‖ −1 allows any value in (−1, 1) in its parameter space. Lastly, row-normalisation does not in general assure column summability required by the standard analysis, whereas rescaling by ‖W n ‖ −1 produces a matrix with unit spectral norm. 12 The Greshgorin circle theorem states that for any matrix A = [a ij ] i,j≤n and any its eigenvalue v there is . 13 It is prudent to recall that the parameter space for ML-type estimation should be compact.
The merits of the rescaling procedure have long been recognised and applied in practice. For example Elhorst (2001) and Vega and Elhorst (2015) rescale weight matrices by their largest characteristic roots. Notice that in the case of a symmetric weight matrix the root is equal to the spectral norm. However, it is not true in general, not even for arbitrary symmetric matrices, that such rescaling assures row and column summability. This makes our asymptotic theory necessary to justify inferences from such a model.
Although the described procedure may be used to rescale an arbitrary matrix, this does not imply that estimation is possible with any spatial weight matrix. A cautionary example has been given in Lee (2004) of a matrix W n = 1 n−1 n T n − I n , where n = (1, … , 1) T is an n × 1 vector of ones and I n is the identity matrix. In this case, the ML estimator may be inconsistent, even though W n is absolutely summable in both rows and columns. This is explained by Mynbaev and Ullah (2008), who analyse a class of weight matrices, of which W n is a member. For matrices in the class, the identification condition fails. It is not clear to us whether the OLS estimator would be consistent in this case, and nor do the results of Lee (2002) seem to provide the answer. A related matrix W n,m = I m ⊗ W n has also been considered in, e.g. Case (2015a, b), Lee (2004), and Lee (2007b). However, in the case of W n,m , if m grows at a sufficient rate, then favourable asymptotic properties can be assured. Mynbaev and Ullah (2008), and Mynbaev and Ullah (2010) study a class of weight matrices which approximate a kernel of an integral operator on the space of square-integrable functions. These may be related to the class of square-summable matrices. However, their assumptions, in particular the absolute summability of operator eigenvalues, preclude the use of ML estimation. In particular, such weight matrices contain an insufficient amount of information for identification of the autoregressive parameter. Instead, the asymptotic behaviour of the OLS estimator is investigated.

Applicability in theoretical arguments
In this section we argue that the consideration of a wider class of spatial weight matrices can also be beneficial in theoretical arguments when developing new model specifications. Let us start with a rather simplified description of one possible application. Let T n be an operator on ℝ n and let us assume that it is invertible (although in practice this is often not the case). Given a specification of a spatial model, for example the SAR specification Y n = W n Y n + X n + n , we might be interested in its transformed form T n Y n = (T n W n T −1 n )T n Y n + T n X n + T n n . Then it is natural to ask what transformations T n are known to preserve the required properties of the spatial weight matrix, so that the asymptotics of the transformed model could follow easily from the properties of the original specification. In the case of the standard asymptotic analysis of Lee (2004), this means that we want the matrix T n W n T −1 n to be row and column summable whenever W n is. Practically, this limits the class of possible matrices T n to those which are themselves summable, i.e. whose entries we can and must finely control. In that respect, the standard analysis collapses, even in the simple case of T n being an isometry, i.e. an orthogonal matrix.

3
QML estimation with non-summable weight matrices Our refined asymptotic analysis, on the other hand, calls for the operator norm of T n to be bounded. In many cases this is easily satisfied as T n is often a projection, and thus an operator of norm one, for which we have ‖T n W n T + n ‖ ≤ ‖W n ‖ . We note that, although projections are generally not invertible, in practice we may still be able to exploit the fact that the Moore-Penrose inverse T + n is an isometry on the range of T n . A more concrete example can be derived from the work of Lee and Yu (2010a, b), where the incidental parameter problem is addressed in the context of a spatial autoregressive panel model and fixed spatial and temporal effects. In this case, the natural candidate for T n is the demeaning operator, i.e. the projection on the space of zero-mean vectors within units and time periods. A similar idea is applied in Lee et al. (2010) for group effects 14 in social interaction models. With this technique, the expected multiplicative bias correction is derived for consistent estimates. Those results strongly rely on the summability of the demeaning operator T n . However, incorporation of arbitrary group effects, where groups are not necessarily disjoint and the number of groups may increase with n, leads to a demeaning operator which does not have to be summable. We show in Sect. 4 that such generalisation is possible with the refined asymptotic analysis.

On the distribution of innovations
A significant merit of the original paper of Lee (2004) is the consideration of the QML estimation scheme rather than the classical maximum likelihood estimator with the assumption of Gaussian errors. That is to say, the error term in the spatial autoregressive model is allowed to be an arbitrary random vector of independent and identically distributed components, as long as the shared distribution allows moments of order higher than four. This seemingly technical improvement has substantially changed how the validity of inference may be perceived, as it no longer has to rely on a belief in the joint normality of errors.
Similarly, with heterogeneous processes governing the underlying phenomena across spatial units, the expectation of identical distribution of disturbances does not appear to be well-founded. In our analysis, the components of the error term are, for purely technical reasons, assumed to be homogeneous in terms of variance. However, they are not required to follow the same probability law.
Lastly, we have found that the universally employed requirement that the distribution of the residuals should have finite moments higher than four is not necessary. In fact, integrability with the fourth power is sufficient to obtain the Gaussian asymptotics for QML estimates. This result opens the possibility of considering heavier tailed probability laws. For example, one might consider distribution functions with tails of order x −5 log −2 x , as x grows to infinity.

Ultimate optimality
It can be argued that our boundedness condition with the spectral norm (Assumption 4) is already optimal. That is to say, the class of spatial weight matrices considered, in general, cannot be broadened. Indeed, if we consider a matrix W n for which lim n→∞ ‖W n ‖ = ∞ , invalidating our assumptions, then we arrive at the rather uninteresting case of the parameter space of Λ not containing any positive elements. To see this, consider the example of a symmetric spatial weight matrix W n with nonnegative entries. Then, ‖W n ‖ is the maximum eigenvalue and, by a well-known argument, any interval [0, t) ⊂ Λ satisfies t ≤ 1 ‖W n ‖ → 0. Obtaining a Gaussian asymptotic distribution for the estimates requires the use of a CLT. Specifically, it is applied for a random variable Q n which is a linearquadratic form of the residual. If the fourth moments of n were not finite, then the normalising factor for Q n , namely √ Var Q n , would also be infinite. This seriously compromises any effort to derive the limiting distribution. As a result, we believe that any substantial generalisation is unlikely.

Revisiting asymptotic analysis of high-order SAR models
Let W n,1 , … , W n,d be arbitrary n × n matrices. 15 We consider the high-order SAR model described by the following equation where = ( r ) d r=1 ∈ Λ ⊂ ℝ d and ∈ ℝ k . Furthermore, Y n is a vector of n observations on the dependent variable, X n is the matrix of k non-stochastic explanatory variables and n is the error term, for which the assumptions given below hold.

Assumption 1
The matrix X T n X n is invertible, and both Let us note that Assumption 1, used for the consistency argument, does not require the sequence 1 n X T n X n to be convergent. Instead, our reasoning stipulates that this sequence is merely bounded. 16 The necessity of non-singularity of X T n X n is straightforward, as regressors should not be linearly dependent for the slope parameter to be identifiable. Note that our assumption that ‖ r W n,r Y n + X n + n , 1 3 QML estimation with non-summable weight matrices far from the well-known condition necessary for consistency of the OLS estimator for non-spatial regression, i.e. ‖(X T n X n ) −1 ‖ = o(1). 17 Assumption 2 One of the following holds 18 (a) n = ( n,i ) i≤n is a vector of independent random variables with zero mean, variance 2 and uniformly bounded fourth moments, (b) n is of the form n = F̄m where F is an n × m matrix with orthogonal rows of norm one; the underlying ̄m is a vector of independent random variables with zero mean, variance 2 and uniformly bounded fourth moments.
Assumption 3 For every in respective parameter space Λ ⊂ ℝ d the matrix Δ n ( ) = I n − ∑ r≤d r W n,r is invertible.
We investigate the asymptotic behaviour of the widely applied Gaussian QML estimator, which maximises the log-likelihood of the observed sample as if the model innovations were Gaussian, namely the function where = ( T , T , 2 ) T is the model parameter. It turns out that, under certain regularity conditions, this estimation scheme is consistent, even if the model residuals do not follow the normal distribution (c.f. Assumption 2). The estimator will be denoted (̂T n ,̂T n ,̂2 n ) T or ̂n . Our result on the asymptotic behaviour of ̂n requires the following boundedness assumption, which gives the essential condition imposed on a spatial weight matrix.

Assumption 4
The set Λ is compact in ℝ d . There exists a universal constant K Λ such that for all n ∈ ℕ , ∈ Λ , r = 1, … , d the matrix norms ‖W n,r ‖ and ‖Δ n ( ) −1 ‖ do not exceed K Λ .
Notice that any matrix with absolutely summable rows and columns is also bounded in the spectral norm. 19 That is to say, the asymptotic theory presented in this paper is indeed a generalisation of the theory initiated in Lee (2004). Moreover, 18 In particular, elements of the error term do not need to be identically distributed. Also notice that trivially (a) implies (b) for F = I n , m = n . Although condition (a) is simpler and sufficient in a standard setting, the relaxed condition (b) will be crucial in the argument of Sect. 4 when an independent vector of residuals will be transformed by such a matrix F . Note that (b) implies n = 0 and n T n = 2 I n , thus its components may be merely uncorrelated. We emphasise the distinction as the innovations are not assumed to be Gaussian. 19 This follows the fact that ‖A‖ 2 ≤ ‖A‖ 1 ‖A‖ ∞ , for an arbitrary matrix A.

3
it is also a proper generalisation, as there are non-summable interaction schemes bounded in the spectral norm. Remark 1 in the appendix describes an example of such a weight matrix, which is additionally row-standardised. This also shows that row-normalisation, in general, does not ensure absolute summability of columns.
The following identification assumption guarantees that the Gaussian QML estimator is able to asymptotically identify the true value of the spatial autoregressive parameter.
Assumption 5 For every 1 , 2 ∈ Λ , such that 1 ≠ 2 , at least one of the statements (a) or (b) is satisfied: , M X n = I n − X n X T n X n −1 X T n . Assumption 5 is typically called the identification condition. It ensures that there is enough information in the observed process to decrease uncertainty of ̂n , with increasing n. The distinction between the statements (a) and (b) reflects the fact that this information can come from either the spatial autocorrelation of Y n or via the accumulated spatial lag of regressors.

Theorem 1 Under Assumptions 1-5 the Gaussian QML estimator ̂n is consistent.
In order to establish the asymptotic distribution of √ n(̂n − 0 ) we need to adopt a number of additional assumptions. Let Ξ ⊂ ∏ ∞ n=1 ℝ n be the linear space 20 of all sequences (x n ) n∈ℕ , with x n = (x n,i ) i≤n ∈ ℝ n , n ∈ ℕ , for which max i≤n x 2 n,i = o(n) . Additionally, let us set G n,r = W n,r Δ n ( 0 ) −1 for r ≤ d.

Assumption 1'
The earlier Assumption 1 on X n is satisfied. Moreover, each column of the matrices X n and G n,r X n 0 , r ≤ d , is a member of the linear space Ξ.
The above is a technical assumption necessary for obtaining asymptotic normality of the deviation √ n �̂n − 0 � . Intuitively, the limiting distribution can be normal regardless of the original distribution of n only when none of the observations within the regressor matrices makes an overwhelming contribution to the estimate of the corresponding slope coefficient. Let us emphasise that this assumption is also necessary in the simple case of non-spatial least squares regression. Although implicitly, it is also present in the standard analysis as a consequence of other assumptions adopted therein, in particular, the boundedness of elements of X n . 21 20 Naturally, the set ∏ ∞ n=1 ℝ n = ℝ × ℝ 2 × … is a vector space when endowed with element-wise addition. Then, Ξ is its linear subspace. 21 Under the assumption of boundedness of elements of X n , as in Lee (2004) and Lee and Yu (2010a), Assumption 1' is implied by the relation ‖G n,r ‖ 1 = o(n) or, if W n,r is row-normalized, ‖Δ n ( 0 ) −1 ‖ 1 = o(n).

3
QML estimation with non-summable weight matrices Assumption 2' The error term satisfies (a) in Assumption 2. Moreover, the family of random variables 4 n,i , n ∈ ℕ , i ≤ n , is uniformly integrable.
Derivation of the asymptotic distribution requires strengthening of the Assumption 2. However, in Assumption 2' the elements of the error term still do not need to follow the same distribution. Instead, we impose the requirement that those distributions have uniformly integrable tails in terms of the fourth moment.

Assumption 6
The true value of parameter lies in the interior of the space Λ , that is 0 ∈ Int Λ.
Assumption 7 spells out the necessary conditions for the existence of the limiting distribution variance. Note that for consistency of ̂n the sequences (ℑ n ) n∈ℕ , (Σ S,n ) n∈ℕ do not need to converge. A limiting distribution theorem could also be easily obtained under the mere assumption that the norms ‖ℑ n ‖ , ‖ℑ −1 n ‖ , ‖Σ S,n ‖ exist and are uniformly bounded for sufficiently large n. However, its statement would be expressed in terms of a transformation of √ n(̂n − 0 ) , rather than the deviation itself, as is the case in, e.g. Gupta and Robinson (2018). The requirement of invertibility of the matrix ℑ could also be relaxed. However, using the present argument, it would only be possible to obtain partial results. An approach to the problem of the singularity of ℑ which considers various convergence rates has been described in Lee (2004). Finally, we obtain the following generalisation of Theorem 3.2 in Lee (2004).

Application to a higher-order general group effects model
This section provides an illustration of an application of our refined asymptotic analysis to a theoretical argument. We describe a new group effect elimination scheme for the high-order SAR model, and using arguments of Sect. 3, we derive the asymptotics of the corresponding QML estimator. In the simple case of a constant number of group-specific effect dummy variables a consistent, asymptotically normal QML estimator can be obtained quite straightforwardly. In general, however, a more careful approach is necessary. Firstly, the incidental parameter problem must be accounted for to assure consistency of estimates. Secondly, a certain degree of compatibility with the spatial weight matrix has to be assured.
Let us consider a modified version of the SAR model specification (c.f. Eq. (3)) in which an additional term associated with group-specific effects is introduced. The model specification then becomes where ∈ ℝ is the vector of group-specific effects, with = (n) possibly increasing with n, and the columns of the corresponding matrix Φ n are typically dummy variables distinguishing non-overlapping groups of observations. 23 Importantly, we note that, as the number of columns in Φ n may change with sample size, the theorems of Sect. 3 cannot be applied directly.
In applied spatial econometrics it is common to eliminate fixed effects by means of the demeaning procedure, see e.g. Elhorst (2014), which can be understood as a simple projection on the space orthogonal to the columns of Φ n . This is therefore closely related to the well-known Frisch-Waugh theorem, see Baltagi (2005). However, with an increasing number of groups, we must be concerned about singularity of the resulting variance, as expressed in e.g. Anselin et al. (2008) and the incidental parameter problem that can arise in such a setting. An effective method of dealing with those issues is developed in Lee and Yu (2010a), where a projection onto a lower dimensional space is applied to properly derive the asymptotic distribution of the resulting QML estimator. The technique presented in this paper extends this idea. Our approach consists in handling the group effect term together with its spatial lags, that is W n,r Φ n , W 2 n,r Φ n , … . At the same time, the transformed model is projected onto a lower dimensional space, following the idea of Lee and Yu (2010a).
Let K n ⊂ ℝ n be the linear subspace generated by iterating W n,r on the columns of the matrix Φ n . In other words, K n is the smallest W n,r -invariant subspace containing the columns of Φ n . Indeed, any spatial lag of Φ n (member of K n ), when multiplied by W n , is yet another spatial lag, thus it is a member of K n . The same is true for all linear combinations of such spatial lags. Our idea is to filter out those vector components of both Y n and X n which lie in K n . Under the assumption that the orthogonal complement K ⟂ n is sufficiently rich, we can obtain a consistent QML estimator of = ( T , T , 2 ) T . Let n * = n − dim K n and fix an n * × n matrix F whose rows form an orthonormal basis of K ⟂ n . It is easy to observe that M K = F T F is the projection onto K ⟂ n and FF T = I n * .
(5) Y n = d ∑ r=1 r W n,r Y n + X n + Φ n + n , 23 Groups are understood as subsets of observations within the sample to which the specific effect is attributed. As an example, this contains the individual fixed effects model as a special case. That is, for balanced panel data when a longitudinal sample of size n is indexed in a way that distinguishes N spatial units and T time periods, n = NT , groups contain observations relevant to respective spatial units. Similarly, when each group contains observations ascribed to the respective time periods, we arrive at the time-specific fixed effects specification. Let us also note that, formally, columns of Φ n are allowed to be arbitrary vectors, as long as the relevant assumptions of this section hold.

3
QML estimation with non-summable weight matrices Denote Y * n = FY n , X * n = FX n , * n = F n and W * n,r = FW n,r F T . As I n − F T F projects onto K n we have FW n,r I n − F T F = 0 . Thus, transforming the specification of Eq. (5) with F , we obtain since FW n,r = FW n,r F T F and, by definition, FΦ n = 0.
Let us observe that * n satisfies Assumption 2 (b) if the original n satisfies Assumption 2 (a) or (b). The crucial observation, however, is that Assumptions 3 and 4 are satisfied when W * n,r is substituted for W n,r . Indeed, with Δ * n ( ) = I n * − ∑ r≤d r W * n,r = F(I n − ∑ r≤d r W n,r )F T and Δ * n ( ) −1 = F(I n − ∑ r≤d r W n,r ) −1 F T it is sufficient to note that ‖W * n,r ‖ ≤ ‖W n,r ‖ and ‖Δ * n ( ) −1 ‖ ≤ ‖Δ n ( ) −1 ‖ , as ‖F‖ = 1 , whenever n * > 0.
Assumption A The earlier Assumptions 1 and 5 hold for the transformed specification of Eq. (6), that is with X * n , W * n , n * substituted in place of X n , W n , n, respectively.
The following result is an immediate consequence of Theorem 1.
Let us note that the mere application of Theorem 2 is not sufficient to establish asymptotic normality of √ n * �̂ * n − 0 � . The main difficulty is that the components of F n do not have to be independent, even if the original n is. 24 However, with our asymptotic analysis, a valid argument is still possible.

Assumption A'
The earlier Assumption A holds and Assumptions 1' is satisfied for F T X * n , F T G * n,r X * n 0 and n * . Moreover, Assumption 7 holds with Y * n , X * n , n * , ℑ * n etc. in the capacity of Y n , X n , n , ℑ n ....
To capture the relationship between our group effect elimination technique and the classical demeaning operator, let us observe that, if specification of Eq. (6) is further transformed with F T , then we arrive at the proper Gaussian log-likelihood of , given Y † n = F T FY n = M K Y n and X † n = M K X n . Indeed, we obtain where Δ † n ( ) = M K Δ n ( )M K and pdet denotes the pseudo-determinant, i.e. the product of all non-zero singular values.
One advantage of the log-likelihood in Eq. (7) is that it does not depend on a particular choice of matrix F . Moreover, given det Δ n ( ) the pseudo-determinant might be computed using the determinant formula for block matrices. If E is a matrix such that F T , E T is an orthogonal matrix, then we have the relation For some specifications of the group or fixed effects, the determinant of EΔ n ( )E T can be found analytically. For example, in a panel setting, with n = mt , time-invariant matrices W n,r = I t ⊗W m,r and a usual matrix Φ n of spatial unit dummy variables, it can be shown that det EΔ n ( )E T equals to det � I m − ∑ r≤d rWm,r � . If the matrices in W m,r are additionally rownormalised and the matrix Φ n incorporates both temporal and spatial fixed effects, we have det Lastly, in the case of temporal fixed effects specification, it can be shown that

Closing remarks
In this paper we have revisited the analysis of the asymptotic behaviour of the wellknown Gaussian QML estimator for higher-order SAR models. Our findings indicate that the standard assumptions on row and column summability of the spatial weight matrix can be weakened to cover econometric models with a greater degree of spatial dependence. Additionally, it is possible to apply a broader class of model transformations in theoretical arguments, without violating the essential boundedness requirement. Secondly, weaker conditions on the existence of moments of the error term can be imposed and its elements do not need to be identically distributed as long as their kurtosis is uniformly bounded. We expect that our results can be used to reconsider the asymptotic behaviour of QML estimation in more general specifications. Moreover, large sample theories for other estimators, in particular, General Method of Moments or Two-Stage Least Squares, can benefit from reapplication of our arguments, especially our Theorem 5 in "Appendix C". We should also mention that we have made the effort to avoid certain mathematical imprecision that can be found in the arguments of the standard analysis. For example, we properly derive the asymptotic distribution based on the Cramér-Wald theorem. Moreover, 1 3 QML estimation with non-summable weight matrices our proofs rely neither on the existence of the Lagrange remainder in the Taylor expansion nor on the mean value theorem. 25 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creat iveco mmons .org/licen ses/by/4.0/.

Appendix A: Monte Carlo experiments
We have conducted computer simulations to show that, under the relaxed boundedness condition, the asymptotic theory is valid. We considered four different spatial interaction schemes in a linear setting. The first matrix considered, W 1 n , is a common nearest neighbour matrix, with one distinguished central unit, whose interaction with other units is defined by the IDW scheme with the power parameter = 2 . This is a summable matrix and the results of Monte Carlo experiments may serve as a point of reference for non-summable settings.
The matrix W 2 n is analogous to the matrix W 1 n , with the crucial difference that the power parameter = 1 . This leads to an interaction scheme which is no longer summable. The third matrix, W 3 n , is yet another variation on the same idea. However, instead of using the IDW scheme, the non-summable interaction pattern is now uniform, with all its weights equal to 1∕ √ n . It is the largest possible square-summable, uniform pattern, with respect to size of the weights. Lastly, the matrix W 4 n is obtained from the symmetric non-summble IDW matrix, all elements of which are proportional to 1 |i−j| , with i, j being its indices (here we have = 1 ). This matrix has been rescaled by its norm.
The Monte Carlo simulations for all of the matrices W i n , i = 1, 2, 3, 4 , were conducted with a mixed regressive autoregressive specification, see Eq. (3), with a single autoregressive parameter 0 = 0.3 . The regressor matrix X n contained a constant term c 0 = 2 and one regressor, uniformly distributed in an interval symmetric around zero, with the corresponding slope 0 = 3 . In all simulated models the innovations were Gaussian with variance 2 0 = 1. A constant number of Monte Carlo samples, m = 5000 , was used for all trials. Tables 1, 2, 3, 4 show the expectation estimate, the standard error for the estimator, and the Kolmogorov-Smirnov assessment of normality of the individual components of the scaled difference √ n �̂n − 0 � . For all matrices a clear tendency can be seen for the values of the bias and its standard deviation to diminish with increasing n. Moreover, in most cases the quotient of the absolute bias divided by the standard deviation does not Sign. level - * * * * * * * * * * * Sign. level - * * * * * * * * * * * * *  Sign. level -- * * * * * * * * * exceed 0.1256, which implies that true values of parameters lie within 0.05 one-sided confidence intervals for the centre of the distribution of estimates. We have also examined the differences between the theoretical variance of the estimator (implied by the matrix −1 ) and the values derived from the samples. The relative differences ranged from zero to four percent, with an average of roughly two percent, which is consistent with the relative standard deviation √ 2∕m of the 2 (m) distribution.

Appendix B: Additional facts
Remark 1 There is a row-normalised matrix W n which is bounded in spectral norm, yet ‖W n ‖ ∞ is unbounded.
Proof Set D n = [d ij ] i,j≤n with non-zero entries d 1,j = 1∕j , d i,1 = 1∕i and d i,i+1 = d j+1,j = 1 if i, j > 1 . As an illustration, we have Proof of this non-trivial fact can be found in the supplementary material.

Lemma 1 Let U ⊂ ℝ d be an open set and
Lemma 2 Let n = ( n,i ) i≤n be an n × 1 random vector satisfying Assumption 2, and let (A n ) n∈ℕ and (P n ) n∈ℕ be sequences of n × n matrices satisfying If x n ∈ ℝ n , for n ∈ ℕ , is a non-random vector satisfying ‖x n ‖ 2 = O(n) , then (a) for Z a n = 1 n x T n A n n we have Var Z a n ≤ Proof of Lemma 2 is given in the supplementary material.

Appendix C: Theorems
Proof of Theorem 1 Let 0 = T 0 , T 0 , 2 0 T be the true value of parameter . Let S n ( ) = Δ n ( )Δ n ( 0 ) −1 , ∈ Λ . It is a standard approach to use the first-order optimality conditions for and 2 , that is sup n∈ℕ ‖A n ‖ < ∞, sup n∈ℕ ‖P n ‖ ≤ 1 and P n = P T n P n .
to obtain the concentrated log-likelihood ln L c n ( ) = − n 2 ln 2̂2 n ( ) − n 2 + ln | | det Δ n ( ) | | , which is maximised by ̂n . Let us set R n ( ) = 1 n ln L c n ( ) and note that each R n is a random function.
The proof proceeds as follows. First, we obtain consistency of ̂n by the generic argument presented in Lemma 3.1 of Pötscher and Prucha (1997). Thus, we introduce a new, non-random function which uniformly approximates R n for large n. 27 Then we show that 0 is an identifiably unique maximiser of each R n , as in Definition 3.1 of Pötscher and Prucha (1997) and, as a result, we obtain consistency of ̂n . Finally, we deduce consistency of ̂n and ̂2 n .
27 Note that the value of R n represents a log-root-mean-square of n √ L c n ( ) with L c n ( ) being the concentrated likelihood of . Although simply choosing R � n = R n instead of current R n = 1 2 ln e 2R n might seem more natural, the use of R n results in simpler computations and the difference between R ′ n and R n diminishes as the randomness of R n decreases with n → ∞.

Similar arguments imply that
. Finally, as 2 0 = plim n→∞ 1 n ‖ n ‖ 2 , by statement (b) of Lemma 2, we also have plim n→∞̂2 n = 2 0 . ◻ Theorem 5 Let ( n ) n∈ℕ satisfy Assumption 2' and let x n = (x n,i ) i≤n , n ∈ ℕ , be column vectors. Denote Q n = T n x n + A n n and assume that Var Q n > 0 for sufficiently large n ∈ ℕ . If ‖x n ‖ 2 + ‖A n ‖ 2 F = O(Var Q n ) , ‖A n ‖ 2 = o(Var Q n ) and max i≤n x 2 n,i = o(Var Q n ) , then Q n − Q n √ Var Q n converges in distribution to a standardised normal variable N(0, 1).
Proof of Theorem 5 is given in the supplementary material. The argument is based on bounds originally developed in Bhansali et al. (2007), where a CLT for quadratic forms of i.i.d. vectors is shown.
Corollary 1 Let ( n ) n∈ℕ satisfy Assumption 2' and let x n = (x n,i ) i≤n , n ∈ ℕ , be column vectors. Denote Q n = T n x n + A n n . If lim n→∞ Var Q n exists and is positive, ‖x n ‖ 2 + ‖A n ‖ 2 F = O(1) and ‖A n ‖ 2 + max i≤n x 2 n,i = o(1) , then Q n − Q n √ Var Q n converges in distribution to a standard normal variable N(0, 1).

Proof of Theorem 2
With S n defined in Assumption 7 and G n,r = W n,r Δ n ( 0 ) −1 , a straightforward calculation 29 shows that the consecutive entries of 1 √ n S n are We will show that 1 √ n S T n converges in distribution to N 0, Σ S . 30 2 n (̂n) = 1 n 30 Naturally, it is not sufficient to establish asymptotic normality of the above formulae, c.f. Lee (2004). Our argument follows by considering two cases and makes use of the standard Cramér-Wald theorem (see e.g. Billingsley (1995)).

Proof of Theorem 4
The proof relies on the same argument as the proof of Theorem 2, up to the point where our CLT is used to deduce asymptotic normality of the linear-quadratic form 1 √ n * S * , with as previously. Then it can be seen that, for arbitrary x n and A n we have x T n * n + ( * n ) T A n * n = x T n F n + ( n ) T F T A n F n , hence a linear quadratic form of * n is, in fact, a linear-quadratic form of n . Finally, it is sufficient to note that, in the case of 1 √ n * S * , by Assumptions A' and 4 we have (1) and lim n→∞ Var 1 √ n * S * = T Σ S * . Thus, Corollary 1 can be used to deduce that 1 √ n * S * converges in distribution to N 0, T Σ S * . Again, the remainder of the proof proceeds accordingly to the proof of Theorem 2. ◻ (C.6) ‖R n (̂n)‖ ≤ M n ‖Ĩ −1 n ‖