An alternative look at the linear regression model

An alternative look at the linear regression model is taken by proposing an original treatment of a full column rank model (design) matrix. In such a situation, the Moore-Penrose inverse of the matrix can be obtained by utilizing a particular formula which is applicable solely when the matrix to be inverted can be partitioned columnwise into two matrices with disjoint ranges. It turns out that this approach, besides simplifying derivations, provides novel insight into some of the notions involved in the model and reduces the computational costs needed to obtain the sought estimators. The paper also contains a numerical example based on astronomical observations of the position of Polaris, demonstrating the usefulness of the proposed approach.


Introduction
The problem of curve fitting on the basis of a finite number of observations arises in almost all areas where mathematics is applied, and one of the most powerful tools used for this purpose is the least squares method. Over the years, a rich body of results has appeared in the literature, providing indisputable evidence that the concepts and techniques of matrix analysis offer handy means of applying the method. The present paper constitutes a further contribution to this stream of considerations by demonstrating how an expression for the Moore-Penrose inverse of a columnwise partitioned matrix, derived in Baksalary and Baksalary (2007, Theorem 1), may be advantageously utilized to deal with problems originating from linear regression.

Among the benefits resulting from the proposed approach one may mention: a simplification of derivations, novel insight into the notions involved in the regression model, and a reduction of the computational costs necessary to obtain the sought estimators. Furthermore, by simplifying the inevitable mathematical operations, the proposed approach offers an attractive alternative for researchers who are not keen on exploiting matrix methods more advanced than necessary, or on utilizing software packages which do not provide comprehensive control over the processed data, as the approach enables one to perform the calculations almost "by hand", preserving insight into every step of the linear regression.
The aforementioned representation of the Moore-Penrose inverse established in Baksalary and Baksalary (2007, Theorem 1) is recalled in the following lemma.

Lemma 1.1 Let A be an n × m, m ≥ 2, real matrix columnwise partitioned as A = (A_1 : A_2), with A_i denoting n × m_i, i = 1, 2, matrices such that m_1 + m_2 = m. Furthermore, let the ranges of A_1 and A_2 be disjoint. Then the Moore-Penrose inverse of A is of the form

    A† = \begin{pmatrix} (Q_2 A_1)† \\ (Q_1 A_2)† \end{pmatrix},    (1)

where Q_i = I_n − A_i A_i†, i = 1, 2, is the orthogonal projector onto the null space of the transpose of A_i.

In the next section we briefly discuss particular linear regression models, shining a spotlight on issues in which the Moore-Penrose inverse of a columnwise partitioned matrix naturally emerges. These considerations are followed by Sect. 3, which contains an example demonstrating the applicability of the present approach. The data used in the example originate from the observations of the position of Polaris made on December 12, 1983, by S.G. Brewer, which were afterwards used by Pedler (1993) to develop a solution to an (as the author calls it) "astronomical problem" aimed at fitting a circle to a set of points. Section 4 provides a number of remarks concerned with the proposed approach.
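The representation (1) can be checked numerically. The following sketch, with a hypothetical random 6 × 4 matrix (generic random blocks have disjoint ranges almost surely), stacks the Moore-Penrose inverses of the two projected blocks and compares the result with the SVD-based pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: a 6 x 4 matrix A = (A1 : A2) whose blocks
# have disjoint ranges (random blocks satisfy this almost surely).
A1 = rng.standard_normal((6, 2))
A2 = rng.standard_normal((6, 2))
A = np.hstack((A1, A2))

# Q_i = I_n - A_i A_i^+ : orthogonal projector onto N(A_i').
Q1 = np.eye(6) - A1 @ np.linalg.pinv(A1)
Q2 = np.eye(6) - A2 @ np.linalg.pinv(A2)

# Representation (1): A^+ is the rowwise stack of (Q2 A1)^+ and (Q1 A2)^+.
A_pinv_partitioned = np.vstack((np.linalg.pinv(Q2 @ A1),
                                np.linalg.pinv(Q1 @ A2)))

# Agreement with the SVD-based pseudoinverse:
assert np.allclose(A_pinv_partitioned, np.linalg.pinv(A))
```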

Particular linear regression models
All matrices occurring in what follows have real entries and the superscript ' stands for the transpose. Let us consider the linear regression model

    y = Xβ + u,    (2)

where y is an n × 1 random vector of observations, X is an n × p known model (design) matrix of constants, β is a p × 1 vector of unknown parameters, and u is an n × 1 vector of unknown errors. The entries of the vector u = (u_1, u_2, ..., u_n)' are assumed to have a mean of zero and (unknown) variance σ², and each pair u_i, u_j, i ≠ j, is assumed to be uncorrelated, i.e., the expectation vector and the covariance matrix of u are E(u) = 0 and Cov(u) = σ²I_n, respectively. Customarily, the symbol I_n stands for the identity matrix of order n. We also assume that the matrix X is of full column rank. Then, the least squares estimator (LSE) of β is given by

    β̂ = (X'X)^{-1}X'y.

It is worth emphasizing that the assumption that X is of full column rank plays a crucial role, and is most often made to assure uniqueness of the estimator of β; see Puntanen et al. (2011, p. 34). It turns out that (X'X)^{-1}X' = X†, i.e., the Moore-Penrose inverse of X; see Appendix A.
To calculate X†, instead of bothering with the inverse of X'X, we may write the regressor matrix X in the columnwise partitioned form

    X = (X_1 : X_2),    (3)

where X_i, i = 1, 2, denote n × p_i matrices such that p_1 + p_2 = p. Since X is of full column rank, it follows that

    R(X_1) ∩ R(X_2) = {0},    (4)

where R(.) stands for the column space (range) of a matrix argument. According to Lemma 1.1, the Moore-Penrose inverse of a matrix of the form (3), such that (4) is satisfied, can be expressed as

    X† = \begin{pmatrix} (Q_2 X_1)† \\ (Q_1 X_2)† \end{pmatrix},    (5)

where Q_i = I_n − X_i X_i†, i = 1, 2, is the orthogonal projector onto N(X_i'), the null space of X_i'; see Appendix A. Note that the condition (4), under which the representation (5) of the Moore-Penrose inverse is valid, is weaker than the requirement that X specified in (3) be of full column rank (the assumption which will be extensively exploited in what follows).
Let us consider a simple linear regression model

    y = β_0 1 + β_1 x + u,

where 1 = (1, 1, ..., 1)' is the n × 1 vector of ones and x = (x_1, x_2, ..., x_n)' is the vector of observations on one regressor variable. The vector u = (u_1, u_2, ..., u_n)' consists of the unknown errors. To obtain the LSE of the parameter vector β = (β_0, β_1)', we may partition the n × 2 matrix X as

    X = (1 : x),    (6)

where R(1) ∩ R(x) = {0}, since we assume that X is of full column rank. By (5) it follows that

    X† = \begin{pmatrix} (Q_x 1)† \\ (Q_1 x)† \end{pmatrix},    (7)

where Q_x 1 = (I_n − xx†)1 and Q_1 x = (I_n − 11†)x are both column vectors. Thus, by the identity (A1) given in Appendix A,

    (Q_x 1)† = (1'Q_x 1)^{-1} 1'Q_x   and   (Q_1 x)† = (x'Q_1 x)^{-1} x'Q_1.    (8)

Consequently, the LSE of β = (β_0, β_1)' is

    β̂ = X†y.    (9)

From (8) we obtain

    β̂_0 = (1'Q_x 1)^{-1} 1'Q_x y   and   β̂_1 = (x'Q_1 x)^{-1} x'Q_1 y.

Let us use the symbol H to denote the so-called hat-matrix, which represents the orthogonal projector onto R(X), i.e., H = XX†. Then, the identities (6)-(8) entail

    H = 1(Q_x 1)† + x(Q_1 x)†,    (10)

which means that the hat-matrix is a sum of two matrices of rank one. Furthermore, each of the summands involved in (10) is idempotent. This observation leads to the conclusion (see, e.g., Rao and Mitra (1971, Theorem 5.1.2)) that the matrices necessarily commute and their product is equal to the zero matrix, i.e.,

    1(Q_x 1)† x(Q_1 x)† = 0 = x(Q_1 x)† 1(Q_x 1)†.

Since 1 ∈ R(X), it follows that H1 = 1. Hence, by denoting ŷ = Hy, we see that 1'ŷ = 1'Hy = 1'y, which gives Σ_{i=1}^n ŷ_i = Σ_{i=1}^n y_i. Furthermore, by putting û = (I_n − H)y, we obtain 1'û = 1'(I_n − H)y = 0, so Σ_{i=1}^n û_i = 0. Another consequence of 1 ∈ R(X) is the identity

    y'(I_n − J)y = y'(H − J)y + y'(I_n − H)y,    (11)

with J = 11†; see Puntanen et al. (2011, Proposition 8.5). Alternatively, the equality (11) can be expressed as SST = SSR + SSE, where SST stands for the total sum of squares, SSR for the regression sum of squares, and SSE for the residual sum of squares. The coefficient of determination is defined as

    R² = SSR/SST.    (12)

Clearly, R² ≥ 0. Another observation is that

    H − J ≤_L I_n − J,

where the symbol ≤_L denotes the Löwner partial ordering, from where we conclude that 0 ≤ R² ≤ 1. The fact that the values of R² are restricted to the interval [0, 1] is known in the literature (see, e.g., Davidson and MacKinnon (1993, p. 14)), but usually it is demonstrated in a rather more involved way than in the present paper.
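The computations above reduce to inner products of vectors. The following numerical sketch, with hypothetical simulated data, evaluates the LSE components from the projector-based formulas, compares them with the normal-equations solution, and checks that R² indeed falls in [0, 1].

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data for a simple linear regression y = b0 + b1*x + error.
n = 20
x = rng.uniform(0.0, 10.0, n)
y = 2.5 + 0.8 * x + rng.standard_normal(n) * 0.3
one = np.ones(n)

# Projectors Q_x = I - x x^+ and Q_1 = I - 1 1^+ (x and 1 are columns).
Qx = np.eye(n) - np.outer(x, x) / (x @ x)
Q1 = np.eye(n) - np.outer(one, one) / n

# LSE components: each is a ratio of two inner products,
# so no matrix inversion is needed.
b0 = (one @ Qx @ y) / (one @ Qx @ one)
b1 = (x @ Q1 @ y) / (x @ Q1 @ x)

# Same estimates from the textbook normal equations:
X = np.column_stack((one, x))
ref = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose([b0, b1], ref)

# R^2 stays in [0, 1], in line with the Loewner-ordering argument.
H = X @ np.linalg.pinv(X)
J = np.outer(one, one) / n
SST = y @ (np.eye(n) - J) @ y
SSR = y @ (H - J) @ y
assert 0.0 <= SSR / SST <= 1.0
```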
Consider now the general linear model with intercept

    y = β_0 1 + Xβ + u,

where X and β are of dimensions n × (p − 1) and (p − 1) × 1, respectively, and (1 : X) is assumed to be of full column rank. Then, by analogy to (9), the LSE of (β_0, β')' is

    \begin{pmatrix} β̂_0 \\ β̂ \end{pmatrix} = (1 : X)†y = \begin{pmatrix} (Q_X 1)†y \\ (Q_1 X)†y \end{pmatrix}.    (13)

On account of (A1), we obtain (Q_X 1)† = (1'Q_X 1)^{-1} 1'Q_X. Hence, (Q_X 1)† = (Q_X 1)†Q_X. Similarly, we arrive at (Q_1 X)† = (Q_1 X)†Q_1. In consequence, since E(y) = β_0 1 + Xβ, and since both Q_1 X and Q_X 1 are of full column rank, we have

    E(β̂_0) = β_0   and   E(β̂) = β,

i.e., the estimators are unbiased. Note that the hat-matrix

    H = 1(Q_X 1)† + X(Q_1 X)†    (14)

turns out to be a sum of two matrices of which the former is of rank one and the latter is of the same rank as the matrix X, i.e., p − 1. Similarly as above, both summands which determine the hat-matrix specified in (14) are commuting idempotents whose product equals the zero matrix. The formula (14) (as well as its particular case (10)) can be viewed as an alternative to the representation of H as a sum of two orthogonal projectors, which reads H = J + P_{CX}, where P_{CX} = CX(CX)†, with C denoting the so-called centering matrix defined as C = I_n − J; see Puntanen et al. (2011, formula (8.108)). It is clear that JP_{CX} = 0.
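The decomposition (14) of the hat-matrix can be illustrated numerically. The sketch below, with a hypothetical random design, builds the two summands from the projectors, and checks that their sum is the orthogonal projector onto R(1 : X), that each summand is idempotent, and that their product vanishes.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical model with intercept: columns (1 : X), X of size n x (p-1).
n, p_minus_1 = 12, 3
X = rng.standard_normal((n, p_minus_1))
one = np.ones((n, 1))

Q1 = np.eye(n) - one @ np.linalg.pinv(one)   # projector onto N(1')
QX = np.eye(n) - X @ np.linalg.pinv(X)       # projector onto N(X')

# Hat matrix as in (14): rank-one summand plus rank-(p-1) summand.
P1 = one @ np.linalg.pinv(QX @ one)
P2 = X @ np.linalg.pinv(Q1 @ X)
H = P1 + P2

# H must equal the orthogonal projector onto R(1 : X) ...
M = np.hstack((one, X))
assert np.allclose(H, M @ np.linalg.pinv(M))
# ... and the two summands are idempotent with zero product.
assert np.allclose(P1 @ P1, P1) and np.allclose(P2 @ P2, P2)
assert np.allclose(P1 @ P2, np.zeros((n, n)), atol=1e-10)
```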

Applications
As in Pedler (1993, Sect. 6), we consider now the linear regression model with four regressors, denoted by (15), which can be written in the matrix form (16). As the matrix X is of full column rank, we can determine the LSE of β by applying the representation derived in Baksalary and Trenkler (2021, Example 1) as a consequence of Baksalary and Baksalary (2007, Theorem 1). In order to take advantage of this result, let the scalars A, B, M, N and the vectors e, f, g, h be defined as in (17)-(19), where ā = (1/n)Σ_{i=1}^n a_i and b̄ = (1/n)Σ_{i=1}^n b_i. Then, on account of Baksalary and Trenkler (2021, formula (8)), we obtain a very handy representation (20) of the Moore-Penrose inverse of the model matrix (16). Hence, analogously to (13), the LSE of the parameter vector β is given by (21).

Example 3.1 As mentioned in the Introduction, Pedler (1993) considered the "astronomical problem" of fitting a circle to a set of points and solved it from first principles. The considerations in Pedler (1993) also contain an example which exploits the data collected by S.G. Brewer from the observations of the position of Polaris made on December 12, 1983. The data are given in Table 1.

Let us now demonstrate the usefulness of the expressions (20) and (21) by applying them to this set of real data. The data provided in Table 1 enable one to calculate the scalars A, B, M, N as well as the vectors e, f, g, h defined in (17)-(19). Hence, we obtain the vector inner products of the four vectors and the vector z. It should be emphasized that these straightforward calculations involve only scalars and vectors and incur a relatively low computational cost; for details on the advantages of the algorithm for calculating the Moore-Penrose inverse which takes into account columnwise partitioning into range disjoint matrices, see Baksalary and Trenkler (2021). The outcomes of these computations are provided in Table 2.
In the light of (21), we arrive at the components of the estimator β̂, which are given in Table 3. As expected, the values coincide with the ones given in Pedler (1993, Table 2). The values of the total sum of squares SST, the regression sum of squares SSR, the residual sum of squares SSE, and the coefficient of determination R² are provided in Table 4.

Supplementary remarks
In the linear regression models considered in Sect. 2 it was assumed that the model matrices are of full column rank and that the vector 1 is one of the columns. Such assumptions are well justified, as they correspond to several common situations. However, the present approach makes it possible to generalize the considerations by weakening the assumption that the model matrix is of full column rank to the requirement that it can be partitioned columnwise into two range disjoint matrices, and by relaxing the assumption that one of the columns is the vector 1. To demonstrate this fact, let us assume that the model matrix X in (2) is partitioned in accordance with (3), i.e.,

    y = X_1 β_1 + X_2 β_2 + u,    (22)

where the vectors β_i are of orders p_i × 1, i = 1, 2. Provided that the condition (4) holds, by (5) we conclude that the LSE of (β_1', β_2')' is given by

    \begin{pmatrix} β̂_1 \\ β̂_2 \end{pmatrix} = \begin{pmatrix} (Q_2 X_1)†y \\ (Q_1 X_2)†y \end{pmatrix}.    (23)

Visibly, the expression (13) is obtained from (23) by taking X_1 = 1 and X_2 = X. Furthermore, from (3) and (5) we obtain

    H = X_1(Q_2 X_1)† + X_2(Q_1 X_2)†,    (24)

which under X_1 = 1 and X_2 = X leads to the formula (14). Note that the expression (24) is given in Puntanen et al. (2011, Proposition 16.1) along with its equivalent counterparts, one of which is (4). Another evidence of the applicability of Lemma 1.1 in statistical estimation theory was provided in Baksalary and Trenkler (2021), by deriving an original representation for the best linear unbiased estimator (BLUE) of Xβ under the generalized version of the (consistent) linear model (2) with Cov(u) = σ²V, where V denotes a known n × n positive semidefinite matrix. It was shown in Baksalary and Trenkler (2021, Example 4) that Gy with

    G = X(Q_{VQ_X} X)†    (25)

is the BLUE of Xβ. Analogously, we can derive representations for the BLUEs of both X_1 β_1 and X_2 β_2 under the model (22) when Cov(u) = σ²V. From Puntanen et al. (2011, formula (10.5)) it follows that G_1 y is the BLUE of an (estimable) parametric function X_1 β_1 if G_1 satisfies the equation

    G_1 (X_1 : X_2 : VQ_X) = (X_1 : 0 : 0).    (26)
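A short numerical sketch of (23), with a hypothetical random range-disjoint partition, computes the two blocks of the LSE separately and verifies that their stack agrees with the ordinary least squares solution.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical partitioned model y = X1 b1 + X2 b2 + u with range-disjoint
# blocks; (23) gives the LSE blockwise without forming (X'X)^{-1}.
n = 15
X1 = rng.standard_normal((n, 2))
X2 = rng.standard_normal((n, 3))
y = rng.standard_normal(n)

Q1 = np.eye(n) - X1 @ np.linalg.pinv(X1)
Q2 = np.eye(n) - X2 @ np.linalg.pinv(X2)

b1_hat = np.linalg.pinv(Q2 @ X1) @ y
b2_hat = np.linalg.pinv(Q1 @ X2) @ y

# The stacked blocks agree with the ordinary least squares solution:
X = np.hstack((X1, X2))
ref, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(np.concatenate((b1_hat, b2_hat)), ref)
```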
On account of (4), which is a necessary and sufficient condition for X_1 β_1 to be estimable, we arrive at R(X_1) ∩ R(X_2 : VQ_X) = {0}. In consequence, we can utilize the representation of the Moore-Penrose inverse provided in Lemma 1.1, which leads to the conclusion that one of the solutions to (26) is of the form

    G_1 = X_1 (Q_{(X_2:VQ_X)} X_1)†.    (27)

Hence,

    BLUE(X_1 β_1) = X_1 (X_1' Q_{(X_2:VQ_X)} X_1)† X_1' Q_{(X_2:VQ_X)} y.    (28)

Similarly, by interchanging the subscripts "1" and "2" in (26), we obtain

    BLUE(X_2 β_2) = X_2 (X_2' Q_{(X_1:VQ_X)} X_2)† X_2' Q_{(X_1:VQ_X)} y.
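The defining equation (26) for G_1 can be checked numerically. The sketch below uses hypothetical random X_1, X_2 and a random positive semidefinite V, forms G_1 as in (27), and verifies that it maps X_1 to itself while annihilating X_2 and VQ_X.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical setup: X = (X1 : X2), Cov(u) = sigma^2 V with V positive
# semidefinite; we check that G1 = X1 (Q_(X2:VQX) X1)^+ solves (26).
n = 10
X1 = rng.standard_normal((n, 2))
X2 = rng.standard_normal((n, 3))
X = np.hstack((X1, X2))
B = rng.standard_normal((n, n))
V = B @ B.T                                   # positive semidefinite

QX = np.eye(n) - X @ np.linalg.pinv(X)        # projector onto N(X')
M = np.hstack((X2, V @ QX))
QM = np.eye(n) - M @ np.linalg.pinv(M)        # projector onto N(M')

G1 = X1 @ np.linalg.pinv(QM @ X1)

# G1 (X1 : X2 : VQX) = (X1 : 0 : 0):
assert np.allclose(G1 @ X1, X1)
assert np.allclose(G1 @ X2, np.zeros((n, 3)), atol=1e-8)
assert np.allclose(G1 @ V @ QX, np.zeros((n, n)), atol=1e-8)
```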
The paper is concluded with some remarks concerning the advantages of utilizing the representation of the Moore-Penrose inverse provided in Lemma 1.1 from the computational point of view. In comparison with the methods of determining the inverse based on the singular value decomposition (SVD), which are exploited in several popular software packages (e.g., Matlab, Mathematica, or R), an algorithm based on the representation (1) seems to have three main advantages, each leading to a reduction of computational costs. The first one is that it reduces the sizes of the matrices to be Moore-Penrose inverted: instead of the inverse of an n × m matrix A, we need to compute two inverses of the matrices Q_2 A_1 and Q_1 A_2 of orders n × m_1 and n × m_2, respectively, where m_1 + m_2 = m; as several software tools impose limits on the sizes of matrices which can be stored, one can encounter a situation in which the inverse of A exceeds the limit, while the inverses of Q_2 A_1 and Q_1 A_2 are still manageable. The second benefit is that the algorithm allows both block entries occurring in the inverse to be computed (almost) simultaneously: in the light of Baksalary and Trenkler (2021, formula (6)), the two entries involved in the representation (1) are linked by the identity

    (Q_2 A_1)† = A_1†(I_n − A_2 (Q_1 A_2)†),

which means that one of the Moore-Penrose inverses involved in the representation can be derived from the knowledge of the other. The third advantage of the algorithm is that it can be executed iteratively, being applied in each subsequent step to matrices of smaller order: from Baksalary and Trenkler (2021, Lemma 1) it follows that when A is of full column rank, then Q_2 A_1 and Q_1 A_2 are of full column rank as well, which means that the inverses (Q_2 A_1)† and (Q_1 A_2)† can be computed by applying the same algorithm; the procedure might be carried out iteratively till the matrices to be inverted are all reduced to (row) vectors.
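The iterative scheme described in the third remark can be sketched as follows: split the columns in half, Moore-Penrose invert the two projected blocks by recursion, and stack; the recursion bottoms out at single columns, whose pseudoinverse is given by (A1). This is a minimal illustration, not an optimized implementation.

```python
import numpy as np

def pinv_partitioned(A):
    """Recursive Moore-Penrose inverse of a full column rank matrix A,
    based on the columnwise-partition representation (1)."""
    n, m = A.shape
    if m == 1:
        v = A[:, 0]
        # Pseudoinverse of a nonzero column vector: v' / (v'v), cf. (A1).
        return (v / (v @ v)).reshape(1, n)
    k = m // 2
    A1, A2 = A[:, :k], A[:, k:]
    # Projectors onto N(A1') and N(A2'), built recursively as well.
    Q1 = np.eye(n) - A1 @ pinv_partitioned(A1)
    Q2 = np.eye(n) - A2 @ pinv_partitioned(A2)
    # Stack the pseudoinverses of the projected blocks, as in (1).
    return np.vstack((pinv_partitioned(Q2 @ A1),
                      pinv_partitioned(Q1 @ A2)))

rng = np.random.default_rng(4)
A = rng.standard_normal((8, 5))      # full column rank almost surely
assert np.allclose(pinv_partitioned(A), np.linalg.pinv(A))
```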
Appendix A

Given an n × m real matrix S, the Moore-Penrose inverse of S is the unique m × n matrix S† satisfying the four equations SS†S = S, S†SS† = S†, (SS†)' = SS†, and (S†S)' = S†S. It can be verified that S† = (S'S)†S', which takes the form S† = (S'S)^{-1}S' when S is of full column rank. In particular, for a nonzero column vector s,

    s† = (s's)^{-1}s'.    (A1)

Another relevant property of the Moore-Penrose inverse is that it offers a handy way to represent orthogonal projectors in R^n (symmetric idempotent matrices of order n). To be precise, an n × n matrix P is an orthogonal projector if and only if it is expressible as SS† for some n × m matrix S. Then SS† is the orthogonal projector onto R(S) and, consequently, I_n − SS† is the orthogonal projector onto the orthogonal complement of R(S), which coincides with N(S'). Similarly, S†S and I_m − S†S are the orthogonal projectors onto R(S') and N(S), respectively, where R(S') and N(S) constitute an orthogonal direct sum decomposition of R^m. An important feature is that there is a one-to-one correspondence between an orthogonal projector and the subspace onto which it projects.
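The properties recalled above can be verified numerically for a hypothetical random matrix: the four defining equations, the full-column-rank formula, and the projector property of SS†.

```python
import numpy as np

rng = np.random.default_rng(5)
S = rng.standard_normal((5, 3))      # full column rank almost surely
S_pinv = np.linalg.pinv(S)

# The four equations characterizing the Moore-Penrose inverse:
assert np.allclose(S @ S_pinv @ S, S)
assert np.allclose(S_pinv @ S @ S_pinv, S_pinv)
assert np.allclose((S @ S_pinv).T, S @ S_pinv)
assert np.allclose((S_pinv @ S).T, S_pinv @ S)

# At full column rank, S^+ = (S'S)^{-1} S':
assert np.allclose(S_pinv, np.linalg.solve(S.T @ S, S.T))

# S S^+ projects orthogonally onto R(S): it fixes every vector in R(S).
w = S @ rng.standard_normal(3)
assert np.allclose(S @ S_pinv @ w, w)
```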