Adding a constant to the variables in regression through the origin: the effect on the uncentered $R^2$

The effect on the uncentered $R^2$ of adding a constant to the dependent variable is well known for linear regression models with an intercept. In this paper, we investigate the effect of adding a constant to the variables on the uncentered $R^2$ when a linear regression through the origin is used. In particular, we consider two cases. First, a constant $c \in \mathbb{R}$ is added to all observations of the dependent variable. Second, a constant $c \in \mathbb{R}$ is added to all observations of both the dependent variable and at least one independent variable. We show that in both cases there is an artificial variation of the uncentered $R^2$: this quantity is not invariant under location change.


Introduction
Linear regression is one of the most familiar tools for modeling the linear relationship between a dependent variable and one or more independent variables. In this paper, we consider a linear no-intercept model. A very important question is how to measure the fit of the model to a set of observations. It is well known that the centered coefficient of determination, $R^2$, can assume negative values when the model does not contain an intercept. This happens because the sum of squared residuals may be larger than the total (centered) sum of squares. Thus, the centered coefficient of determination does not make sense for regressions without a constant term. Several authors have suggested that, in these cases, a more appropriate measure of goodness of fit is the so-called uncentered coefficient of determination (see, for example, Hahn (1977) and Montgomery et al. (2012), p. 48). A description of the situations in which regression through the origin is appropriate is provided by Eisenhauer (2003).

In this paper, we investigate the effect of adding a constant to the variables on the uncentered coefficient of determination when a linear no-intercept model is used. In particular, we consider two cases. First, a constant $c \in \mathbb{R}$ is added to all observations of the dependent variable. Second, a constant $c \in \mathbb{R}$ is added to all observations of the dependent variable and to all observations of at least one independent variable. We show that in both cases there is an artificial variation of the uncentered coefficient of determination.
We believe that this result is a new contribution to the field. In fact, it is well known that, when a model containing an intercept is considered, the uncentered coefficient of determination is not invariant to changes of measuring units whereby a constant is added to all observations of the dependent variable (see, for example, Davidson and MacKinnon (1999)). However, we have not been able to find a reference showing that the same happens when a regression through the origin is considered.
We also show that, when the regressors do not include a constant term, adding a large constant to all observations of the dependent variable brings the uncentered coefficient of determination very close to a limit value that is less than 1. This is another relevant difference with respect to the model with an intercept, for which this limit value is 1.
The rest of the paper is organized as follows. Section 2 introduces the definition of the uncentered coefficient of determination. Section 3 presents the main results. Section 4 considers an illustrative example. Section 5 summarizes and offers concluding remarks.

The uncentered $R^2$
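Let $y = [y_1, y_2, \dots, y_n]^T \in \mathbb{R}^n$ be a vector of "observations" and $X = [x_1, \dots, x_k]$ an $n \times k$ "data matrix", each column $x_i$ of which is a vector in $\mathbb{R}^n$, with $n > k$. The linear regression model assumes that

$$y = X\beta + u,$$

where $\beta \in \mathbb{R}^k$ is a vector of unknown coefficients, and $u$ is an $n \times 1$ vector consisting of random disturbances.

Let $\hat{y} = [\hat{y}_1, \hat{y}_2, \dots, \hat{y}_n]^T = X(X^T X)^{-1} X^T y$ be the vector of predicted $y$'s from ordinary least squares (OLS). The uncentered coefficient of determination $R^2_{uy}$ is defined as

$$R^2_{uy} = \frac{\|\hat{y}\|^2}{\|y\|^2},$$

where $\|\cdot\|$ is the Euclidean norm. It is well known (see, for example, Triacca and Volodin (2012)) that $R^2_{uy}$ is equal to the square of the cosine of the angle $\theta$ between $y$ and $\hat{y}$, that is

$$R^2_{uy} = \cos^2\theta = \left( \frac{\langle y, \hat{y} \rangle}{\|y\| \, \|\hat{y}\|} \right)^2 .$$

Thus $0 \le R^2_{uy} \le 1$ and it measures how close the vectors $y$ and $\hat{y}$ are in terms of their directions. When $R^2_{uy} = 1$, $y$ and $\hat{y}$ are collinear, so that $y$ must be in the column space of $X$, denoted by $\mathrm{col}(X)$. When $R^2_{uy} = 0$, $y$ and $\hat{y}$ are orthogonal, so that $y$ is in $\mathrm{col}(X)^{\perp}$ (the orthogonal complement of $\mathrm{col}(X)$). Some interesting properties concerning the uncentered $R^2$ are discussed in Triacca and Volodin (2012). A useful discussion of centered vs. uncentered $R^2$ can be found in Wooldridge (2016, p. 214) and Baltagi (2008, p. 72).

Now, we introduce the $n \times n$ matrix $H$ defined as

$$H = X(X^T X)^{-1} X^T .$$

It is called the 'hat matrix'. The typical element of $H$ is $h_{ij}$, denoting the element of row $i$ and column $j$. We note that $\hat{y} = Hy$. Thus the hat matrix transforms the observed vector $y$ into its LS estimate $\hat{y}$. It can easily be verified that $H$ is idempotent and symmetric. The hat matrix is the orthogonal projector onto $\mathrm{col}(X)$. The matrix that projects a vector orthogonally onto $\mathrm{col}(X)^{\perp}$ is

$$M = I - H,$$

where $I$ is the $n \times n$ identity matrix. We have that $I = H + M$. Thus, for $\mathbf{1} = (1, 1, \dots, 1)^T$,

$$\mathbf{1} = H\mathbf{1} + M\mathbf{1},$$

with $H\mathbf{1} \perp M\mathbf{1}$. By the Pythagorean theorem, it follows that

$$\|\mathbf{1}\|^2 = \|H\mathbf{1}\|^2 + \|M\mathbf{1}\|^2 .$$

Further, we have that

$$\langle \mathbf{1}, H\mathbf{1} \rangle = \|\mathbf{1}\| \, \|H\mathbf{1}\| \cos\varphi,$$

where $\varphi$ is the angle between $\mathbf{1}$ and $H\mathbf{1}$. Since $\langle \mathbf{1}, H\mathbf{1} \rangle = \|H\mathbf{1}\|^2$, we obtain

$$\cos\varphi = \frac{\|H\mathbf{1}\|}{\|\mathbf{1}\|}$$

and hence

$$R^2_{u1} = \frac{\|H\mathbf{1}\|^2}{\|\mathbf{1}\|^2} = \cos^2\varphi \le 1 .$$

Now, we observe that if $R^2_{u1} = 1$, then $\langle \mathbf{1}, H\mathbf{1} \rangle = \|\mathbf{1}\| \, \|H\mathbf{1}\|$. This implies that $\mathbf{1} = H\mathbf{1} \in \mathrm{col}(X)$. Thus, we can conclude that if $\mathbf{1} \notin \mathrm{col}(X)$, then

$$R^2_{u1} < 1 .$$

We will use this result in the sequel. Considering the hat matrix, the uncentered coefficient of determination can be re-written as

$$R^2_{uy} = \frac{\|Hy\|^2}{\|y\|^2} = \frac{y^T H y}{y^T y} .$$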
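The definitions in this section can be checked numerically. The following is a minimal sketch, not part of the original analysis: the helper names `hat_matrix` and `uncentered_r2` and the simulated data are assumptions introduced only for illustration. It builds the orthogonal projector onto the column space of a data matrix without a constant column, and compares the ratio of squared norms with the squared cosine of the angle between the observed and fitted vectors.

```python
import numpy as np

def hat_matrix(X):
    """Orthogonal projector onto col(X); assumes X has full column rank."""
    return X @ np.linalg.solve(X.T @ X, X.T)

def uncentered_r2(X, y):
    """Uncentered R^2 = ||Hy||^2 / ||y||^2 for a regression through the origin."""
    y_hat = hat_matrix(X) @ y
    return float(y_hat @ y_hat) / float(y @ y)

# Hypothetical simulated data (an assumption, not data from the paper):
# a 20 x 2 data matrix without a constant column and a response vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = rng.normal(size=20)

H = hat_matrix(X)
y_hat = H @ y

# Squared cosine of the angle between y and its OLS fit.
cos_theta = (y @ y_hat) / (np.linalg.norm(y) * np.linalg.norm(y_hat))
r2 = uncentered_r2(X, y)

# Uncentered R^2 obtained when the vector of ones is regressed on X.
r2_u1 = uncentered_r2(X, np.ones(20))

print("symmetric: ", np.allclose(H, H.T))
print("idempotent:", np.allclose(H @ H, H))
print("R2_uy:", r2, " cos^2(theta):", cos_theta ** 2)
print("R2_u1:", r2_u1)
```

Using `np.linalg.solve` instead of an explicit matrix inverse is a design choice made here for numerical stability; the projector it produces is the same matrix.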

Main results
It is important to note that if a nonzero constant $c$ is added to all observations of the dependent variable, that is, if $y$ is replaced by $z = y + c\mathbf{1}$ with $\mathbf{1} = (1, 1, \dots, 1)^T$, the resulting uncentered coefficient of determination, $R^2_{uz} = \|\hat{z}\|^2 / \|z\|^2$, in general changes. This happens since the angle between $z$ and $\hat{z} = X\hat{\beta}_z$, where $\hat{\beta}_z = (X^T X)^{-1} X^T z$, is different from the angle between $y$ and $\hat{y}$.
Two questions are of interest: what is the limit of $R^2_{uz}$ as $c \to +\infty$? And what is the limit of the distance between $z$ and $\hat{z}$ as $c \to +\infty$?
We also analyze the behavior of $R^2_{uz}$ in the case in which a constant $c \in \mathbb{R}$ is also added to all observations of at least one independent variable $x_i$.
We observe that the first issue has been investigated in the literature when the matrix $X$ includes a constant (see, for example, Davidson and MacKinnon (1999, p. 75)). Our framework is different since we consider the case in which a linear no-intercept model is used. In particular, we assume that $\mathbf{1} \notin \mathrm{col}(X)$.

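Consider the regression model

$$z = X\beta + u, \qquad z = y + c\mathbf{1}.$$

We have that

$$R^2_{uz} = \frac{\|Hz\|^2}{\|z\|^2} = \frac{\|Hy + cH\mathbf{1}\|^2}{\|y + c\mathbf{1}\|^2},$$

where $H = X(X^T X)^{-1} X^T$ is the hat matrix. Since $\mathbf{1} \notin \mathrm{col}(X)$, we have that

$$\lim_{c \to +\infty} R^2_{uz} = \frac{\|H\mathbf{1}\|^2}{\|\mathbf{1}\|^2} = R^2_{u1} < 1 .$$

This happens because $z = y + c\mathbf{1}$ becomes collinear to $\mathbf{1}$ as $c \to +\infty$. In fact, since the cosine of the angle $\theta_{z1}$ between $z$ and $\mathbf{1}$ is

$$\cos\theta_{z1} = \frac{\langle z, \mathbf{1} \rangle}{\|z\| \, \|\mathbf{1}\|} = \frac{\langle y, \mathbf{1} \rangle + cn}{\sqrt{n} \, \|y + c\mathbf{1}\|},$$

it follows that

$$\lim_{c \to +\infty} \cos\theta_{z1} = 1 .$$

It is interesting to note that if we assume that $\mathbf{1} \in \mathrm{col}(X)$, then $H\mathbf{1} = \mathbf{1}$ and hence

$$\lim_{c \to +\infty} R^2_{uz} = 1 .$$

It is important to note that if there exists a vector $v \in \mathrm{col}(X)$ such that $\|\mathbf{1} - v\| \approx 0$, then $R^2_{u1} \approx 1$. In fact, because $H\mathbf{1}$ is the point of $\mathrm{col}(X)$ closest to $\mathbf{1}$,

$$\|\mathbf{1} - H\mathbf{1}\| \le \|\mathbf{1} - v\|,$$

we have that $\|\mathbf{1} - H\mathbf{1}\| \approx 0$. This implies $\|\mathbf{1}\| \approx \|H\mathbf{1}\|$ and hence $R^2_{u1} \approx 1$.

Further, with $M = I - H$ denoting the orthogonal projector onto $\mathrm{col}(X)^{\perp}$, we observe that

$$\|z - \hat{z}\| = \|Mz\| = \|My + cM\mathbf{1}\|$$

and, for $c > 0$,

$$\|My + cM\mathbf{1}\| \ge c\,\|M\mathbf{1}\| - \|My\| .$$

If $\mathbf{1} \notin \mathrm{col}(X)$, we have that $\|M\mathbf{1}\| > 0$ and hence it follows that

$$\lim_{c \to +\infty} \|z - \hat{z}\| = +\infty .$$

Thus, when the regressors do not include a constant term, the addition of a large constant to all observations of the dependent variable makes the distance between the vector of observations and its estimate very large, but this does not necessarily bring $R^2_{uz}$ close to 0. In this case, $R^2_{uz}$ converges to $R^2_{u1}$ as $c \to +\infty$, and $R^2_{u1}$ may be very close to 1. This is a very unsatisfactory feature of the uncentered coefficient of determination. An appropriate measure of fit should not be affected by the location of the dependent variable.

To illustrate this point, we consider the following ten $(x, y)$ pairs of hypothetical observations: (5.1, 0.9), (4.2, 1.1), (6.5, −0.7), (5.3, −1.3), (3.1, 0.8), (6.2, 1.5), (5.8, 0.1), (3.2, −0.1), (4.7, 1.4), (2.7, 1.3). We estimate the following model:

$$z_i = \beta x_i + u_i, \qquad z_i = y_i + c, \quad i = 1, \dots, 10 .$$

The value of the uncentered $R^2$ for this model with $c = 0$ is 0.156413, indicating a very poor fit on the original data. However, by adding a constant to all observations of the dependent variable we can produce an artificial increase in the uncentered $R^2$. Considering $c = 5, 20, 100$, we obtain $R^2_{uz} = 0.881614, 0.922237, 0.929368$, respectively. Clearly, $R^2_{uz}$ converges to $R^2_{u1} = 0.930829$ as $c \to +\infty$.

Finally, we note that by adding a sufficiently large constant to the dependent variable and to at least one column of $X$, we can make $R^2_{uz}$ as close as we wish to 1.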
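In order to show this, without loss of generality, we pose

$$X_c = [x_1 + c\mathbf{1}, x_2, \dots, x_k], \qquad z = y + c\mathbf{1},$$

with $c \in \mathbb{R}$, and we consider the vector $\hat{z} = X_c \hat{\beta}$, where $\hat{\beta} = [\hat{\beta}_1, \hat{\beta}_2, \dots, \hat{\beta}_k]^T = (X_c^T X_c)^{-1} X_c^T z$. Since $\hat{z}$ is the point of $\mathrm{col}(X_c)$ closest to $z$ and $x_1 + c\mathbf{1} \in \mathrm{col}(X_c)$, we have that

$$\|z - \hat{z}\| \le \|z - (x_1 + c\mathbf{1})\| = \|y - x_1\|$$

and

$$\lim_{c \to +\infty} \|z\| = +\infty .$$

Thus we can conclude that

$$\lim_{c \to +\infty} R^2_{uz} = \lim_{c \to +\infty} \left( 1 - \frac{\|z - \hat{z}\|^2}{\|z\|^2} \right) = 1 .$$

Adding a sufficiently large constant to the dependent variable and to at least one independent variable of the model, we can make $R^2_{uz}$ as close as we wish to 1.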
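The figures reported for the ten hypothetical observations can be reproduced with a few lines of NumPy. The sketch below is only illustrative: the helper `fit_origin` is a name introduced here, and the single-regressor closed form for the OLS slope is used in place of a general matrix computation. It shifts the dependent variable alone, then shifts both variables by the same constant, and records the uncentered coefficient of determination together with the distance between the observations and their fitted values.

```python
import numpy as np

# The ten hypothetical (x, y) observations listed in the text.
x = np.array([5.1, 4.2, 6.5, 5.3, 3.1, 6.2, 5.8, 3.2, 4.7, 2.7])
y = np.array([0.9, 1.1, -0.7, -1.3, 0.8, 1.5, 0.1, -0.1, 1.4, 1.3])
ones = np.ones_like(x)

def fit_origin(x, z):
    """Regress z on x without intercept.

    Returns the uncentered R^2 and the Euclidean distance ||z - z_hat||.
    """
    beta = (x @ z) / (x @ x)          # OLS slope through the origin
    z_hat = beta * x
    r2 = (z_hat @ z_hat) / (z @ z)
    return float(r2), float(np.linalg.norm(z - z_hat))

# Limit value: uncentered R^2 from regressing the vector of ones on x.
r2_u1, _ = fit_origin(x, ones)

constants = (0, 5, 20, 100)
# Case 1: the constant is added to the dependent variable only.
shift_y = {c: fit_origin(x, y + c) for c in constants}
# Case 2: the constant is added to both the dependent variable and the regressor.
shift_both = {c: fit_origin(x + c, y + c) for c in constants}

for c in constants:
    print(f"c={c:>3}  y shifted: R2={shift_y[c][0]:.6f}, dist={shift_y[c][1]:.3f}"
          f"  both shifted: R2={shift_both[c][0]:.6f}")
print(f"R2_u1 = {r2_u1:.6f}")
```

In the first case the distance between the observations and the fitted values grows with the constant while the coefficient settles near its limit; in the second case the distance stays bounded by the distance between the two original vectors, which is why the coefficient approaches 1.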

An illustrative example
In this section, in order to illustrate the obtained results, we consider an empirical example. The data have been taken from the UK national weather service, the Met Office (https://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/oxforddata.txt).
In particular, we use the following monthly time series:
• Mean daily maximum temperature (tmax)
• Mean daily minimum temperature (tmin)
• Total rainfall (rain)
• Total sunshine duration (sun)
for the 17-year period from 2001 to 2017 at Oxford (UK). We investigate the relationship between tmax and, respectively, rain, sun, and tmin. In all cases, we estimate a simple linear model without intercept,

$$y_i = \beta x_i + u_i .$$

Our first case concerns the relationship between monthly mean maximum temperature and rainfall. The temperature, $y_i$, is expressed on the Celsius scale. Fitting a simple linear regression without intercept gives a value of 0.663447 for $R^2_{uy}$. If we estimate the model by using the temperature expressed on the Kelvin scale, that is $z_i = y_i + 273.15$, we have $R^2_{uz} = 0.766925$ (we note that $R^2_{u1} = 0.768007$). Thus, in this case, by adding a sufficiently large constant to all observations of the dependent variable we obtain an artificial increase in the uncentered $R^2$. This happens because $R^2_{u1} > R^2_{uy}$ and $R^2_{uz} \to R^2_{u1}$ as $c \to +\infty$. Of course, if $R^2_{u1}$ had been less than $R^2_{uy}$, a reduction in the uncentered $R^2$ would have occurred (see the next case).
Then, we regress tmax on sun. We obtain $R^2_{uy} = 0.950690$, $R^2_{uz} = 0.845403$ and $R^2_{u1} = 0.833621$. Here there is an artificial decrease in the uncentered $R^2$. The final case regards the relationship between tmax and tmin. If we estimate the model by using the temperatures expressed on the Celsius scale, we have $R^2_{uy} = 0.953874$. If we use a model with both temperatures expressed on the Kelvin scale, that is

$$y_i + 273.15 = \beta (x_i + 273.15) + u_i,$$

we obtain $R^2_{uz} = 0.999959$, a value that, as expected, is very close to 1.

Final remarks
In the literature, it is often argued that in a linear regression model without intercept, the uncentered $R^2$ is an appropriate measure of goodness of fit. In this paper, we have shown that, when a linear regression through the origin is used, the uncentered $R^2$ varies artificially if a constant $c \in \mathbb{R}$ is added to the observations. In particular, we have considered two cases. First, the constant $c$ is added to all observations of the dependent variable. Second, the constant $c$ is added to all observations of the dependent variable and to all observations of at least one independent variable. In the first case, the uncentered $R^2$ tends to the limit value $R^2_{u1} < 1$ as the constant $c$ goes to infinity. If $R^2_{u1} > R^2_{uy}$, then by adding a sufficiently large constant to all observations of the dependent variable we obtain an artificial increase in the uncentered $R^2$. If $R^2_{u1} < R^2_{uy}$, then a reduction in the uncentered $R^2$ is obtained. In the second case, by adding a sufficiently large constant we can make $R^2_{uz}$ as close as we wish to 1. From this point of view, the uncentered $R^2$ does not seem to be an appropriate measure of goodness of fit for linear regression models without intercept. In fact, rather than measuring the fit in terms of the Euclidean distance between $y$ and $\hat{y}$, the uncentered coefficient of determination measures how close the vectors $y$ and $\hat{y}$ are in terms of their directions.
Several authors have suggested that the uncentered coefficient of determination is a more appropriate measure of goodness of fit than the centered coefficient of determination when a regression without a constant is used. However, in this paper, we have shown that the uncentered $R^2$ is not invariant under location change when regression through the origin is considered. For this reason, the use of the uncentered $R^2$ as a measure of fit also has to be considered with caution.