Pivot Property in Weighted Least Squares Regression Based on Single Repeated Observations

How genuine repeated observations influence the result of linear regression is an interesting topic in regression analysis. In this paper, we discuss the pivot behavior of weighted linear regression under a certain data pattern and give an explanation for this behavior.


Introduction
Repetition of identical observations is an important phenomenon in a variety of domains such as clinical trials, categorical surveys and traffic data [1,2], and how genuine repeated observations influence the result of linear regression is an interesting topic in regression analysis. As [3] stated, 'It is important to understand that repeated runs must be genuine repeats and not just repetitions of the same reading'. Data with genuine repeats are easily found in categorical data analysis, and obtaining estimators for unknown parameters within the framework of statistics always comes down to an optimization problem solved by the least squares method; that is the origin of our study in this paper.

In [4], the author investigates how a single repeating data point (x_i, y_i) affects ordinary least squares regression, and the result shows that the regression lines pivot at a certain point under the condition ∑_{j≠i} x_j ≠ (n − 1)x_i, where the pivot point P has coordinates (x_p, y_p) given by (1). The study background of [4] is predicting the results of sports games from match information, so both the explanatory variable and the response variable are precise positive integers (such as ranks or numbers of games), where repeated observations always occur. In [4], we find that if a certain data point is repeated several times and a simple linear regression is fitted to the updated dataset each time, all the simple linear regression lines intersect at some pivot point P (Fig. 1).
However, once the repeated data point (for simplicity, denoted (x_1, y_1)) is not identical with P, the sequence of squared residuals at the repeated point (x_1, y_1) converges to 0, while the squared residuals at the non-repeated data points, ε̂²_{k,2}, ε̂²_{k,3}, …, ε̂²_{k,n}, converge to fixed non-zero constants as the number of repetitions k increases. This produces heteroscedasticity, so the assumptions of the Gauss-Markov theorem fail under hypothesis testing and ordinary least squares no longer yields the best linear unbiased estimator (BLUE) of the coefficients.
For example, take the dataset from [4]: (1, 6), (2, 2), (5, 1), (7, 2). If we repeat (2, 2) an additional 19 times and run an OLS regression, the simple linear model is no longer valid according to either the t-test or the F-test, and the heteroscedasticity can be detected with the Breusch-Pagan test or White's general test. The regression software output (truncated) reads:

    Linear regression model:
        y ~ 1 + x1

    Estimated Coefficients: …

We should mention that as the sample size increases, there can be a trade-off between the invalidity of the model (through the p value) and the sample size [5]. For a more precise test, one can use alternative exact power and sample size calculations for model validation in linear regression analysis [6,7]. Meanwhile, returning to real-world cases, different observations have different reliability; for example, we cannot treat the result of a friendly match and a final with equal importance. Thus we should weight the various observations by their respective importance, which inspires us to investigate whether the pivot behavior exists in more robust linear models. In this paper, we mainly focus on the pivot behavior under the weighted least squares (WLS) method.
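This residual pattern can be checked numerically. The sketch below (NumPy, assuming equal weights, i.e. plain OLS, and the dataset above) fits the line before and after repeating (2, 2) nineteen extra times and compares the residual at the repeated point:

```python
import numpy as np

# Data from [4]; (2, 2) is the point to be repeated.
base = np.array([(1.0, 6.0), (2.0, 2.0), (5.0, 1.0), (7.0, 2.0)])

def ols_residual_at(data, point):
    """Fit y = m*x + b by OLS and return the residual at `point`."""
    m, b = np.polyfit(data[:, 0], data[:, 1], 1)
    return point[1] - (m * point[0] + b)

r0 = ols_residual_at(base, (2.0, 2.0))

# Repeat (2, 2) an additional 19 times and refit.
repeated = np.vstack([base] + [[(2.0, 2.0)]] * 19)
r19 = ols_residual_at(repeated, (2.0, 2.0))

# The residual at the repeated point shrinks toward 0, while the other
# residuals stabilize at non-zero values -- a heteroscedastic pattern.
print(abs(r0), abs(r19))
```

With these numbers the residual at (2, 2) drops by roughly an order of magnitude, which is the shrinking-variance pattern that the Breusch-Pagan or White test would flag on the full residual vector.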

Regression Lines and Pivot Property Under Simple Linear Regression Using WLS Method
To be more general, suppose we start with the points (x_1, y_1), (x_2, y_2), …, (x_n, y_n) and consistently use (x_1, y_1) as the repeated point R in what follows. Repeating (x_1, y_1) an additional k times gives the design matrix X ∈ ℝ^{(n+k)×2}, whose rows are (1, x_i) with the row (1, x_1) appearing k + 1 times. Assume we have prior information about the weight matrix W ∈ ℝ^{(n+k)×(n+k)}, where w_{1,1}, …, w_{1,k+1} ∈ ℝ⁺ are the weights for the k + 1 copies of the repeated point (x_1, y_1) and w_2, …, w_n ∈ ℝ⁺ are the weights for the non-repeated points. We are not concerned in this article with determining the optimal weight scheme for each observation; that depends on personal judgement about the precision of measurement, or on a specific weight scheme that makes the WLS estimators of the unknown parameters significant. Meanwhile, note that the weights w_{1,1}, …, w_{1,k+1} of the k + 1 repeated data points (x_1, y_1) need not be identical, and their sum need not add up to 1 under our parameterization.
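A concrete sketch of this setup follows; the data values and weights below are illustrative choices, not taken from the text. It assembles the design matrix, the diagonal weight matrix, and solves the WLS normal equations:

```python
import numpy as np

k = 3  # number of extra repetitions of the repeated point R = (x1, y1)
# Illustrative values: R = (1, 6), followed by (2, 2), (5, 1), (7, 2).
x = np.array([1.0] * (k + 1) + [2.0, 5.0, 7.0])
y = np.array([6.0] * (k + 1) + [2.0, 1.0, 2.0])

# Design matrix X in R^{(n+k) x 2}: an intercept column and the x column;
# the row (1, x1) appears k + 1 times.
X = np.column_stack([np.ones_like(x), x])

# Diagonal weight matrix W; the k + 1 weights of R need not be identical,
# and the weights need not sum to 1.
w = np.array([0.5, 0.25, 0.125, 0.0625, 2.0, 3.0, 4.0])
W = np.diag(w)

# WLS estimator: solve the normal equations (X^T W X) beta = X^T W y.
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
b_hat, m_hat = beta  # intercept and slope
```

The solve step is exactly the normal equation referred to as (7) below, written in matrix form.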

Suppose the model under consideration is y = m_{k+1} x + b_{k+1} + ε, where the intercept b_{k+1} is the value of E(Y|X = x) when x equals zero and the slope m_{k+1} is the rate of change in E(Y|X = x) for a unit change in X, both given under k + 1 repetitions of the data point (x_1, y_1). The weighted mean point (x̄_{w;k}, ȳ_{w;k}) reduces to (x̄, ȳ) when we assign equal weight to each observation, i.e. w_{1,1} = ⋯ = w_{1,k+1} = w_2 = ⋯ = w_n, which accords with the fact that the ordinary least squares regression line must go through the mean point of the dataset. Meanwhile, the limiting position of x̄_{w;k} and ȳ_{w;k} as k → +∞ depends on the convergence of the weight scheme; it approaches not the repeated point but a new point R_skew. Note that the slope of the line segment connecting M_k and M_{k+1} is given by (15). If (15) stays invariant for different choices of k, we can conclude that the weighted average points of the dataset are collinear, since the quotient is independent of k. In fact, this relies mainly on the weight scheme for the updated dataset. For example, if we assume that a newly added repeated point (x_1, y_1) does not influence the weights w_2, …, w_n of the non-repeated points, (15) is independent of k. In the contrary case, if we assume that the newly added repeated point increases or decreases the original weights of the non-repeated points in a linear pattern, for example w_i^{(k)} = w_i + c_i k, the collinearity of the weighted average points M_0, M_1, …, M_k can disappear for some choices of c_i.
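The collinearity of the weighted average points M_0, M_1, … can be checked numerically. The sketch below assumes a hypothetical repeated point R = (1, 6), fixed weights for the non-repeated points, and arbitrary positive weights for the successive copies of R:

```python
import numpy as np

# Hypothetical setup: R = (1, 6) is the repeated point; the non-repeated
# points keep fixed weights w2, w3, w4, while each new copy of R brings
# an arbitrary positive weight of its own.
x1, y1 = 1.0, 6.0
xs = np.array([2.0, 5.0, 7.0])
ys = np.array([2.0, 1.0, 2.0])
w_fixed = np.array([2.0, 3.0, 4.0])   # w2..wn, unaffected by k
w_rep = [1.0, 0.7, 0.5, 0.3, 0.2]     # weights of successive copies of R

def weighted_mean(k):
    """Weighted average point M_k after k extra repetitions of R."""
    s = sum(w_rep[:k + 1])            # total weight sitting on R
    tw = s + w_fixed.sum()
    return np.array([(s * x1 + w_fixed @ xs) / tw,
                     (s * y1 + w_fixed @ ys) / tw])

M = [weighted_mean(k) for k in range(5)]

def det2(u, v):
    return u[0] * v[1] - u[1] * v[0]

# Each M_k - M_0 is parallel to M_1 - M_0, so M_0, M_1, ... are collinear.
cross = [det2(M[k] - M[0], M[1] - M[0]) for k in range(2, 5)]
```

The collinearity here is exact: each M_k is a convex combination of R and the weighted mean of the non-repeated points, so all M_k lie on the segment joining those two points.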

Finding Pivot Point
Enlightened by the method used above, we apply it to the first component of the normal equation (7) to find the pivot behavior. Solving (16) after changing coordinates from the Cartesian system to the R-centric system yields (17). From (17) we clearly see that for any choice of k, all regression lines intersect at the point P under the R-centric coordinate system with a proper weight scheme, or equivalently under the Cartesian coordinate system.
We should notice that the regression lines pivot at P under two prerequisite conditions.

1. If ∑_{i=2}^{n} w_i ξ_i = 0 (where ξ_i = x_i − x_1 denotes the R-centric coordinate), (7) reduces to a family of parallel regression lines with an identical slope but different intercepts, since the intercept b is not free of k. The figure below depicts the case where ∑_{i=2}^{n} w_i ξ_i = 0 holds with equal weights for the repeated and non-repeated data points. We amend the dataset to (6, 6), (2, 2), (5, 1), (7, 2); if we repeat the third data point (5, 1), which satisfies the condition 6 + 2 + 7 = 5 × (4 − 1), we obtain a set of parallel regression lines without even a single intersection between any two of them (Fig. 2).
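The parallel-lines case can be reproduced with the amended dataset; a minimal sketch assuming equal weights (so plain OLS):

```python
import numpy as np

base = [(6.0, 6.0), (2.0, 2.0), (5.0, 1.0), (7.0, 2.0)]

def fit_with_repeats(k):
    # Repeat the third data point (5, 1) k extra times and fit OLS.
    pts = np.array(base + [(5.0, 1.0)] * k)
    return np.polyfit(pts[:, 0], pts[:, 1], 1)  # (slope, intercept)

lines = [fit_with_repeats(k) for k in (0, 5, 20)]
slopes = [line[0] for line in lines]
intercepts = [line[1] for line in lines]
# Because x = 5 equals the mean of the other x-values (6 + 2 + 7 = 5 * 3),
# every copy of (5, 1) contributes nothing to Sxx or Sxy: the slope is
# unchanged while the intercept drifts -- parallel lines, no pivot point.
```

The design choice here is deliberate: the repeated point sits exactly at the x-mean of the remaining data, which is the degenerate condition of item 1.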
2. Still, the weight scheme plays a latent but vital role in determining whether the pivot point exists. Actually, even if a newly added repeated point affects the weights of the non-repeated data points, we can always find pivot behavior provided that (19) stays invariant. For example, if w_i ∝ k or w_i = c_i / k for i = 2, …, n, we still find a pivot point for the regression lines, as given by (20).

Further About the Pivot Point
Let us trace back to the origin of the pivot point. To simplify the calculation, we work in R-centric coordinates. The normal equation under the R-centric coordinate system is given below. If the left-hand-side matrix X^T W X is non-singular, we can solve this normal equation for a specific number of repetitions k. In order to control the variation between weights, we assume that a newly added repeated point does not affect the weights of the existing n + k − 1 data points, i.e. w_{1,1}, …, w_{1,k}, w_2, …, w_n stay the same when w_{1,k+1} is added.

Results Under Certain Weight Schemes
Annals of Data Science (2020) 7(2):291-306

The following are the results of applying different weight schemes to the dataset. As in the condition above, we can also use the coordinates of R_skew (with (13) converted to the R-centric coordinate system first) and the pivot point P to calculate the slope of the limiting position of the regression line.
The figure below shows the case where we assign weight w_{1,k} = 1/2^k to the k-th newly added copy of (x_1, y_1) (repeated 100 times in total), while the remaining non-repeated points (x_2, y_2), (x_3, y_3), (x_4, y_4) keep the invariant weights w_2 = 2, w_3 = 3, w_4 = 4 respectively. We can still find a pivot point for the regression lines, but they no longer approach (x_1, y_1) no matter how many repetitions are added, in contrast to Fig. 1, where (x_1, y_1) is repeated only 20 times (Fig. 3).
From the figure, we notice that the limiting position of the regression line does not pass through the repeated point R as k → +∞ (equivalently, the limiting line does not pass through the origin (0, 0) in the R-centric coordinate system), because the total weight of the repeated point converges to a finite constant c ∈ ℝ as k → +∞.
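The behavior in Fig. 3 can be reproduced with a small sketch. It assumes decaying weights 1/2^j for the copies of a hypothetical repeated point R = (1, 6) and fixed weights 2, 3, 4 for the other points; the pivot point then follows from the first R-centric normal equation m·A + b·B = C, with A = ∑_{i≥2} w_i ξ_i², B = ∑_{i≥2} w_i ξ_i, C = ∑_{i≥2} w_i ξ_i η_i, which does not involve k under this weight scheme:

```python
import numpy as np

# Hypothetical repeated point R = (1, 6); the j-th copy of R gets weight
# 1/2**j, while the non-repeated points keep w2 = 2, w3 = 3, w4 = 4.
x1, y1 = 1.0, 6.0
xs = np.array([2.0, 5.0, 7.0])
ys = np.array([2.0, 1.0, 2.0])
w_fixed = np.array([2.0, 3.0, 4.0])

def wls_line(k):
    """WLS fit of y = m*x + b with k extra copies of R."""
    x = np.concatenate([[x1] * (k + 1), xs])
    y = np.concatenate([[y1] * (k + 1), ys])
    w = np.concatenate([[0.5 ** j for j in range(k + 1)], w_fixed])
    X = np.column_stack([x, np.ones_like(x)])
    return np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))  # (m, b)

lines = [wls_line(k) for k in (0, 3, 10, 50)]

# Pivot point: the repeated copies sit at the R-centric origin, so the
# first normal equation m*A + b*B = C is free of k, and every line
# passes through (x1 + A/B, y1 + C/B).
xi, eta = xs - x1, ys - y1
A = (w_fixed * xi ** 2).sum()
B = (w_fixed * xi).sum()
C = (w_fixed * xi * eta).sum()
xp, yp = x1 + A / B, y1 + C / B
```

Since the repeated-point weights sum to a finite constant (a geometric series), even the k = 50 line stays well away from R itself, matching the remark above.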

Explanations About Pivot Point
We denote by l: y = mx + b the limiting position of these regression lines. We will show that for any given number of repetitions k, the line l_k: y = m_k x + b_k is a weighted average of l_0 and the limiting line l with weight w_k. Having obtained the expressions for m_k and b_k from (22) and for m and b from (25), we find that each l_k is a weighted average of l_0 and l. Hence the intersection of l_0 and l must lie on each weighted line between l_0 and l, which is exactly the pivot point we found.
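Written out, the weighted-average relation makes the pivot immediate: if (x_p, y_p) denotes the intersection of l_0 and l, then for every k (a one-line sketch of the concluding step, using the mixing weight w_k from above):

```latex
l_k(x) = w_k\, l_0(x) + (1 - w_k)\, l(x)
\quad\Longrightarrow\quad
l_k(x_p) = w_k\, y_p + (1 - w_k)\, y_p = y_p ,
```

so each l_k passes through (x_p, y_p) regardless of k.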

The Existence of Pivot Hyperplane in Higher Dimensional Cases
In this section, we consider pivot behavior in the multiple regression case (the number of regressors is m, where m ∈ ℕ⁺). Suppose we start with the n unique observations (x_11, x_12, …, x_1m, y_1), (x_21, x_22, …, x_2m, y_2), ..., (x_n1, x_n2, …, x_nm, y_n) and consistently use (x_11, x_12, …, x_1m, y_1) as the repeated observation R for the updated dataset. Repeating it an additional k times gives the design matrix X_k ∈ ℝ^{(n+k)×(m+1)}, and we have prior information about the weight matrix W_k ∈ ℝ^{(n+k)×(n+k)}, where w_{1,1}, …, w_{1,k+1} ∈ ℝ⁺ are the weights of the k + 1 repeated observations (x_11, x_12, …, x_1m, y_1) and w_2, …, w_n ∈ ℝ⁺ are the weights of the non-repeated points. As discussed in Sect. 2, the weights w_{1,1}, …, w_{1,k+1} of the k + 1 repeated observations need not be identical, and their sum ∑_{i=1}^{k+1} w_{1,i} + ∑_{i=2}^{n} w_i need not add up to 1 under our parameterization.
Suppose the sample regression model under consideration can be written in the form Y_k = X_k β_k + ε_k, where in general:

1. Y_k is an (n + k) × 1 vector of observations;
2. X_k is the (n + k) × (m + 1) matrix of known form mentioned above;
3. β_k is an (m + 1) × 1 vector of parameters, where β_{k,i}, i = 1, 2, …, m, are the coefficients of the regressors x_i and b_m is the constant term;
4. ε_k is an (n + k) × 1 vector of errors.
Minimizing the residual sum of squares by the weighted least squares method yields the normal equations. Using the same method as in Sect. 2.1, normal equation (36) can be expressed under the R-centric coordinate system. From the first m equations in (37), we notice that the WLS multiple regression hyperplanes contain m specific points in the (m + 1)-dimensional space, regardless of the choice of the number of repetitions k, under the R-centric coordinate system.
Once the R-centric coordinate matrix has full rank, i.e.
we can conclude that there exists a pivot hyperplane of dimension m − 1 in the (m + 1)-dimensional variable space (a pivot point for m = 1, a pivot line for m = 2), which is consistent with our conclusion in Sect. 2.1.

Simulation Study of Pivot Behavior
In this section, we illustrate the pivot behavior for three-dimensional data under the equal weight scheme. Suppose we want to fit a plane y = k_1 x_1 + k_2 x_2 + b through the data points. More generally, a point with Cartesian coordinates (x_i1, x_i2, y_i) has R-centric coordinates ξ_i1 = x_i1 − x_11, ξ_i2 = x_i2 − x_12, η_i = y_i − y_1. Fitting the plane η = k_1 ξ_1 + k_2 ξ_2 + b to the data is equivalent to minimizing the corresponding residual sum of squares, with the associated normal equation. By comparing the first two equations of the matrix equation above, and provided ∑_{i=2}^{n} ξ_i1 ≠ 0 and ∑_{i=2}^{n} ξ_i2 ≠ 0, we can divide to find the corresponding points on the plane η = k_1 ξ_1 + k_2 ξ_2 + b, namely (44). Notice that (44) is free of the number of repetitions k, which means that no matter how many times we repeat, the fitted plane must contain the points (44). If the points in (44) are not identical, we can easily find the pivot line through (44), given by the symmetric equations of the line, and the graph below provides an illustration of pivot behavior in the three-dimensional case (Fig. 4):
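The simulation can be sketched as follows. The data points are hypothetical illustrative choices, the fit uses equal weights, and the two pivot points are computed in the ratio form of (44), i.e. as quotients of sums over the non-repeated points in R-centric coordinates:

```python
import numpy as np

# Hypothetical data: R = (1, 1, 2) is the repeated observation; the other
# points are arbitrary illustrative choices. Equal weights throughout.
R = np.array([1.0, 1.0, 2.0])
others = np.array([(2.0, 3.0, 5.0), (4.0, 1.0, 3.0),
                   (3.0, 5.0, 8.0), (5.0, 2.0, 4.0)])

def fit_plane(k):
    """LS fit of eta = k1*xi1 + k2*xi2 + b in R-centric coordinates."""
    pts = np.vstack([np.tile(R, (k + 1, 1)), others]) - R
    X = np.column_stack([pts[:, 0], pts[:, 1], np.ones(len(pts))])
    beta, *_ = np.linalg.lstsq(X, pts[:, 2], rcond=None)
    return beta  # (k1, k2, b)

planes = [fit_plane(k) for k in (0, 5, 20)]

# The two points of (44): ratios of sums over the non-repeated points
# (the repeated copies sit at the R-centric origin and drop out).
xi = others - R
pivots = []
for i in (0, 1):
    s = xi[:, i].sum()
    pivots.append(np.array([(xi[:, i] * xi[:, 0]).sum(),
                            (xi[:, i] * xi[:, 1]).sum(),
                            (xi[:, i] * xi[:, 2]).sum()]) / s)
# Every fitted plane contains both points, hence the pivot line through them.
```

Because the repeated copies contribute nothing to the first two normal equations, every fitted plane satisfies the same two linear constraints, so all planes share the line through the two points of (44).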

Conclusion
In this paper we have studied the pivot behavior in regression under the weighted least squares method and given a plausible explanation for it. As we have seen, the weighting scheme plays a vital role throughout the whole study. The dependency between the weight scheme and the pivot behavior will be a topic of our future research; [8] could be a helpful tool for determining the corresponding weights of observations, and the efficiency loss of the regression estimators under repeated observations could be estimated by simulation using the Monte Carlo method [9]. Meanwhile, the explanation of the pivot behavior could be applied in different scientific areas to give further explanations of the impact of repeated data points.