# Nonlinear regression without i.i.d. assumption


## Abstract

In this paper, we consider a class of nonlinear regression problems without the assumption of being independent and identically distributed. We propose a corresponding mini-max problem for nonlinear regression and give a numerical algorithm. Such an algorithm can be applied to regression and machine learning problems, and yields better results than traditional least squares and machine learning methods.

## Keywords

Nonlinear regression · Minimax · Independent and identically distributed · Least squares · Machine learning · Quadratic programming

## Abbreviations

- i.i.d.: Independent and identically distributed
- MAE: Mean absolute error
- MSE: Mean squared error

## 1 Introduction

Regression analysis models the relationship between a dependent variable denoted by *y* and one or more explanatory variables denoted by *x*:

\(y=w^{T}x+b+\varepsilon. \qquad (1)\)

Here, *ε* is a random noise. The associated noise terms \(\{\varepsilon _{i}\}_{i=1}^{m}\) are assumed to be i.i.d. (independent and identically distributed) with mean 0 and variance *σ*^{2}. The parameters *w*,*b* are estimated via the method of least squares as follows.

### **Lemma 1**

Suppose the observations \((x_{i},y_{i})_{i=1}^{m}\) follow model (1). Then the result of least squares is \(\hat {u}=A^{+}Y\), where *A* is the design matrix, \(Y=(y_{1},\cdots ,y_{m})^{T}\), and *A*^{+} is the Moore–Penrose inverse^{1} of *A*.

In the above lemma, *ε*_{1},*ε*_{2},⋯,*ε*_{m} are assumed to be i.i.d. Therefore, *y*_{1},*y*_{2},⋯,*y*_{m} are also i.i.d.
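As a quick numerical illustration of the pseudoinverse formula in Lemma 1 (a sketch with synthetic data of our own; the true parameters `w = 2`, `b = 0.5` and the noise level are illustrative assumptions, not from the paper):

```python
import numpy as np

# Toy check of the least squares formula u = A^+ Y, where A is the design
# matrix with rows (x_i, 1) and A^+ its Moore-Penrose inverse.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
eps = rng.normal(0.0, 0.1, size=100)          # i.i.d. noise with mean 0
y = 2.0 * x + 0.5 + eps                       # true w = 2, b = 0.5

A = np.column_stack([x, np.ones_like(x)])     # design matrix
w_hat, b_hat = np.linalg.pinv(A) @ y          # u = A^+ Y
print(w_hat, b_hat)
```

With i.i.d. noise the estimates land close to the true values; the next example shows how this breaks down when the noise is not identically distributed.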

When the i.i.d. assumption is not satisfied, the usual method of least squares does not work well. This is illustrated by the following example.

### **Example 1**

Denote by *N*(*μ*,*σ*^{2}) the normal distribution with mean *μ* and variance *σ*^{2}, and denote by *δ*_{c} the Dirac distribution concentrated at the point *c*. We generate sample data whose noise terms mix these distributions, so that the noise is not identically distributed, and fit the data by least squares; see Fig. 1.

We see from Fig. 1 that most of the sample data deviates from the regression line. The main reason is that the i.i.d. condition is violated.

Lin et al. (2016) suggest a genetic algorithm to solve this problem. However, such a genetic algorithm generally does not work well.

Motivated by the work of Lin et al. (2016) and Peng (2005), we consider nonlinear regression problems without the i.i.d. assumption in this paper. We propose a corresponding mini-max problem and give a numerical algorithm for solving it. Meanwhile, problem (2) in Lin’s paper can also be solved well by this algorithm. We also present experiments on least squares and machine learning problems.

## 2 Nonlinear regression without i.i.d. assumption

Nonlinear regression is a form of regression analysis in which observational data are modeled by a nonlinear function which depends on one or more explanatory variables (see, e.g., Seber and Wild (1989)).

Suppose we are given training data \((x_{i},y_{i})_{i=1}^{m}\) with *x*_{i}∈*X* and *y*_{i}∈*Y*. *X* is called the input space and *Y* is called the output (label) space. The goal of nonlinear regression is to find (learn) a function *g*^{θ}:*X*→*Y* from the hypothesis space {*g*^{λ}:*X*→*Y* | *λ*∈*Λ*} such that *g*^{θ}(*x*_{i}) is as close to *y*_{i} as possible.

To this end, one chooses a loss function *φ* such that *φ*(*g*^{θ}(*x*_{1}),*y*_{1},⋯,*g*^{θ}(*x*_{m}),*y*_{m}) attains its minimum if and only if *g*^{θ}(*x*_{i})=*y*_{i} for *i*=1,⋯,*m*.

Then the nonlinear regression problem (learning problem) is reduced to an optimization problem of minimizing *φ*.

The average loss is popular, particularly in machine learning, since it can be conveniently minimized using online algorithms, which process fewer instances during each iteration. The idea behind the average loss is to learn a function that performs equally well on each training point. However, when the i.i.d. assumption is not satisfied, the average loss method may perform poorly.

Instead, suppose the samples can be divided into *N* groups such that the samples within each group are i.i.d. We then minimize the maximum of the group-wise average losses:

\(\min \limits _{\theta }\max \limits _{1\leq j\leq N}\frac {1}{n_{j}}\sum \limits _{i=1}^{n_{j}}\left (g^{\theta }\left (x^{(j)}_{i}\right)-y^{(j)}_{i}\right)^{2}. \qquad (3)\)

Here, *n*_{j} is the number of samples in group *j*, and \(\left (x^{(j)}_{i},y^{(j)}_{i}\right)\) denotes the *i*-th sample in group *j*.

Problem (3) is a generalization of problem (2). Next, we will give a numerical algorithm which solves problem (3).
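In code, the group-wise max-mean loss can be sketched as follows (the squared-error choice and the index layout are our assumptions for illustration; the loss may be any *φ* as above):

```python
import numpy as np

def group_max_mean_loss(pred, y, groups):
    """Max over groups of the within-group mean squared error.

    `groups` is a list of index arrays; n_j = len(groups[j]).  The
    squared-error loss here is an illustrative assumption.
    """
    return max(float(np.mean((pred[idx] - y[idx]) ** 2)) for idx in groups)

pred = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 1.0, 4.0])
groups = [np.array([0, 1]), np.array([2, 3])]
print(group_max_mean_loss(pred, y, groups))  # max(0.0, 2.0) -> 2.0
```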

### **Remark 1**

Jin and Peng (2016) prove that if random variables *Z*_{1},*Z*_{2},⋯,*Z*_{k} are drawn from the maximal distribution \(M_{[\underline {\mu },\overline {\mu }]}\) and are nonlinearly independent, then the optimal unbiased estimation for \(\overline {\mu }\) is \(\max \limits _{1\leq i\leq k}Z_{i}\).

This fact, combined with the Law of Large Numbers (Theorem 19 in Jin and Peng (2016)), leads to the max-mean estimation of *μ*. We borrow this idea and use the max-mean as the loss function for the nonlinear regression problem.
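A tiny simulation of the max-mean idea (the group means 0.0/0.5/1.0 and the sample sizes are ours, purely illustrative): pooling non-identically distributed groups blurs the upper mean, while the maximum of the group means recovers it.

```python
import numpy as np

# Three groups with different means, so the pooled sample is not i.i.d.
# The pooled sample mean blends the groups; the max of the group means
# estimates the upper mean (1.0 here).
rng = np.random.default_rng(1)
groups = [rng.normal(mu, 0.1, size=500) for mu in (0.0, 0.5, 1.0)]

pooled_mean = float(np.mean(np.concatenate(groups)))
max_mean = max(float(np.mean(g)) for g in groups)
print(pooled_mean, max_mean)  # roughly 0.5 vs roughly 1.0
```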

## 3 Algorithm

Consider the continuous mini-max problem

\(\min \limits _{u\in \mathbb {R}^{n}}\max \limits _{v\in V}h(u,v). \qquad (4)\)

Here, *h* is continuous on \(\mathbb {R}^{n}\times V\) and differentiable with respect to *u*.

Problem (4) was considered theoretically by Klessig and Polak (1973) and Panin (1981). Later, Kiwiel (1987) gave a concrete algorithm for problem (4). Kiwiel’s algorithm deals with the general case in which *V* is a compact subset of \(\mathbb {R}^{d}\), and its convergence can be slow when the number of parameters is large.

In our case, *V*={1,2,⋯,*N*} is a finite set and we give a simplified and faster algorithm.

Write \(f_{j}(u)=h(u,j)\) for *j*=1,2,⋯,*N*; each *f*_{j} is differentiable. Now, we outline the iterative algorithm for the following discrete mini-max problem:

\(\min \limits _{u\in \mathbb {R}^{n}}\Phi (u),\qquad \Phi (u)=\max \limits _{1\leq j\leq N}f_{j}(u).\)

One cannot directly apply gradient descent to generate the iterates *u*_{k} (*k*=0,1,⋯) since *Φ* is nonsmooth in general. In light of this, we linearize *f*_{j} at *u*_{k} and obtain the convex approximation of *Φ* as

\(\hat {\Phi }(u)=\max \limits _{1\leq j\leq N}\left [f_{j}(u_{k})+\nabla f_{j}(u_{k})^{T}(u-u_{k})\right ].\)

We would like to take as *u*_{k+1} the point which minimizes \(\hat {\Phi }(u)\). In general, \(\hat {\Phi }\) is not strictly convex with respect to *u*, and thus it may not admit a minimum. Motivated by the alternating direction method of multipliers (ADMM; see, e.g., Boyd et al. (2010); Kellogg (1969)), we add a regularization term, and the minimization problem becomes

\(\min \limits _{u}\max \limits _{1\leq j\leq N}\left [f_{j}(u_{k})+\nabla f_{j}(u_{k})^{T}(u-u_{k})\right ]+\frac {1}{2}\|u-u_{k}\|^{2}. \qquad (5)\)

Writing *d*=*u*−*u*_{k} and introducing an auxiliary variable *z* for the maximum, the above is converted to the following form:

\(\min \limits _{d,z}\; z+\frac {1}{2}\|d\|^{2} \qquad (6)\)

\(\text {s.t.}\;\; f_{j}(u_{k})+\nabla f_{j}(u_{k})^{T}d\leq z,\quad j=1,\cdots ,N. \qquad (7)\)

Problem (6)–(7) is a semi-definite QP (quadratic programming) problem. When *n* is large, the popular QP algorithms (such as the active-set method) are time-consuming, so we turn to the dual problem.

### **Theorem 1**

The solution of problem (6)–(7) is given by \(d=-G^{T}\lambda \), where \(G=\left (\nabla f_{1}(u_{k}),\cdots ,\nabla f_{N}(u_{k})\right)^{T}\) is the Jacobian matrix and *λ* is the solution of the following QP problem:

\(\min \limits _{\lambda }\;\frac {1}{2}\lambda ^{T}GG^{T}\lambda -F^{T}\lambda \qquad (8)\)

\(\text {s.t.}\;\; \lambda \geq 0,\quad e^{T}\lambda =1, \qquad (9)\)

where \(F=(f_{1}(u_{k}),\cdots ,f_{N}(u_{k}))^{T}\) and *e*=(1,1,⋯,1)^{T}.

### *Proof*

See Appendix. □

### **Remark 2**

Problem (8)–(9) can be solved by many standard methods, such as the active-set method (see, e.g., Nocedal and Wright (2006)). The dimension of the dual problem (8)–(9) is *N* (the number of groups), which is independent of *n* (the number of parameters). Hence, the algorithm is fast and stable, especially in deep neural networks.
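As a sketch of why the *N*-dimensional dual stays cheap, the following solves a simplex-constrained dual QP of the assumed form \(\min _{\lambda \geq 0,\,e^{T}\lambda =1}\frac {1}{2}\lambda ^{T}GG^{T}\lambda -F^{T}\lambda \) by projected gradient (our own reading of (8)–(9); `project_simplex` and the toy *G*, *F* are illustrative, not the paper's solver):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {l : l >= 0, sum(l) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / k > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def solve_dual(G, F, steps=2000, lr=0.1):
    """Projected gradient for the assumed dual form
    min 0.5 l^T (G G^T) l - F^T l  s.t.  l >= 0, sum(l) = 1."""
    H = G @ G.T
    lam = np.full(G.shape[0], 1.0 / G.shape[0])
    for _ in range(steps):
        lam = project_simplex(lam - lr * (H @ lam - F))
    return lam

G = np.array([[1.0, 0.0], [0.0, 1.0]])          # toy 2 x 2 Jacobian
lam = solve_dual(G, np.array([1.0, 1.0]))
print(lam, -G.T @ lam)  # symmetric case: lam near [0.5, 0.5]
```

The iteration cost depends on *N*, not on the number of model parameters *n*, matching the remark above.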

Set *d*_{k}=−*G*^{T}*λ*. The next theorem shows that *d*_{k} is a descent direction.

### **Theorem 2**

If *d*_{k}≠0, then there exists *t*_{0}>0 such that

\(\Phi (u_{k}+td_{k})<\Phi (u_{k}),\qquad \forall \, t\in (0,t_{0}).\)

### *Proof*

See Appendix. □

For a function *F*, the directional derivative of *F* at *x* in a direction *d* is defined as

\(F^{\prime }(x;d)=\lim \limits _{t\rightarrow 0^{+}}\frac {F(x+td)-F(x)}{t}.\)

A necessary condition for *F* to attain its minimum at *x* (see Demyanov and Malozemov (1977)) is

\(F^{\prime }(x;d)\geq 0,\qquad \forall d\in \mathbb {R}^{n}.\)

Such a point *x* is called a stationary point of *F*.

Theorem 2 shows that when *d*_{k}≠0, we can always find a descent direction. The next theorem reveals that when *d*_{k}=0, *u*_{k} is a stationary point.

### **Theorem 3**

If *d*_{k}=0, then *u*_{k} is a stationary point of *Φ*, i.e.,

\(\Phi ^{\prime }(u_{k};d)\geq 0,\qquad \forall d\in \mathbb {R}^{n}.\)

### *Proof*

See Appendix. □

### **Remark 3**

When each *f*_{j} is a convex function, *Φ* is also a convex function. In this case, a stationary point of *Φ* is a global minimum point.

With *d*_{k} being the descent direction, we use line search to find the appropriate step size and update the iteration point.

**Algorithm.**

**Step 1. Initialization**

Select arbitrary \(u_{0}\in \mathbb {R}^{n}\). Set *k*=0, termination accuracy *ξ*=10^{−8}, gap tolerance *δ*=10^{−7}, and step size factor *σ*=0.5.

**Step 2. Finding Descent Direction**

Evaluate \(F=(f_{1}(u_{k}),\cdots ,f_{N}(u_{k}))^{T}\) at the current point *u*_{k}. Compute the Jacobian matrix \(G=\left (\nabla f_{1}(u_{k}),\cdots ,\nabla f_{N}(u_{k})\right)^{T}\) and solve the QP problem (8)–(9) with gap tolerance *δ* (see, e.g., Nocedal and Wright (2006)).

Take *d*_{k}=−*G*^{T}*λ*. If ∥*d*_{k}∥<*ξ*, stop. Otherwise, go to Step 3.

**Step 3. Line Search**

Find the smallest nonnegative integer *j* such that

\(\Phi \left (u_{k}+\sigma ^{j}d_{k}\right)<\Phi (u_{k}).\)

Take *α*_{k}=*σ*^{j} and set *u*_{k+1}=*u*_{k}+*α*_{k}*d*_{k}, *k*=*k*+1. Go to Step 2.
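Putting Steps 1–3 together, here is a self-contained sketch for *N*=2 groups, where the dual is one-dimensional and can be solved in closed form (the quadratics `f1`, `f2` and all names are our own toy construction; for general *N* one calls a QP solver as in Step 2):

```python
import numpy as np

# Sketch of the mini-max iteration for N = 2 groups with
# f1(u) = ||u - a||^2 and f2(u) = ||u - b||^2.  On the simplex
# lambda = (t, 1 - t) the dual QP is a quadratic in t alone,
# so t has a closed form.
a, b = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
f = lambda u: np.array([np.sum((u - a) ** 2), np.sum((u - b) ** 2)])
jac = lambda u: np.stack([2 * (u - a), 2 * (u - b)])   # Jacobian G

u = np.array([0.3, 2.0])
xi, sigma = 1e-8, 0.5                                  # as in Step 1
for _ in range(100):
    G, F = jac(u), f(u)
    H = G @ G.T
    # minimize 0.5 l^T H l - F^T l over l = (t, 1 - t), t in [0, 1]
    A2 = H[0, 0] - 2.0 * H[0, 1] + H[1, 1]
    B2 = H[0, 1] - H[1, 1] - F[0] + F[1]
    t = 0.5 if A2 == 0.0 else float(np.clip(-B2 / A2, 0.0, 1.0))
    d = -G.T @ np.array([t, 1.0 - t])                  # d = -G^T lambda
    if np.linalg.norm(d) < xi:                         # Step 2 stopping rule
        break
    alpha, Phi = 1.0, F.max()                          # Step 3 line search
    while f(u + alpha * d).max() >= Phi and alpha > 1e-12:
        alpha *= sigma
    u = u + alpha * d

print(u)  # mini-max point of (f1, f2): close to [0, 0]
```

The iterate reaches the mini-max point, where both group losses are equal and no descent direction remains.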

## 4 Experiments

### 4.1 The linear regression case

We estimate the parameters by the traditional least squares method and by the mini-max method. The *l*^{2} and *l*^{1} distances between the estimated parameters and the true parameters are used as measurements. The comparison in the *l*^{2} and *l*^{1} distances is reported below.

Comparisons of the two methods

| Method | *l*^{2} distance | *l*^{1} distance |
|---|---|---|
| Traditional method | 1.2789 | 1.2878 |
| Mini-max method | 0.1755 | 0.1848 |

Lin et al. (2016) have mentioned that the above problem can be solved by genetic algorithms. However, the genetic algorithm is heuristic and unstable, especially when the number of groups is large. In contrast, our algorithm is fast and stable, and its convergence is proved.

### 4.2 The machine learning case

We further test the proposed method by using the CelebFaces Attributes Dataset (CelebA)^{2} and implement the mini-max algorithm with a deep learning approach. The dataset CelebA has 202,599 face images, among which 13,193 (6.5%) have eyeglasses. The objective is eyeglass detection. We use a single hidden layer neural network to compare the two different methods.

We randomly choose 20,000 pictures as the training set, among which 5% have the eyeglass label. For the traditional method, the 20,000 pictures are used as a whole. For the mini-max method, we separate the 20,000 pictures into 20 groups: only 1 group contains eyeglass pictures, while the other 19 groups do not. In this way, the whole mini-batch is not i.i.d., while each subgroup is expected to be i.i.d.
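The grouping scheme can be sketched as follows (the sizes mirror the description above; the index bookkeeping is our own illustration, not the paper's data pipeline):

```python
import numpy as np

# 20,000 samples, 5% positive, split into 20 groups: all positives in
# group 0, negatives spread evenly over the other 19 groups.
n, n_groups = 20000, 20
labels = np.zeros(n, dtype=int)
labels[: n // 20] = 1                     # 1,000 "eyeglass" examples

pos = np.where(labels == 1)[0]
neg = np.where(labels == 0)[0]
groups = [pos] + np.array_split(neg, n_groups - 1)

print(len(groups), [int(labels[g].sum()) for g in groups[:3]])
```

Each group is internally homogeneous (all-positive or all-negative), so the pooled mini-batch is deliberately non-i.i.d. while each subgroup is not.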

Here, *σ* is an activation function used in deep learning, such as the sigmoid function \(\sigma (x)=\frac {1}{1+e^{-x}}\). We compare the test accuracy of the two methods. (Suppose the test set has *n* samples and *m* of them are classified correctly. Then the accuracy is defined to be *m*/*n*.)

The average accuracy of the mini-max method is 74.52%, while that of the traditional method is 41.78%. Thus, in the deep learning approach with a single hidden layer, the mini-max method helps to speed up convergence on unbalanced training data and improves accuracy as well. We also expect improvement with multi-layer deep learning approaches.

## 5 Conclusion

In this paper, we consider a class of nonlinear regression problems without the assumption of being independent and identically distributed. We propose a corresponding mini-max problem for nonlinear regression and give a numerical algorithm. Such an algorithm can be applied to regression and machine learning problems, and yields better results than traditional least squares and machine learning methods.

## 6 Appendix

### 6.1 Proof of Theorem 1

Write \(F=(f_{1}(u_{k}),\cdots ,f_{N}(u_{k}))^{T}\), \(G=\left (\nabla f_{1}(u_{k}),\cdots ,\nabla f_{N}(u_{k})\right)^{T}\), and *e*=(1,1,⋯,1)^{T}. By Lagrangian duality, problem (6)–(7) is equivalent to

\(\max \limits _{\lambda \geq 0}\min \limits _{d,z}\left \{z+\frac {1}{2}\|d\|^{2}+\sum \limits _{j=1}^{N}\lambda _{j}\left [f_{j}(u_{k})+\nabla f_{j}(u_{k})^{T}d-z\right ]\right \}.\)

The coefficient of *z* in the inner problem is 1−*λ*^{T}*e*. If 1−*λ*^{T}*e*≠0, then the inner infimum is −*∞*. Thus, we must have 1−*λ*^{T}*e*=0 when the maximum is attained. The problem is converted to

\(\max \limits _{\lambda \geq 0,\,e^{T}\lambda =1}\min \limits _{d}\left \{F^{T}\lambda +\lambda ^{T}Gd+\frac {1}{2}\|d\|^{2}\right \}.\)

The inner minimum is attained at *d*=−*G*^{T}*λ*, and the above problem is reduced to

\(\max \limits _{\lambda \geq 0,\,e^{T}\lambda =1}\left \{F^{T}\lambda -\frac {1}{2}\lambda ^{T}GG^{T}\lambda \right \},\)

which is exactly problem (8)–(9). □

### 6.2 Proof of Theorem 2

Write *u*=*u*_{k}, *d*=*d*_{k}. For 0<*t*<1, a Taylor expansion and the convexity of the maximum give

\(\Phi (u+td)=\max \limits _{1\leq j\leq N}\left [f_{j}(u)+t\nabla f_{j}(u)^{T}d\right ]+o(t)\leq (1-t)\Phi (u)+t\max \limits _{1\leq j\leq N}\left [f_{j}(u)+\nabla f_{j}(u)^{T}d\right ]+o(t).\)

Since *d* is the solution of problem (5) and *d*≠0, we have that

\(\max \limits _{1\leq j\leq N}\left [f_{j}(u)+\nabla f_{j}(u)^{T}d\right ]\leq \Phi (u)-\frac {1}{2}\|d\|^{2}.\)

Therefore, \(\Phi (u+td)\leq \Phi (u)-\frac {t}{2}\|d\|^{2}+o(t)\), and for *t*>0 small enough, we have that \(\Phi (u+td)<\Phi (u)\). □

### 6.3 Proof of Theorem 3

Write *u*=*u*_{k}. Then, *d*_{k}=0 means that *d*=0 solves problem (5), i.e., ∀*d*,

\(\max \limits _{1\leq j\leq N}\left [f_{j}(u)+\nabla f_{j}(u)^{T}d\right ]+\frac {1}{2}\|d\|^{2}\geq \Phi (u). \qquad (12)\)

When ∥*d*∥ is small enough, only the indices attaining the maximum \(\Phi (u)=\max _{j}f_{j}(u)\) matter, and we have that

\(\max \limits _{1\leq j\leq N}\left [f_{j}(u)+\nabla f_{j}(u)^{T}d\right ]=\Phi (u)+\Phi ^{\prime }(u;d).\)

For any fixed direction *d*_{1}, taking *d*=*r**d*_{1} with sufficiently small *r*>0 and noting that \(\Phi ^{\prime }(u;rd_{1})=r\,\Phi ^{\prime }(u;d_{1})\), we deduce from (12) that

\(\Phi (u)+r\,\Phi ^{\prime }(u;d_{1})+\frac {1}{2}r^{2}\|d_{1}\|^{2}\geq \Phi (u),\quad \text {i.e.,}\quad \Phi ^{\prime }(u;d_{1})\geq -\frac {1}{2}r\|d_{1}\|^{2}.\)

Letting *r*→0+, we obtain \(\Phi ^{\prime }(u;d_{1})\geq 0\). Since *d*_{1} is arbitrary, *u* is a stationary point of *Φ*, which fulfills the proof. □

## Footnotes

- 1.
For the definition and properties of the Moore–Penrose inverse, see Ben-Israel and Greville (2003).

- 2.

## Notes

### Acknowledgments

The authors would like to thank Professor Shige Peng for useful discussions. We especially thank Xuli Shen for performing the experiment in the machine learning case.

### Authors’ contributions

MX puts forward the main idea and the algorithm. QX proves the convergence of the algorithm and collects the results. Both authors read and approved the final manuscript.

### Funding

This paper is partially supported by Smale Institute.

### Ethics approval and consent to participate

Not applicable.

### Consent for publication

Not applicable.

### Competing interests

The authors declare that they have no competing interests.

## References

- Ben-Israel, A., Greville, T. N. E.: Generalized Inverses: Theory and Applications (2nd ed.). Springer, New York (2003).
- Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Found. Trends Mach. Learn. 3, 1–122 (2010).
- Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press (2004). https://doi.org/10.1017/cbo9780511804441.005.
- Demyanov, V. F., Malozemov, V. N.: Introduction to Minimax. Wiley, New York (1977).
- Jin, H., Peng, S.: Optimal Unbiased Estimation for Maximal Distribution (2016). https://arxiv.org/abs/1611.07994.
- Kellogg, R. B.: Nonlinear alternating direction algorithm. Math. Comp. 23, 23–38 (1969).
- Kendall, M. G., Stuart, A.: The Advanced Theory of Statistics, Volume 3: Design and Analysis, and Time-Series (2nd ed.). Griffin, London (1968).
- Kiwiel, K. C.: A Direct Method of Linearization for Continuous Minimax Problems. J. Optim. Theory Appl. 55, 271–287 (1987).
- Klessig, R., Polak, E.: An Adaptive Precision Gradient Method for Optimal Control. SIAM J. Control 11, 80–93 (1973).
- Legendre, A.-M.: Nouvelles méthodes pour la détermination des orbites des comètes. F. Didot, Paris (1805).
- Lin, L., Shi, Y., Wang, X., Yang, S.: *k*-sample upper expectation linear regression: modeling, identifiability, estimation and prediction. J. Stat. Plan. Infer. 170, 15–26 (2016).
- Lin, L., Dong, P., Song, Y., Zhu, L.: Upper Expectation Parametric Regression. Stat. Sin. 27, 1265–1280 (2017a).
- Lin, L., Liu, Y. X., Lin, C.: Mini-max-risk and mini-mean-risk inferences for a partially piecewise regression. Statistics 51, 745–765 (2017b).
- Nocedal, J., Wright, S. J.: Numerical Optimization (2nd ed.). Springer, New York (2006).
- Panin, V. M.: Linearization Method for Continuous Min-max Problems. Kibernetika 2, 75–78 (1981).
- Peng, S.: Nonlinear expectations and nonlinear Markov chains. Chin. Ann. Math. 26B(2), 159–184 (2005).
- Seber, G. A. F., Wild, C. J.: Nonlinear Regression. Wiley, New York (1989).

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.