
Fitting Small Piece-Wise Linear Neural Network Models to Interpolate Data Sets

Advances in Data Science

Part of the book series: Association for Women in Mathematics Series (AWMS, volume 26)


Abstract

The simplest neural network models for functions are compositions of layer functions, where each layer function applies a nonlinear activation function σ, which is not a polynomial, component-wise to an affine linear function WX + B. The rows of W are referred to as ‘weights’ and B is the ‘bias vector’. The Universal Approximation Theorem for Neural Networks implies that any continuous function on a compact set can be approximated arbitrarily precisely using this class of models. This rich class of functions is proving to be remarkably useful and underpins the explosion of artificial intelligence applications. Neural network model architectures typically use compositions of many layer functions defined in terms of a huge number of parameters. The parameters are typically fit by minimizing a loss function using gradient descent. Mathematically, one is seeking a representation, computed by the composition of all but the final layer function, that accurately predicts the function values using the final layer function (essentially an affine linear function). However, the use of many layers in combination with gradient descent results in models which are not easy to understand and visualize. This makes it difficult for mathematicians and engineers who are interested in learning more about neural networks and their mathematical underpinnings to acquire an initial concrete understanding. In this paper we focus on explicitly demonstrating the universal approximation property of neural network models for the most commonly used activation function, the Relu function. Relu models provide closed-form expressions for a huge class of piece-wise linear functions, a class of functions not typically studied. We exploit the particular nature of the Relu activation function to define three small piece-wise linear neural network models which interpolate a given finite set of example training and label data while having a minimal number of linear pieces. We avoid the use of gradient descent to fit the models by proving, for these simple models, that the interpolation equations for the top layer weight matrix W and bias B have a simple closed-form algebraic solution, provided the first layer weights satisfy a mild generic property and the biases are defined in a particular data-dependent manner. We also prove that if the weights are further restricted to be norm one and to minimize a sequential variation property, the piece-wise linear models avoid over-fitting in the sense that they have a minimal number of pieces. We prove that sequential variation is constant on a finite number of sectors and hence easy to minimize and sample. The number of sectors is determined by the number of data points, not the dimension, avoiding the ‘curse of dimensionality’. The minimum sequential variation distribution characterizes the distribution parameters of all of the simplest models, enabling computation of average models. This is not possible when gradient descent is used. One of our models adds an initial layer which uses the above concepts to construct a small binary tree representation, representing the training data in binary coordinates. The models can be simply computed using elementary mathematical operations for small and possibly large data sets. The concepts and models are illustrated by graphing results on small two-dimensional data sets. We hope that the example models will provide some concrete insights into the Relu neural network model for representing functions. No knowledge of neural network models or deep learning methods is assumed.
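For readers wanting a concrete starting point, the following minimal sketch (notation and function names are mine, not the paper's) illustrates the building block described above: a layer function applying the Relu activation component-wise to an affine map WX + B, composed with a final affine layer to give a continuous piece-wise linear model.

```python
import numpy as np

def relu_layer(W, B, X):
    """Apply one layer: component-wise Relu of the affine map W @ x + B.
    X holds one data point per column."""
    return np.maximum(W @ X + B[:, None], 0.0)

def two_layer_model(W, B, w_top, b_top, X):
    """A two-layer Relu network: one Relu layer followed by an affine layer.
    The result is a continuous piece-wise linear function of x."""
    H = relu_layer(W, B, X)          # hidden representation
    return w_top @ H + b_top         # final affine ('top') layer

# Tiny example: one-dimensional input, three hidden units.
X = np.linspace(-1.0, 2.0, 7)[None, :]   # 1 x 7 matrix of inputs
W = np.array([[1.0], [1.0], [1.0]])      # hidden weights (rows of W)
B = np.array([0.0, -0.5, -1.0])          # hidden biases
w_top = np.array([1.0, -2.0, 1.0])       # top-layer weights
print(two_layer_model(W, B, w_top, 0.0, X))
```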


References

  1. Arora, S., Du, S., Hu, W., Li, Z., Wang, R.: Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In: ICML (2019)

  2. Arora, S., Du, S., Hu, W., Salakhutdinov, R., Wang, R.: On exact computation with an infinitely wide neural net. In: NeurIPS (2019)

  3. Bach, F.: Breaking the curse of dimensionality with convex neural networks. J. Mach. Learn. Res. 18, 1–53 (2017)

  4. Balestriero, R., Baraniuk, R.: A spline theory of deep networks. In: ICML (2018)

  5. Balestriero, R., Baraniuk, R.: Mad Max: affine spline insights into deep learning. arXiv:1805.06576

  6. Bengio, Y., Le Roux, N., Vincent, P., Delalleau, O., Marcotte, P.: Convex neural networks. In: NIPS (2005)

  7. Chizat, L., Bach, F.: On the global convergence of gradient descent for over-parameterized models using optimal transport. In: NeurIPS (2018)

  8. Cooper, Y.: The loss landscape of overparametrized neural networks. https://arxiv.org/abs/1804.10200

  9. Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2, 303–314 (1989)

  10. Du, S., Zhai, X., Poczos, B., Singh, A.: Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054 (2018)

  11. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Wadsworth (1984)

  12. Guilhoto, L.F.: An overview of artificial neural networks for mathematicians. http://math.uchicago.edu/~may/REU2018/REUPapers/Guilhoto.pdf

  13. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. The MIT Press (2016)

  14. Haeffele, B.D., Vidal, R.: Global optimality in neural network training. In: CVPR (2017)

  15. Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural Networks 2, 359–366 (1989)

  16. Jacot, A., Gabriel, F., Hongler, C.: Neural tangent kernel: convergence and generalization in neural networks. arXiv preprint arXiv:1806.07572 (2018)

  17. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012)

  18. Kunin, D., Bloom, J., Goeva, A., Seed, C.: Loss landscapes of regularized linear autoencoders. PMLR 97, 3560–3569 (2019). http://proceedings.mlr.press/v97/kunin19a.html

  19. Klusowski, J.: Sparse learning with CART. In: NeurIPS (2020)

  20. Le, Q.: A tutorial on deep learning part 1: nonlinear classifiers and the backpropagation algorithm. http://ai.stanford.edu/~quocle/tutorial1.pdf

  21. Leshno, M., Lin, V.Y., Pinkus, A., Schocken, S.: Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks 6, 861–867 (1993)

  22. Lin, J., Zhong, C., Hu, D., Rudin, C., Seltzer, M.: Generalized and scalable optimal sparse decision trees. In: ICML (2020)

  23. Clark, K., Khandelwal, U., Levy, O., Manning, C.D.: What does BERT look at? An analysis of BERT’s attention. In: ACL (2019)

  24. Mei, S., Montanari, A., Nguyen, P.M.: A mean field view of the landscape of two-layers neural networks. arXiv preprint arXiv:1804.06561 (2018)

  25. Minsky, M., Papert, S.: Perceptrons: An Introduction to Computational Geometry. MIT Press (1969)

  26. Needell, D., Ward, R.: Stable image reconstruction using total variation minimization. SIAM J. Imaging Sci. 6(2), 1035–1058 (2013)

  27. Ongie, G., Willett, R., Soudry, D., Srebro, N.: A function space view of bounded norm infinite width ReLU nets: the multivariate case. In: ICLR (2020). arXiv:1910.01635

  28. Senior, A.W., Evans, R., Jumper, J., et al.: Improved protein structure prediction using potentials from deep learning. Nature 577, 706–710 (2020)

  29. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Fan, H., Sifre, L., van den Driessche, G., Graepel, T., Hassabis, D.: Mastering the game of Go without human knowledge. Nature 550(7676), 354–359 (2017)

  30. Vidal, R.: Mathematics of deep learning. http://cis.jhu.edu/~rvidal/talks/learning/Tutorial-Math-Deep-Learning-2018.pdf

  31. Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning requires rethinking generalization. In: ICLR (2017)


Acknowledgements

The author thanks the National Science Foundation for support under grant number 1934924 to Rutgers University, ICERM for hosting and supporting the July 2019 WiSDM Research Collaboration Workshop in Mathematics and Data Science, and the National Science Foundation for the 5-year AWM ADVANCE grant, September 1, 2015 through August 31, 2020.

Author information

Corresponding author

Correspondence to Linda Ness.


Appendix: Results on Example 2D Data Sets

In this section we describe four small two-dimensional data sets: the Xor data set, a Generalized Xor data set, a small synthetic movie ratings data set, and a cluster data set. We then describe their sequential variation functions on the unit circle and models for each of the data sets computed from a sample of direction sets. The descriptions reference figures placed in section “Result Figures”. The captions of the figures summarize the key points they illustrate.

1.1 Description of Sequential Variation Results

In the figures illustrating sequential variation, the intervals where sequential variation is minimized are indicated by color-coding. The points on the unit circle corresponding to non-generic (NG) directions, which form the boundaries of the intervals where the sequential variation assumes constant values, are also shown and colored in black.

The Xor data set is shown in Fig. 1. Its generic directions, colored by their sequential variation value, are shown in Fig. 2. The four non-generic directions for D are color-coded in black. All generic directions have the same sequential variation value of two, so no distinguished directions are detected. Sequential variation is not well-defined for the non-generic directions.
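The formal definition of sequential variation is given in the body of the chapter; a plausible reading consistent with this appendix (an assumption on my part, with illustrative function names) is that, for a unit direction, one sorts the data points by their projections onto that direction and counts how many times the label changes along the sorted sequence, non-generic directions being those for which two points share a projection value. The sketch below applies this reading to the four-point Xor data set and reproduces the constant value two reported for generic directions.

```python
import numpy as np

def sequential_variation(points, labels, u, tol=1e-12):
    """Sort the points by their projection onto the unit direction u and
    count label changes along the sorted sequence.  Returns None for
    non-generic directions (two points share a projection value)."""
    proj = points @ u
    order = np.argsort(proj)
    if np.any(np.diff(proj[order]) < tol):
        return None                      # non-generic direction
    sorted_labels = labels[order]
    return int(np.sum(sorted_labels[1:] != sorted_labels[:-1]))

# Xor data set: the label is the xor of the two coordinates.
D = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

for theta in np.linspace(0.0, np.pi, 12, endpoint=False):
    u = np.array([np.cos(theta), np.sin(theta)])
    print(round(theta, 2), sequential_variation(D, y, u))
```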

The generalized Xor data set on an eight by eight grid and its color-coded values are shown in Fig. 6. Its generic directions, color-coded by their sequential variation values, are shown in Fig. 7. This is a challenging data set because the function values of a point cannot be predicted from the values of its nearest neighbors. The non-generic directions are color-coded in black. Sequential variation is not constant for this data set. There are four disjoint intervals of minimum SV directions, colored in dark blue (two symmetric pairs, related by reflection through the origin).

A small synthetic movie ratings data set is shown in Fig. 12. This data set was obtained from the deep learning tutorial [20]. The data points are two people’s ratings for twelve movies. A third person’s Like and Dislike values (blue and red, respectively) are shown for eleven of the twelve movie data points. The remaining movie data point is shown as a large black diamond, to indicate that its like/dislike value needs to be predicted by a model. From the graph we see that there is a whole swath of lines that separate the liked movies from the disliked movies. In fact, this is corroborated by the single pair of intervals where the sequential variation is minimized. It also appears that the test point might be closer to the liked movies. We can use this observation to sanity check the model predictions.

A clustered data set and its color-coded values are shown in Fig. 17. The color-coding is Red = 1, Green = 2, Blue = 3, Magenta = 4. Two clusters are green (value 2); there is only one cluster for each of the other colors (values). For this data set there are four very small direction intervals (two pairs) where sequential variation is minimized, and many non-generic points. These are the small intervals of directions along which all of the projected clusters are separated. Figure 18 shows the four small color-coded minimum SV direction intervals (two pairs of opposites) and shows the non-generic points in black.

However, the geometry of the clustered data set indicates that it is much easier to find lines that separate the cluster(s) with one value from all of the others. This motivates computation of the sequential variation functions for the Boolean functions y_v = 1_v(y(D)) associated with each value v. The cluster data set function y has four values 1, 2, 3, and 4. The intervals where sequential variation is minimized for the functions y_1, y_2, y_3, and y_4 are shown in Figs. 19, 20, 21 and 22. These figures show that, for the cluster data set, the minimum SV(y_v) intervals are larger for each of the individual values v than for SV(y). Projections of the data set on an SV-minimizing direction for SV(y_v) optimally separate the data points with value v from the other data points.
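Under the same assumed reading, the per-value Boolean indicators can be formed directly from the labels and passed to the sequential_variation sketch above; per_value_sv is an illustrative name, not the paper's.

```python
def per_value_sv(points, labels, u):
    """Sequential variation of each Boolean indicator y_v along direction u,
    reusing the sequential_variation sketch above."""
    return {v: sequential_variation(points, (labels == v).astype(int), u)
            for v in np.unique(labels)}
```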

1.2 Description of Model Results

The models for each data set are computed on a sample of finely spaced grid points from a rectangle containing each data set and color-coded by model value. The given training data set is graphed as black points. In some cases, a single test value is graphed as a large diamond data point to enable visual comparison of the consistency of the values of the different models at one data point. All models interpolate the given data set. To reduce the number of models graphed, and to give a feeling for the different geometry of the models, the average of each type of model over the direction set samples is often shown.
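Figures of this kind can be reproduced roughly as follows; model stands for any callable interpolating model (a hypothetical stand-in, not code from the paper), and matplotlib is used only for the color-coded display.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_model_values(model, D, margin=0.5, n=200):
    """Color-code the values of an interpolating model on a fine grid
    covering a rectangle around the data set D (points overlaid in black)."""
    lo = D.min(axis=0) - margin
    hi = D.max(axis=0) + margin
    xs, ys = np.meshgrid(np.linspace(lo[0], hi[0], n),
                         np.linspace(lo[1], hi[1], n))
    grid = np.column_stack([xs.ravel(), ys.ravel()])
    values = model(grid).reshape(xs.shape)
    plt.pcolormesh(xs, ys, values, shading='auto')
    plt.colorbar(label='model value')
    plt.scatter(D[:, 0], D[:, 1], c='black')   # training data
    plt.show()
```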

1.2.1 Model Results for the Xor Data Set

The Xor data set has one direction set (because our algorithm coalesces adjacent SV-minimizing intervals with the same SV value into one interval). Samples of a Two Layer One Weight model, a Two Layer Sum model, and a Three Layer Binary model are shown in Figs. 3, 4, and 5. All of these interpolate the Xor data set.

Fig. 4

Color coding of values of a 2LS Model which interpolates the Xor data set (shown as Black Points); equivalent to another 2L1W model because there is only one direction set

Fig. 5

Color coding of values of an Xor BIN Model which interpolates the Xor data set (shown as Black Points); for BIN models the values are constant on the binary sets

Fig. 6

Generalized 7x7 Xor data set color coded by GXor value, red indicates value 0 and blue indicates value 1

Fig. 7

Color-coding indicates SV values for generic directions of GXOR. Non-generic directions are Black; the two pairs of SV minimizing intervals are colored in the darkest blue

Fig. 8

Color coding of values of a 2L1W Model interpolating the Generalized Xor data set (shown as Black points) for a randomly chosen weight in one of two pairs of SV minimizing intervals; color coding shows values are constant on directions perpendicular to the weight

Fig. 9

Color coding of values of a 2L1W Model interpolating the Generalized Xor data set (shown as Black points) for a randomly chosen weight in the second of two SV minimizing intervals; color coding shows values are constant on directions perpendicular to the weight

Fig. 10

Average over direction sets of 2LS models interpolating the generalized Xor data set (shown as Black Points); different directions result in a cubical pattern

Fig. 11

Average over direction sets of BIN models interpolating the generalized Xor data set (shown as Black Points); BIN models are constant on the binary sets

Fig. 12

Ratings data set: Red = ‘Dislike’, Blue = ‘Like’, Black = ‘Test Point’; note the test point (black diamond) is closer to the blue points

Fig. 13

Color coding of SV values for generic directions of Ratings Data Set, non-generic directions in black; SV minimized on darkest blue symmetric pair of intervals

Fig. 14

Color coding of values of a 2L1W Model interpolating the Ratings data set (shown as black points) for a randomly chosen weight in the SV minimizing interval pair; note the predicted value on the test point is closer to 1 than to 0

Fig. 15

Color coding of values of average of 2LS Models interpolating the ratings data set (shown as black points); note the predicted value on the test point

Fig. 16

Color coding of values of three layer BIN model interpolating the ratings data set (shown as black points); note the predicted value on the test point is 1 = ‘Like’

1.3 Model Results for the Generalized Xor Data Set

The generalized Xor data set is graphed in Fig. 6. Figure 7 shows that there are two pairs of intervals where sequential variation is minimized. Hence sampling from each pair of the intervals results in two Two Layer One Weight (2L1W) models representative of all 2L1W models using a weight which minimizes sequential variation. The graphs of these two models are shown in Figs. 8 and 9. Geometrically, 2L1W models generalize in strips perpendicular to their generic weight direction.
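A Two Layer One Weight model can be sketched along these lines: project the data onto the chosen weight direction, place one Relu unit per breakpoint with data-dependent biases, and obtain the top-layer coefficients in closed form so that the resulting piece-wise linear ridge function interpolates the labels. This is the standard one-dimensional Relu interpolation construction, offered as an illustration consistent with the abstract rather than the paper's exact formulas; fit_2l1w is an illustrative name.

```python
import numpy as np

def fit_2l1w(D, y, w):
    """Sketch of a Two Layer One Weight model: the value at x depends only on
    the projection t = w . x.  Assumes w is generic for D (distinct projections)."""
    t = D @ w
    order = np.argsort(t)
    t = t[order]
    vals = np.asarray(y, dtype=float)[order]
    slopes = np.diff(vals) / np.diff(t)       # slope on each segment between data points
    coeffs = np.diff(slopes, prepend=0.0)     # one Relu coefficient per breakpoint
    breakpts = t[:-1]                         # data-dependent hidden-layer biases

    def model(X):
        s = X @ w                                                  # project onto the weight
        hidden = np.maximum(s[:, None] - breakpts[None, :], 0.0)   # Relu hidden units
        return vals[0] + hidden @ coeffs                           # closed-form top layer
    return model
```

Because the model depends on x only through w · x, its values are automatically constant on lines perpendicular to w, which is the strip-like generalization behavior noted above.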

There is one type of Two Layer Sum model for GXOR for each of the four direction sets. The direction sets are computed by minimizing the sequential variation for the associated functions y_v. Since the GXOR function has only two binary values, the associated value functions simplify: y_v = y for v = 1 and y_v = mod(1 + y, 2) for v = 0. Sampling from them is equivalent to sampling twice from each of the pairs of intervals minimizing sequential variation for GXOR. The average of the four resulting Two Layer Sum models is shown in Fig. 10.

There is one type of Three Layer Binary Model (BIN) for each direction set and each type of sequential-variation-minimizing direction for the resulting binary coordinates. The average of the BIN models after one sampling run is shown in Fig. 11.

1.3.1 Model Graphs for the Synthetic Movie Ratings Data Set

The small synthetic movie ratings data set is graphed in Fig. 12. A test point is graphed as a black diamond. Figure 13 shows that there is one pair of intervals where the sequential variation is minimized. This intuitively agrees with the geometry of the data set. Sampling finds the weight for a representative Two Layer One Weight model. The values of the 2L1W model on a rectangle containing the points are shown in Fig. 14. There is only one type of Two Layer Sum model since there is only one direction set, whose two weights are obtained by sampling twice from the pair of intervals where sequential variation is minimized. The values of a Two Layer Sum model on a rectangle containing the data points are shown in Fig. 15.

Fig. 17

Cluster data set color coded by the four values of function: Red = 1, Green = 2, Blue = 3, Magenta = 4; note two clusters are color coded green

Fig. 18

Cluster data set: small minimizing SV direction intervals in green and non-generic directions in black; there are four very small direction intervals (two pairs); fact: all of the projected clusters are separated in these directions

Fig. 19

Cluster data set: minimum SV direction intervals for y_v, v = 1; non-generic directions in black. The intervals are much larger than for y

Fig. 20

Cluster data set: minimum SV direction intervals for y_v, v = 2; non-generic directions in black. The intervals are much larger than for y

There is one type of Three Layer Binary model for the synthetic movie ratings data set, shown in Fig. 16.

Fig. 21

Cluster data set: minimum SV direction intervals for y_v, v = 3; non-generic directions in black. The intervals are larger than for y

Fig. 22

Cluster data set: minimum SV direction intervals for y_v, v = 4; non-generic directions in black. The intervals are much larger than for y. This together with the previous three figures shows that there are two direction sets for the cluster data set

The color coding for each of the graphed models enables comparison of the predicted values at the test point (graphed as a diamond). While the three types of average models (2L1W, 2LS, and BIN) for different samples vary in their predictions of the test point, an average of the average models over two hundred samples showed consistent predictions. Specifically, the averages for 2L1W, 2LS, and BIN were 0.6441, 0.6492, and 0.7433, respectively. Thus the person assigning the likes and dislikes to the ratings is more likely to like the movie represented by the test point. This is consistent with the natural guess from the data set geometry in Fig. 12.
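The quoted numbers are simple arithmetic means of the test-point predictions over repeated sampling runs; a sketch of that averaging loop (sample_weight and fit are hypothetical stand-ins for the paper's sampling and fitting steps) is:

```python
import numpy as np

def average_prediction(fit, sample_weight, D, y, x_test, runs=200):
    """Average the prediction at x_test over models fit with independently
    sampled SV-minimizing weights (sample_weight is assumed to return one)."""
    preds = [fit(D, y, sample_weight())(x_test[None, :])[0] for _ in range(runs)]
    return float(np.mean(preds))
```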

1.3.2 Model Graphs for the Cluster Data Set

The cluster data set is graphed in Fig. 17. Figure 18 shows the two pairs of small intervals where the sequential variation is minimized. Thus there are two types of Two Layer One Weight models. Values for exemplars of the two types of 2L1W models (using weights sampled from the two pairs of intervals) are graphed in Figs. 23 and 24.

Fig. 23

Clustered data set: 2L1W model—weight randomly sampled from the first of two direction sets

Fig. 24

Clustered data set: 2L1W weight randomly sampled from the second of two direction sets

There are two types of Two Layer Sum models for the clustered data set because there are two direction sets. Recall that each direction set has an interval minimizing sequential variation for y_v for each value v of y. Here y has four values. Figures 19, 20, 21, and 22 show that values one, three, and four each have only one pair of minimizing intervals, while value two has two pairs of minimizing intervals. Hence there are only two direction sets (Figs. 23 and 24). Figure 25 shows the average of the two Two Layer Sum models for the clustered data set, using weights sampled from the intervals for the direction sets.

Fig. 25

Clustered data set: average over direction sets of 2LS models

There are 2^2 types of Three Layer Binary Models for the clustered data set. The average of the BIN models after one sampling run is shown in Fig. 26.

Fig. 26

Clustered data set: average over direction sets of BIN models; values of BIN models are constant on the binary sets

1.4 Result Figures

Figures illustrating four small two-dimensional data sets are shown: the four-point Xor data set, the eight by eight generalized Xor data set, a synthetic movie ratings data set, and a clustered data set. The color-coding indicates the function values on the data set.

Following the figure for each data set are figures showing the values of the sequential variation function(s) for each generic direction (i.e. each point of the unit circle) and the non-generic points on the unit circle for the data set. In all of the graphs, the intervals where sequential variation is minimized are indicated by color-coding. The points on the unit circle corresponding to non-generic (NG) directions, which form the boundaries of the intervals where the sequential variation assumes constant values, are also shown and colored in black.

For each data set, the sequential variation figures are followed by value figures for a sample of each of the three types of models. The values are computed for a fine grid in a small rectangle containing the data set and shown via color-coding corresponding to the key on the right hand side of the figure. All of the models interpolate the data sets. The original data set points are graphed as black dots. For some data sets, a test point, graphed as a large black diamond, is shown in each model figure. This enables visual checking of the consistency of the sample models’ predictions. (The predictions would be more consistent if one sampled many times and averaged, for each model type.)

The figures are explained and referenced in sections “Description of Sequential Variation Results” and “Description of Model Results”, and the captions are also intended to be informative.


Copyright information

© 2021 The Authors and the Association for Women in Mathematics

About this chapter


Cite this chapter

Ness, L. (2021). Fitting Small Piece-Wise Linear Neural Network Models to Interpolate Data Sets. In: Demir, I., Lou, Y., Wang, X., Welker, K. (eds) Advances in Data Science. Association for Women in Mathematics Series, vol 26. Springer, Cham. https://doi.org/10.1007/978-3-030-79891-8_7

