Abstract
The simplest neural network models for functions are compositions of layer functions, where each layer applies a nonlinear activation function σ (not a polynomial) component-wise to an affine linear function Wx + B. The rows of W are referred to as ‘weights’ and B is the ‘bias vector’. The Universal Approximation Theorem for neural networks implies that any continuous function on a compact set can be approximated arbitrarily precisely by this class of models. This rich class of functions is proving to be remarkably useful and underpins the explosion of artificial intelligence applications. Neural network architectures typically compose many layer functions defined in terms of a huge number of parameters, which are fit by minimizing a loss function using gradient descent. Mathematically, one seeks a representation, computed by the composition of all but the final layer function, from which the final layer function (essentially an affine linear function) accurately predicts the function values. However, the use of many layers in combination with gradient descent yields models that are not easy to understand and visualize. This makes it difficult for mathematicians and engineers who are interested in neural networks and their mathematical underpinnings to acquire an initial concrete understanding. In this paper we focus on explicitly demonstrating the universal approximation property of neural network models for the most commonly used activation function, the Relu function. Relu models provide closed-form expressions for a huge class of piece-wise linear functions, a class of functions not typically studied. We exploit the particular nature of the Relu activation function to define three small piece-wise linear neural network models which interpolate a given finite set of training and label data while having a minimal number of linear pieces.
We avoid the use of gradient descent by proving, for these simple models, that the interpolation equations for the top-layer weight matrix W and bias B have a simple closed-form algebraic solution, provided the first-layer weights satisfy a mild generic property and the biases are defined in a particular data-dependent manner. We also prove that if the weights are further restricted to have norm one and to minimize a sequential variation property, the piece-wise linear models avoid over-fitting in the sense that they have a minimal number of pieces. We prove that sequential variation is constant on a finite number of sectors and hence easy to minimize and sample. The number of sectors is determined by the number of data points, not the dimension, avoiding the ‘curse of dimensionality’. The minimum sequential variation distribution characterizes the parameters of all of the simplest models, enabling computation of average models; this is not possible when gradient descent is used. One of our models adds an initial layer which uses the above concepts to construct a small binary tree representation, representing the training data in binary coordinates. The models can be computed using elementary mathematical operations for small, and possibly large, data sets. The concepts and models are illustrated by graphing results on small two-dimensional data sets. We hope that the example models will provide some concrete insights into Relu neural network models for representing functions. No knowledge of neural network models or deep learning methods is assumed.
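The overall setup, a hidden Relu layer followed by a top layer solved algebraically rather than by gradient descent, can be sketched in a few lines. This is an illustrative sketch only: the random first-layer weights, the hidden width m, and the use of least squares here are our assumptions, not the chapter's data-dependent construction of weights and biases.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data: the four-point Xor set (inputs X, labels y).
X = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
y = np.array([0., 1., 1., 0.])

# First layer: random weights W1 and biases b1 (hypothetical choices;
# the chapter instead uses SV-minimizing weights and data-dependent biases).
m = 16
W1 = rng.standard_normal((m, 2))
b1 = rng.standard_normal(m)

# Hidden representation: component-wise Relu of an affine map.
H = np.maximum(W1 @ X.T + b1[:, None], 0.0).T          # shape (4, m)

# Top layer: solve the linear interpolation equations for (W2, b2)
# by least squares instead of gradient descent.
A = np.hstack([H, np.ones((len(X), 1))])               # append bias column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
W2, b2 = coef[:-1], coef[-1]

def model(x):
    h = np.maximum(W1 @ x + b1, 0.0)
    return W2 @ h + b2

preds = np.array([model(x) for x in X])
print(np.max(np.abs(preds - y)))                       # ~0: exact interpolation
```

With generic first-layer parameters the four feature rows are linearly independent, so the underdetermined linear system is consistent and the least-squares solution interpolates the labels exactly.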
References
Arora, S., Du, S., Hu, W., Li, Z., Wang, R.: Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks. In ICML (2019).
Arora, S., Du, S., Hu, W., Salakhutdinov, R., Wang, R.: On exact computation with an infinitely wide neural net. In NeurIPS (2019).
Bach, F.: Breaking the Curse of Dimensionality with Convex Neural Networks. J Mach Learn Res 18, 1–53 (2017).
Balestriero, R., Baraniuk, R.: A spline theory of deep networks. In ICML (2018).
Balestriero, R., Baraniuk, R.: Mad Max: Affine Spline Insights into Deep Learning. arXiv:1805.06576.
Bengio, Y., Le Roux, N., Vincent, P., Delalleau, O., Marcotte, P. : Convex neural networks. In NIPS (2005).
Chizat, L., Bach, F.: On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport. In NIPS (2018).
Cooper, Y. : The loss landscape of overparametrized neural networks. https://arxiv.org/abs/1804.10200
Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signals Systems 2, 303–314 (1989).
Du, S., Zhai, X., Poczos, B., Singh, A.: Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054 (2018).
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression Trees. Wadsworth, 1984.
Guilhoto, L.F.: An Overview Of Artificial Neural Networks for Mathematicians. http://math.uchicago.edu/~may/REU2018/REUPapers/Guilhoto.pdf.
Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. The MIT press (2016).
Haeffele, B.D., Vidal, R.: Global Optimality in Neural Network Training. CVPR (2017).
Hornik, K., Stinchcombe, M., White, H: Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366 (1989).
Jacot, A., Gabriel, F., Hongler, C. : Neural tangent kernel: Convergence and generalization in neural networks. arXiv preprint arXiv:1806.07572, 2018.
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet Classification with Deep Convolutional Neural Networks. In NIPS (2012).
Kunin, D., Bloom, J., Goeva, A., Seed, C.: Loss Landscapes of Regularized Linear Autoencoders. PMLR 97:3560-3569 (2019). http://proceedings.mlr.press/v97/kunin19a.html
Klusowski, J.: Sparse Learning with CART. In NeurIPS (2020).
Le, Q.: A Tutorial on Deep Learning Part 1: Nonlinear Classifiers and The Backpropagation Algorithm. http://ai.stanford.edu/~quocle/tutorial1.pdf.
Leshno, M., Lin, V.Y., Pinkus, A., Schocken, S.: Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6, 861–867 (1993).
Lin, J., Zhong, C., Hu, D., Rudin, C., Seltzer, M.: Generalized and Scalable Optimal Sparse Decision Trees. In ICML (2020).
Clark, K., Khandelwal, U., Levy, O., Manning, C.D.: What Does BERT Look At? An Analysis of BERT’s Attention. In ACL (2019).
Mei, S., Montanari, A., Nguyen, P.-M.: A mean field view of the landscape of two-layer neural networks. arXiv preprint arXiv:1804.06561 (2018).
Minsky, M., Papert, S.: Perceptrons: An Introduction to Computational Geometry. MIT Press (1969).
Needell, D., Ward, R.: Stable Image Reconstruction Using Total Variation Minimization. SIAM J. Imaging Sci. 6(2), 1035–1058 (2013).
Ongie, G., Willett, R., Soudry, D., Srebro, N.: A Function Space View of Bounded Norm Infinite Width ReLU Nets: The Multivariate Case. In ICLR (2020). arXiv:1910.01635.
Senior, A.W., Evans, R., Jumper, J., et al.: Improved protein structure prediction using potentials from deep learning. Nature 577, 706–710 (2020).
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Fan, H., Sifre, L., van den Driessche, G., Graepel, T., Hassabis, D.: Mastering the game of Go without human knowledge. Nature 550(7676), 354–359 (2017).
Vidal, R.: Mathematics of Deep Learning. http://cis.jhu.edu/~rvidal/talks/learning/Tutorial-Math-Deep-Learning-2018.pdf.
Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning requires rethinking generalization. In ICLR (2017).
Acknowledgements
The author thanks the National Science Foundation for support under grant number 1934924 to Rutgers University, ICERM for hosting and supporting the July 2019 WiSDM Research Collaboration Workshop in Mathematics and Data Science, and the National Science Foundation for the 5-year AWM ADVANCE grant, September 1, 2015 through August 31, 2020.
Appendix: Results on Example 2D Data Sets
In this section we describe four small two-dimensional data sets: the Xor data set, a generalized Xor data set, a small synthetic movie ratings data set, and a cluster data set. We then describe their sequential variation functions on the unit circle and models for each of the data sets computed from a sample of direction sets. The descriptions reference figures placed in section “Result Figures”. The captions in the figures summarize the key points they illustrate.
1.1 Description of Sequential Variation Results
In the figures illustrating sequential variation, the intervals where sequential variation is minimized are indicated by color-coding. The points on the unit circle corresponding to non-generic (NG) directions, which form the boundaries of the intervals where sequential variation assumes constant values, are also shown, colored in black.
The Xor data set is shown in Fig. 1. Its generic directions, colored by their sequential variation values, are shown in Fig. 2. The four non-generic directions for D are color-coded in black. All generic directions have the same sequential variation value of two, so no distinguished directions are detected. Sequential variation is not well-defined for the non-generic directions.
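The chapter's formal definition of sequential variation is not reprinted here; under one natural reading, sort the data by projection onto a direction w and count the label changes along the sorted order, the Xor values above can be reproduced. The function and variable names below are our own sketch, not the chapter's code.

```python
import numpy as np

def sequential_variation(X, y, w, tol=1e-12):
    """Number of label changes when the data are sorted by projection on w.

    Returns None for non-generic directions (tied projections), where
    sequential variation is left undefined.
    """
    p = X @ w
    order = np.argsort(p)
    if np.min(np.diff(p[order])) < tol:   # a tie means w is non-generic
        return None
    ys = np.asarray(y)[order]
    return int(np.count_nonzero(ys[1:] != ys[:-1]))

# The four-point Xor data set.
X = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
y = np.array([0, 1, 1, 0])

w_generic = np.array([1.0, 0.3])   # a generic direction
w_nongen  = np.array([1.0, 0.0])   # ties two projections: non-generic
print(sequential_variation(X, y, w_generic))  # 2
print(sequential_variation(X, y, w_nongen))   # None
```

Sorting by the generic projection orders the labels as 0, 1, 1, 0, giving the constant value two reported for all generic directions of the Xor data set.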
The generalized Xor data set on an eight by eight grid, with its color-coded values, is shown in Fig. 6. Its generic directions, color-coded by their sequential variation values, are shown in Fig. 7. This is a challenging data set because the function value at a point cannot be predicted from the values of its nearest neighbors. The non-generic directions are color-coded in black. Sequential variation is not constant for this data set. There are four disjoint intervals of minimum SV directions colored in dark blue (two symmetric pairs, paired by reflection through the origin).
A small synthetic movie ratings data set is shown in Fig. 12. This data set was obtained from the deep learning tutorial [20]. The data points are two people’s ratings of twelve movies. A third person’s Like and Dislike values (blue and red, respectively) are shown for eleven of the twelve movie data points. The remaining movie data point is shown as a large black diamond, to indicate that its like/dislike value needs to be predicted by a model. From the graph we see that there is a whole swath of lines that separate the liked movies from the disliked movies. This is corroborated by one pair of intervals where the sequential variation is minimized. It also appears that the test point might be closer to the liked movies. We can use this observation to sanity-check the model predictions.
A clustered data set and its color-coded values are shown in Fig. 17. The color-coding is Red = 1, Green = 2, Blue = 3, Magenta = 4. Two clusters are green (value 2); there is only one cluster for each of the other colors (values). For this data set there are four very small direction intervals (two pairs) where sequential variation is minimized, and many non-generic points. These are the small intervals of directions in which all of the projected clusters are separated. Figure 18 shows the four small color-coded minimum SV direction intervals (two pairs of opposites) and shows the non-generic points in black.
However, the geometry of the clustered data set indicates that it is much easier to find lines that separate the cluster(s) with one value from all of the others. This motivates computation of the sequential variation functions for the Boolean functions y_v = 1_v(y(D)) (the indicator of the value v) associated with each value. The cluster data set function y has four values: 1, 2, 3, and 4. The intervals where sequential variation is minimized for the functions y_1, y_2, y_3, and y_4 are shown in Figs. 19, 20, 21 and 22. These figures show that for the cluster data set, the minimum SV(y_v) intervals are larger for each of the individual values v than for SV(y). Projections of the data set on an SV-minimizing direction for SV(y_v) optimally separate the data points with value v from the other data points.
1.2 Description of Model Results
The models for each data set are computed on a sample of finely spaced grid points from a rectangle containing the data set and color-coded by model value. The given training data set is graphed as black points. In some cases, a single test value is graphed as a large diamond to enable visual comparison of the consistency of the different models’ values for one data point. All models interpolate the given data set. To reduce the number of models graphed, and to give a feeling for the different geometry of the models, the average of each type of model over the direction set samples is often shown.
1.2.1 Model Results for the Xor Data Set
The Xor data set has one direction set (because our algorithm coalesces adjacent SV-minimizing intervals with the same SV value into one interval). A sample of a Two Layer One Weight model, a Two Layer Sum model, and a Three Layer Binary model are shown in Figs. 3, 4, and 5. All of these interpolate the Xor data set.
1.2.2 Model Results for the Generalized Xor Data Set
The generalized Xor data set is graphed in Fig. 6. Figure 7 shows that there are two pairs of intervals where sequential variation is minimized. Hence sampling from each pair of the intervals results in two Two Layer One Weight (2L1W) models representative of all 2L1W models using a weight which minimizes sequential variation. The graphs of these two models are shown in Figs. 8 and 9. Geometrically, 2L1W models generalize in strips perpendicular to their generic weight direction.
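A Two Layer One Weight model can be sketched concretely: project the data on the single weight direction, then interpolate the labels along that line with a closed-form piece-wise linear Relu expansion, with no gradient descent. The function name and the particular direction below are our own illustrative choices; the Relu expansion is the standard one-dimensional construction, which we believe matches the spirit of the chapter's closed-form solution.

```python
import numpy as np

def fit_2l1w(X, y, w):
    """Sketch of a Two Layer One Weight model for a generic direction w.

    Sorts the projections t_i = <x_i, w> (assumed distinct, i.e. w generic)
    and returns the piece-wise linear interpolant written in the Relu basis:
        f(x) = y_0 + sum_i a_i * Relu(<x, w> - t_i).
    """
    t = X @ w
    order = np.argsort(t)
    t, yv = t[order], np.asarray(y, float)[order]
    s = np.diff(yv) / np.diff(t)                  # slopes between knots
    a = np.concatenate([[s[0]], np.diff(s)])      # Relu coefficients
    def model(x):
        return yv[0] + a @ np.maximum(x @ w - t[:-1], 0.0)
    return model

# Xor data with a norm-one generic direction (a hypothetical sample).
X = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
ylab = np.array([0., 1., 1., 0.])
w = np.array([1.0, 0.3]) / np.hypot(1.0, 0.3)

model = fit_2l1w(X, ylab, w)
print(max(abs(model(x) - v) for x, v in zip(X, ylab)))  # ~0: interpolation
```

Because the model depends on x only through the projection onto w, its value is constant along lines perpendicular to w, which is exactly the strip-shaped generalization described above.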
There is one type of Two Layer Sum model for GXOR for each of the four direction sets. The direction sets are computed by minimizing the sequential variation for the associated functions y_v. Since the GXOR function has only two binary values, the associated value functions simplify: y_v = y for v = 1 and y_v = mod(1 + y, 2) for v = 0. Sampling from them is equivalent to sampling twice from each pair of intervals minimizing sequential variation for GXOR. The average of the four resulting Two Layer Sum models is shown in Fig. 10.
There is one type of Three Layer Binary Model (BIN) for each Direction Set and each type of minimizing sequential variation direction for the resulting binary coordinates. The average of the BIN models after one sampling run is shown in Fig. 11.
1.2.3 Model Graphs for the Synthetic Movie Ratings Data Set
The small synthetic movie ratings data set is graphed in Fig. 12, with a test point graphed as a black diamond. Figure 13 shows that there is one pair of intervals where the sequential variation is minimized, which agrees intuitively with the geometry of the data set. Sampling finds the weight for a representative Two Layer One Weight model. The values of the 2L1W model on a rectangle containing the points are shown in Fig. 14. There is only one type of Two Layer Sum model since there is only one direction set, whose two weights are obtained by sampling twice from the pair of intervals where sequential variation is minimized. The values of a Two Layer Sum model on a rectangle containing the data points are shown in Fig. 15.
There is one type of Three Layer Binary model for the synthetic movie ratings data set, shown in Fig. 16.
The color-coding for each of the graphed models enables comparison of the predicted values for the test point (graphed as a diamond). While the three types of average models (2L1W, 2LS, and BIN) for different samples vary in their predictions for the test point, an average of the average models over two hundred samples showed consistent predictions. Specifically, the averages for 2L1W, 2LS, and BIN were 0.6441, 0.6492, and 0.7433, respectively. Thus the person assigning the likes and dislikes to the ratings is more likely to like the movie represented by the test point. This is consistent with the natural guess from the data set geometry in Fig. 12.
1.2.4 Model Graphs for the Cluster Data Set
The cluster data set is graphed in Fig. 17. Figure 18 shows the two pairs of small intervals where the sequential variation is minimized. Thus there are two types of Two Layer One Weight models. Values for exemplars of the two types of 2L1W models (using weights sampled from the two pairs of intervals) are graphed in Figs. 23 and 24.
There are two types of Two Layer Sum models for the clustered data set because there are two direction sets. Recall that each direction set has an interval minimizing sequential variation for y_v for each value v of y. Here y has four values. Figures 19, 20, 21, 22 show that values one, three, and four each have only one pair of minimizing intervals, while value two has two pairs of minimizing intervals. Hence there are only two direction sets (Figs. 23 and 24). Figure 25 shows the average of the two Two Layer Sum models for the clustered data set using weights sampled from the intervals for the direction sets.
There are 22 types of Three Layer Binary Models for the clustered data set. The average of the BIN models after one sampling run is shown in Fig. 26.
1.3 Result Figures
Figures illustrating four small two-dimensional data sets are shown: the four-point Xor data set, the eight by eight generalized Xor data set, a synthetic movie ratings data set, and a clustered data set. The color-coding indicates the function values on each data set.
Following the figure for each data set are figures showing the values of the sequential variation function(s) for each generic direction (i.e., each point of the unit circle) and the non-generic points on the unit circle for the data set. In all of the graphs, the intervals where sequential variation is minimized are indicated by color-coding. The points on the unit circle corresponding to non-generic (NG) directions, which form the boundaries of the intervals where sequential variation assumes constant values, are also shown, colored in black.
For each data set, the sequential variation figures are followed by value figures for a sample of each of three types of models. The values are computed on a fine grid in a small rectangle containing the data set and shown via color-coding corresponding to the key on the right-hand side of the figure. All of the models interpolate the data sets. The original data set points are graphed as black dots. For some data sets, a test point, graphed as a large black diamond, is shown in each model figure. This enables visual checking of the consistency of the sample models’ predictions. (The predictions would be more consistent if, for each model type, many samples were taken and averaged.)
The figures are explained and referenced in sections “Description of Sequential Variation Results” and “Description of Model Results”, and the captions are also intended to be informative.
Copyright information
© 2021 The Authors and the Association for Women in Mathematics
Cite this chapter
Ness, L. (2021). Fitting Small Piece-Wise Linear Neural Network Models to Interpolate Data Sets. In: Demir, I., Lou, Y., Wang, X., Welker, K. (eds) Advances in Data Science. Association for Women in Mathematics Series, vol 26. Springer, Cham. https://doi.org/10.1007/978-3-030-79891-8_7
Print ISBN: 978-3-030-79890-1
Online ISBN: 978-3-030-79891-8
eBook Packages: Mathematics and Statistics (R0)