Abstract
Limitations of shallow (one-hidden-layer) perceptron networks are investigated with respect to computing multivariable functions on finite domains. Lower bounds are derived on the growth of the number of network units or the sizes of output weights in terms of variations of the functions to be computed. A concrete construction is presented of a class of functions that cannot be computed by signum or Heaviside perceptron networks with numbers of units and sizes of output weights considerably smaller than the sizes of the functions' domains. A subclass of these functions is described whose elements can be computed by two-hidden-layer perceptron networks with a number of units depending linearly on the logarithm of the size of the domain.
References
Ba LJ, Caruana R (2014) Do deep networks really need to be deep? In: Ghahramani Z et al (eds) Advances in neural information processing systems, vol 27, pp 1–9
Barron AR (1992) Neural net approximation. In: Narendra K (ed) Proceedings 7th Yale workshop on adaptive and learning systems, pp 69–72. Yale University Press
Barron AR (1993) Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans Inf Theory 39:930–945
Bengio Y (2009) Learning deep architectures for AI. Foundations and Trends in Machine Learning 2:1–127
Bengio Y, Delalleau O, Roux NL (2006) The curse of highly variable functions for local kernel machines. In: Advances in neural information processing systems 18, pp 107–114. MIT Press
Bengio Y, LeCun Y (2007) Scaling learning algorithms towards AI. In: Bottou L, Chapelle O, DeCoste D, Weston J (eds) Large-scale kernel machines. MIT Press
Bianchini M, Scarselli F (2014) On the complexity of neural network classifiers: a comparison between shallow and deep architectures. IEEE Trans Neural Netw Learning Syst 25(8):1553–1565
Candès EJ (2008) The restricted isometry property and its implications for compressed sensing. C R Acad Sci Paris I 346:589–592
Candès EJ, Tao T (2005) Decoding by linear programming. IEEE Trans Inf Theory 51:4203–4215
Cover T (1965) Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans Electron Comput 14:326–334
Erdős P, Spencer JH (1974) Probabilistic methods in combinatorics. Academic Press
Fine TL (1999) Feedforward neural network methodology. Springer, Berlin Heidelberg
Gnecco G, Sanguineti M (2011) On a variational norm tailored to variable-basis approximation schemes. IEEE Trans Inf Theory 57:549–558
Hinton GE, Osindero S, Teh YW (2006) A fast learning algorithm for deep belief nets. Neural Comput 18:1527–1554
Ito Y (1992) Finite mapping by neural networks and truth functions. Mathematical Scientist 17:69–77
Kainen PC, Kůrková V, Sanguineti M (2012) Dependence of computational models on input dimension: tractability of approximation and optimization tasks. IEEE Trans Inf Theory 58:1203–1214
Kainen PC, Kůrková V, Vogt A (1999) Approximation by neural networks is not continuous. Neurocomputing 29:47–56
Kainen PC, Kůrková V, Vogt A (2000) Geometry and topology of continuous best and near best approximations. J Approx Theory 105:252–262
Kainen PC, Kůrková V, Vogt A (2007) A Sobolev-type upper bound for rates of approximation by linear combinations of Heaviside plane waves. J Approx Theory 147:1–10
Kecman V (2001) Learning and soft computing. MIT Press, Cambridge
Kůrková V (1997) Dimension-independent rates of approximation by neural networks. In: Warwick K, Kárný M (eds) Computer-intensive methods in control and signal processing. The curse of dimensionality, pp 261–270. Birkhäuser, Boston
Kůrková V (2008) Minimization of error functionals over perceptron networks. Neural Comput 20:250–270
Kůrková V (2012) Complexity estimates based on integral transforms induced by computational units. Neural Netw 33:160–167
Kůrková V (2016) Lower bounds on complexity of shallow perceptron networks. In: Jayne C, Iliadis L (eds) Engineering applications of neural networks. Communications in computer and information sciences, vol 629, pp 283–294. Springer
Kůrková V, Kainen PC (1994) Functionally equivalent feedforward neural networks. Neural Comput 6(3):543–558
Kůrková V, Kainen PC (1996) Singularities of finite scaling functions. Appl Math Lett 9(2):33–37
Kůrková V, Kainen PC (2014) Comparing fixed and variable-width Gaussian kernel networks. Neural Netw 57:23–28
Kůrková V, Sanguineti M (2002) Comparison of worst-case errors in linear and neural network approximation. IEEE Trans Inf Theory 48:264–275
Kůrková V, Sanguineti M (2008) Approximate minimization of the regularized expected error over kernel models. Math Oper Res 33:747–756
Kůrková V, Sanguineti M (2016) Model complexities of shallow networks representing highly varying functions. Neurocomputing 171:598–604
Kůrková V, Savický P, Hlaváčková K (1998) Representations and rates of approximation of real-valued Boolean functions by neural networks. Neural Netw 11:651–659
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436–444
MacWilliams F, Sloane NJA (1977) The theory of error-correcting codes. North-Holland, Amsterdam
Maiorov V, Pinkus A (1999) Lower bounds for approximation by MLP neural networks. Neurocomputing 25:81–91
Maiorov VE, Meir R (2000) On the near optimality of the stochastic approximation of smooth functions by neural networks. Adv Comput Math 13:79–103
Mhaskar HN, Liao Q, Poggio T (2016) Learning functions: when is deep better than shallow. Center for brains, minds & machines CBMM Memo No. 045v3, pp 1–12
Mhaskar HN, Liao Q, Poggio T (2016) Learning functions: when is deep better than shallow. Center for brains, minds & machines CBMM Memo No. 045v4, pp 1–12
Sloane NJA. A library of Hadamard matrices. http://www.research.att.com/njas/hadamard/
Sussmann HJ (1992) Uniqueness of the weights for minimal feedforward nets with a given input-output map. Neural Netw 5(4):589–593
Sylvester J (1867) Thoughts on inverse orthogonal matrices, simultaneous sign successions, and tessellated pavements in two or more colours, with applications to Newton’s rule, ornamental tile-work, and the theory of numbers. Phil Mag 34:461–475
Acknowledgments
This work was partially supported by the Czech Grant Agency grant GA15-18108S and institutional support of the Institute of Computer Science RVO 67985807.
Ethics declarations
Conflict of interest
The author declares that she has no conflict of interest.
Appendix
Proof of Lemma 1
Choose an expression of \(g \in P_d(X)\) as \(g(z) = \operatorname{sign}(a \cdot z + b)\), where \(z = (x,y) \in \mathbb{R}^{d_1} \times \mathbb{R}^{d_2}\), \(a \in \mathbb{R}^{d} = \mathbb{R}^{d_1} \times \mathbb{R}^{d_2}\), and \(b \in \mathbb{R}\). Let \(a_l\) and \(a_r\) denote the left and the right part of \(a\), respectively, i.e., \((a_l)_i = a_i\) for \(i = 1, \dots, d_1\) and \((a_r)_i = a_{d_1+i}\) for \(i = 1, \dots, d_2\). Then \(\operatorname{sign}(a \cdot z + b) = \operatorname{sign}(a_l \cdot x + a_r \cdot y + b)\). Let \(\rho\) and \(\kappa\) be permutations of the set \(\{1, \dots, n\}\) such that \(a_l \cdot x_{\rho(1)} \leq a_l \cdot x_{\rho(2)} \leq \dots \leq a_l \cdot x_{\rho(n)}\) and \(a_r \cdot y_{\kappa(1)} \leq a_r \cdot y_{\kappa(2)} \leq \dots \leq a_r \cdot y_{\kappa(n)}\).
Denote by \(M(g)^*\) the matrix obtained from \(M(g)\) by permuting its rows and columns by \(\rho\) and \(\kappa\), respectively. It follows from the definition of the permutations \(\rho\) and \(\kappa\) that each row and each column of \(M(g)^*\) starts with a (possibly empty) initial segment of \(-1\)'s followed by a (possibly empty) segment of \(+1\)'s. □
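The sorting argument of Lemma 1 can be illustrated numerically. The following sketch (our own code, not part of the paper; all variable names are ours) draws a random signum perceptron and random points, permutes rows by \(\rho\) and columns by \(\kappa\), and checks that every row and column of the resulting matrix is a run of \(-1\)'s followed by \(+1\)'s:

```python
import random

def sign(t):   # sign as used in the paper: -1 for t < 0, else +1
    return -1 if t < 0 else 1

random.seed(0)
d1 = d2 = 3
n = 6
xs = [[random.uniform(-1, 1) for _ in range(d1)] for _ in range(n)]
ys = [[random.uniform(-1, 1) for _ in range(d2)] for _ in range(n)]
a_l = [random.uniform(-1, 1) for _ in range(d1)]
a_r = [random.uniform(-1, 1) for _ in range(d2)]
b = random.uniform(-1, 1)

dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
# permutations rho, kappa: sort indices by the inner products a_l.x and a_r.y
rho = sorted(range(n), key=lambda i: dot(a_l, xs[i]))
kappa = sorted(range(n), key=lambda j: dot(a_r, ys[j]))
Mg = [[sign(dot(a_l, xs[rho[i]]) + dot(a_r, ys[kappa[j]]) + b)
       for j in range(n)] for i in range(n)]

def is_run(seq):
    # a +/-1 sequence is a (possibly empty) run of -1's followed by
    # +1's iff it is nondecreasing
    return sorted(seq) == list(seq)

assert all(is_run(row) for row in Mg)                                # rows
assert all(is_run([Mg[i][j] for i in range(n)]) for j in range(n))   # columns
```

Since the entries \( \operatorname{sign}(a_l \cdot x_{\rho(i)} + a_r \cdot y_{\kappa(j)} + b) \) are nondecreasing in each index after sorting, the checks hold for any draw of the data.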
Proof of Theorem 5
By Theorem 2,
\[\|f_M\|_{P_d(X)} \geq \frac{\|f_M\|^2}{\sup_{g \in P_d(X)} |\langle f_M, g\rangle|}.\]
The inner product of \(f_M\) with \(g\) is equal to the sum of products of corresponding entries of the matrices \(M\) and \(M(g)\), i.e., \(\langle f_M, g\rangle = \sum_{i,j=1}^{n} M_{i,j} M(g)_{i,j}\), and so it is invariant under permutations of rows and columns performed simultaneously on both matrices \(M\) and \(M(g)\).
Thus, without loss of generality, we can assume that each row and each column of \(M(g)\) starts with a (possibly empty) initial segment of \(-1\)'s followed by a (possibly empty) segment of \(+1\)'s. Otherwise, we reorder the rows and columns in both matrices \(M(g)\) and \(M\) by applying the permutations from Lemma 1.
To estimate \(\langle f_M, g\rangle = \sum_{i,j=1}^{n} M_{i,j} M(g)_{i,j}\), we define a partition of the matrix \(M(g)\) into a family of submatrices such that each submatrix from the partition has either all entries equal to \(-1\) or all entries equal to \(+1\) (see Fig. 1). We construct the partition of \(M(g)\) as the union of a sequence of families of submatrices (possibly some of them empty)
\[\mathcal{P}(g,k) = \{P(g,k,1), \dots, P(g,k,2^k)\}\]
defined recursively. To construct it, we also define an auxiliary sequence of families of submatrices
\[\mathcal{Q}(g,k) = \{Q(g,k,1), \dots, Q(g,k,2^k)\}\]
such that for each \(k\),
\[\mathcal{P}(g,1) \cup \dots \cup \mathcal{P}(g,k) \cup \mathcal{Q}(g,k)\]
is a partition of the whole matrix \(M(g)\).
First, we define \(\mathcal{P}(g,1) = \{P(g,1,1), P(g,1,2)\}\) and \(\mathcal{Q}(g,1) = \{Q(g,1,1), Q(g,1,2)\}\). Let \(r_{1,1}\) and \(c_{1,1}\) be such that the submatrix \(P(g,1,1)\) of \(M(g)\) formed by the entries from the first \(r_{1,1}\) rows and the first \(c_{1,1}\) columns of \(M(g)\) has all entries equal to \(-1\), and the submatrix \(P(g,1,2)\) formed by the entries from the last \(r_{1,2} = n - r_{1,1}\) rows and the last \(c_{1,2} = n - c_{1,1}\) columns of \(M(g)\) has all entries equal to \(+1\). Let \(Q(g,1,1)\) be the submatrix formed by the last \(r_{1,2} = n - r_{1,1}\) rows and the first \(c_{1,1}\) columns of \(M(g)\), and \(Q(g,1,2)\) the submatrix formed by the first \(r_{1,1}\) rows and the last \(c_{1,2} = n - c_{1,1}\) columns. So \(\{P(g,1,1), P(g,1,2), Q(g,1,1), Q(g,1,2)\}\) is a partition of \(M(g)\) into four submatrices (see Fig. 1).
Now, assume that the families \(\mathcal{P}(g,k)\) and \(\mathcal{Q}(g,k)\) are constructed. To define \(\mathcal{P}(g,k+1)\) and \(\mathcal{Q}(g,k+1)\), we divide each of the \(2^k\) submatrices \(Q(g,k,j)\), \(j = 1, \dots, 2^k\), into four submatrices \(P(g,k+1,2j-1)\), \(P(g,k+1,2j)\), \(Q(g,k+1,2j-1)\), and \(Q(g,k+1,2j)\) such that each of the submatrices \(P(g,k+1,2j-1)\) has all entries equal to \(-1\) and each of the submatrices \(P(g,k+1,2j)\) has all entries equal to \(+1\).
Iterating this construction at most \(\lceil \log_2 n \rceil\) times, we obtain a partition of \(M(g)\) formed by the union of the families of submatrices \(\mathcal{P}(g,k)\). It follows from the construction that for each \(k\), the numbers of rows \(\{r_{k,t} \mid t = 1, \dots, 2^k\}\) and the numbers of columns \(\{c_{k,t} \mid t = 1, \dots, 2^k\}\) of these submatrices satisfy
\[\sum_{t=1}^{2^k} r_{k,t} = n \quad \text{and} \quad \sum_{t=1}^{2^k} c_{k,t} = n.\]
Let \(\mathcal{P}(k) = \{P(k,1), \dots, P(k,2^k)\}\) be the family of submatrices of \(M\) formed by the entries from the same rows and columns as the corresponding submatrices from the family \(\mathcal{P}(g,k) = \{P(g,k,1), \dots, P(g,k,2^k)\}\) of submatrices of \(M(g)\).
To derive an upper bound on \(|\langle f_M, g\rangle|\), we express it as
\[\langle f_M, g\rangle = \sum_{k} \sum_{t=1}^{2^k} \sum_{(i,j) \in P(k,t)} M_{i,j}\, M(g)_{i,j},\]
where the inner sum runs over the positions of the entries of the submatrix \(P(k,t)\).
As all the matrices \(P(k,t)\) are submatrices of the Hadamard matrix \(M\), by the Lindsey lemma (Lemma 2), for each submatrix \(P(k,t)\) with \(r_{k,t}\) rows and \(c_{k,t}\) columns,
\[\Bigl|\sum_{(i,j) \in P(k,t)} M_{i,j}\Bigr| \leq \sqrt{r_{k,t}\, c_{k,t}\, n}.\]
All the matrices \(P(g,k,t)\) have either all entries equal to \(+1\) or all entries equal to \(-1\). Thus,
\[|\langle f_M, g\rangle| \leq \sum_{k=1}^{\lceil \log_2 n \rceil} \sum_{t=1}^{2^k} \sqrt{r_{k,t}\, c_{k,t}\, n}.\]
As \(\sum_{t=1}^{2^k} r_{k,t} = n\) and \(\sum_{t=1}^{2^k} c_{k,t} = n\) for all \(k\), we obtain by the Cauchy–Schwarz inequality
\[\sum_{t=1}^{2^k} \sqrt{r_{k,t}\, c_{k,t}} \leq \Bigl(\sum_{t=1}^{2^k} r_{k,t}\Bigr)^{1/2} \Bigl(\sum_{t=1}^{2^k} c_{k,t}\Bigr)^{1/2} = n.\]
Thus, for each \(k\),
\[\sum_{t=1}^{2^k} \sqrt{r_{k,t}\, c_{k,t}\, n} \leq n^{3/2}.\]
Hence, by (6),
\[|\langle f_M, g\rangle| \leq \lceil \log_2 n \rceil\, n^{3/2}.\]
So by (5),
\[\|f_M\|_{P_d(X)} \geq \frac{n^2}{\lceil \log_2 n \rceil\, n^{3/2}} = \frac{\sqrt{n}}{\lceil \log_2 n \rceil}.\]
□
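The recursive partition used in this proof can be made concrete. The sketch below is our own illustration (not the paper's code): it halves the rows of each mixed \(Q\)-block at every level, which is one admissible way to realize the construction, splits a sorted step matrix \(M(g)\) into constant \(P\)-blocks, and checks that the blocks tile the matrix:

```python
def step_matrix(al_prods, ar_prods, b):
    """M(g)_{ij} = sign(a_l.x_i + a_r.y_j + b), with the inner products
    a_l.x_i and a_r.y_j already sorted ascending (as after Lemma 1)."""
    return [[-1 if al + ar + b < 0 else 1 for ar in ar_prods]
            for al in al_prods]

def partition_constant_blocks(M):
    """Recursively split a matrix whose rows and columns each start
    with a run of -1's into all-(-1) and all-(+1) blocks (P-blocks)."""
    blocks = []  # (row-index list, column-index list) of constant blocks
    def split(rows, cols):
        if not rows or not cols:
            return
        vals = {M[i][j] for i in rows for j in cols}
        if len(vals) == 1:              # already constant: record a P-block
            blocks.append((rows, cols))
            return
        r1 = max(1, len(rows) // 2)     # halve the rows of the mixed block
        top, bottom = rows[:r1], rows[r1:]
        # length of the -1 prefix of the last "top" row within these columns
        c1 = sum(1 for j in cols if M[top[-1]][j] == -1)
        left, right = cols[:c1], cols[c1:]
        if top and left:
            blocks.append((top, left))      # all -1
        if bottom and right:
            blocks.append((bottom, right))  # all +1
        split(bottom, left)                 # the two mixed Q-blocks
        split(top, right)
    n = len(M)
    split(list(range(n)), list(range(n)))
    return blocks

M = step_matrix([-3, -1, 2, 4], [-2, 0, 1, 3], 0)
blocks = partition_constant_blocks(M)
# every block is constant, and together the blocks tile the whole matrix
assert all(len({M[i][j] for i in rs for j in cs}) == 1 for rs, cs in blocks)
cells = sorted((i, j) for rs, cs in blocks for i in rs for j in cs)
assert cells == [(i, j) for i in range(4) for j in range(4)]
```

Because the lengths of the \(-1\) prefixes are nonincreasing down the rows of a sorted step matrix, the chosen top-left block is all \(-1\) and the bottom-right block all \(+1\), and halving the rows bounds the recursion depth by roughly \(\log_2 n\), as in the proof.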
Proof of Theorem 6
Any \(2^k \times 2^k\) Sylvester–Hadamard matrix \(S(k)\) is equivalent to the matrix \(M(k)\) with rows and columns indexed by vectors \(u, v \in \{0,1\}^k\) and entries
\[M(k)_{u,v} = (-1)^{u \cdot v}\]
(see, e.g., [33]). Thus, without loss of generality, we can assume that \(S(k)_{u,v} = (-1)^{u \cdot v}\) (otherwise, we permute rows and columns).
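Sylvester's doubling construction and the claimed form of the entries can be verified directly; the following sketch (code and names ours) builds \(S(k)\) recursively and checks that \(S(k)_{u,v} = (-1)^{u \cdot v}\) with rows and columns indexed by the binary expansions of their positions:

```python
def sylvester(k):
    """Sylvester-Hadamard matrix S(k) of order 2^k via the recursive
    doubling S(k) = [[S(k-1), S(k-1)], [S(k-1), -S(k-1)]]."""
    S = [[1]]
    for _ in range(k):
        S = ([row + row for row in S] +
             [row + [-x for x in row] for row in S])
    return S

def bits(t, k):
    """Index t written as a k-bit 0/1 vector, most significant bit first."""
    return [(t >> (k - 1 - i)) & 1 for i in range(k)]

k = 3
S = sylvester(k)
n = 2 ** k
for s in range(n):
    for t in range(n):
        u, v = bits(s, k), bits(t, k)
        ip = sum(ui * vi for ui, vi in zip(u, v))   # u . v over {0,1}^k
        assert S[s][t] == (-1) ** ip
```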
To represent the function \(h_k: \{0,1\}^k \times \{0,1\}^k \to \{-1,1\}\) by a two-hidden-layer network, we first define \(k\) Heaviside perceptrons in the first hidden layer. Choose any bias \(b \in (1,2)\) and define input weights \(c^i = (c^{i,l}, c^{i,r}) \in \mathbb{R}^k \times \mathbb{R}^k\), \(i = 1, \dots, k\), as \(c^{i,l}_j = 1\) and \(c^{i,r}_j = 1\) when \(j = i\), and \(c^{i,l}_j = 0\) and \(c^{i,r}_j = 0\) otherwise. So for an input vector \(x = (u,v) \in \{0,1\}^k \times \{0,1\}^k\), the output \(y_i(x)\) of the \(i\)-th perceptron in the first hidden layer satisfies \(y_i(x) = \vartheta(c^i \cdot x - b) = 1\) if and only if both \(u_i = 1\) and \(v_i = 1\); otherwise, \(y_i(x)\) is equal to zero.
Let \(w = (w_1, \dots, w_k)\) be such that \(w_j = 1\) for all \(j = 1, \dots, k\). In the second hidden layer, define \(k\) perceptrons by \(z_j(y) := \vartheta(w \cdot y - j + 1/2)\). Finally, for all \(j = 1, \dots, k\), let the \(j\)-th unit from the second hidden layer be connected to one linear output unit with weight \((-1)^j\).
The two-hidden-layer network obtained in this way computes the function \(\sum_{j=1}^{k} (-1)^j \vartheta(w \cdot y(x) - j + 1/2)\), where \(y_i(x) = \vartheta(c^i \cdot x - b)\), i.e., it computes the function \(\sum_{j=1}^{k} (-1)^j \vartheta\bigl(\sum_{i=1}^{d/2} \vartheta(c^i \cdot x - b) - j + 1/2\bigr) = h_k(x) = h_k(u,v) = (-1)^{u \cdot v}\). □
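The construction in this proof can be traced in code. The sketch below is ours, not the paper's; note that the raw signed sum \(\sum_{j=1}^{k}(-1)^j z_j\) takes values in \(\{0,-1\}\), so we assume an affine output map \(t \mapsto 2t + 1\) (equivalently, output weights \(2(-1)^j\) and an output bias of 1) to land exactly in \(\{-1,+1\}\):

```python
from itertools import product

def heaviside(t):
    """Heaviside activation (the thresholds below never hit exactly 0)."""
    return 1 if t >= 0 else 0

def h_network(u, v, k):
    """Two-hidden-layer Heaviside network for h_k(u, v) = (-1)^{u.v}.
    Output layer uses weights 2*(-1)^j and bias 1 -- an adjustment we
    assume so that the output lies in {-1, +1}."""
    b = 1.5                                   # any bias in (1, 2)
    # first hidden layer: y_i = theta(u_i + v_i - b) = AND(u_i, v_i)
    y = [heaviside(u[i] + v[i] - b) for i in range(k)]
    s = sum(y)                                # w . y equals u . v
    # second hidden layer: z_j = theta(s - j + 1/2), i.e. 1 iff u.v >= j
    z = [heaviside(s - j + 0.5) for j in range(1, k + 1)]
    return 1 + 2 * sum((-1) ** j * zj
                       for j, zj in zip(range(1, k + 1), z))

k = 4
for u in product((0, 1), repeat=k):
    for v in product((0, 1), repeat=k):
        ip = sum(ui * vi for ui, vi in zip(u, v))
        assert h_network(u, v, k) == (-1) ** ip
```

The exhaustive check confirms that \(2k\) hidden Heaviside units suffice for \(h_k\) on a domain of size \(4^k\), i.e., a number of units linear in the logarithm of the domain size, as stated in the abstract.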
Kůrková, V. Constructive lower bounds on model complexity of shallow perceptron networks. Neural Comput & Applic 29, 305–315 (2018). https://doi.org/10.1007/s00521-017-2965-0