Abstract
In this paper we investigate deep neural networks for binary classification of datasets from a geometric perspective, in order to understand the working mechanism of deep neural networks. First, we establish a geometrical result on the injectivity of a finite set under a projection from Euclidean space to the real line. Then, by introducing the notions of alternative points and alternative number, we propose an approach to designing DNNs for binary classification of finite labeled points on the real line, thus proving the existence of a binary classification neural net whose hidden layers have width two and whose number of hidden layers is not larger than the cardinality of the finite labelled set. We also demonstrate geometrically how the dataset is transformed across every hidden layer in a narrow DNN setting for the binary classification task.
1 Introduction
Despite the striking success of deep neural networks in applications [7], deep neural networks remain to some extent a pile of techniques, like a “Jack of all trades”, operated as black boxes [1, 7, 8]. Therefore, one of the outstanding challenges is to understand the working mechanism of deep neural networks (DNNs) and to develop a solid mathematical framework for explaining their effectiveness and failures; a DNN is, in general, a sequential composition of functions. For a designed DNN, knowing how the dataset is transformed through each layer is of significance in understanding the working mechanism of DNNs. This remains a great challenge, and in the present paper we try to take a step forward along this line of research in the setting of binary classification, with the motivation that such research could shed light on interpreting some unexpected failures, such as shortcut learning in DNNs.
Up to now, the related work has mainly focused on so-called expressivity [2,3,4,5,6, 9, 10, 12,13,14], i.e., the ability of DNNs to produce accurate function approximations, all from the perspective of approximation theory. Some research also investigates the complexity of DNNs [10, 12] in terms of decision regions and decision boundaries, giving new insight into the advantage of depth for neural networks with piecewise linear activation functions. Recently, some authors have paid more attention to the connection of width and depth with expressivity [5, 6, 12], i.e., the ability of DNNs to approximate continuous functions. In particular, upper and lower bounds on the widths needed for a DNN to approximate continuous functions were obtained in [5, 6].
Unlike the aforementioned work, our motivation is to understand the working mechanism of deep neural networks geometrically. We are therefore more concerned with the geometric properties of data in the internal representations across the depth of a network, and with the dynamics of data transformation through the layers, and we try to answer general questions such as: how are width and depth related to a neural net in a classification task, and how is the dataset from the input space changed layer by layer through the hidden layers?
Because of the sequential composition structure of deep neural networks, it is helpful to investigate them from a mapping perspective. The purpose of this paper is to provide a model demonstrating, from this perspective, how a narrow DNN can carry out a binary classification task for a finite data set. This model gives a geometric demonstration of how a data set is transformed across the layers of a DNN for a classification task, which may be helpful in understanding the phenomenon of shortcut learning.
In this paper we are interested in the following question.
Let \(D_{q} \subset R^{n}\) be a dataset containing \(q\) points with two labels, such that \(D_{q} = C_{1} \cup C_{2}\), where \(C_{1}\) is the subset of \(D_{q}\) labelled 1 and \(C_{2}\) the subset labelled 2.
Question Can we design a deep neural network, of size as small as possible, that classifies the points in the above data set?
This question is of practical significance, and is surely a theoretical challenge from a mathematical point of view.
In many applications deep learning uses a feedforward neural network architecture, a composition of a sequence of maps, to learn a mapping from an input dataset to an output dataset; the final labelling is implemented by some simple classifying function. The basic idea can be described as follows. To go from one layer to the next, a weighted sum of the inputs from the previous layer is computed and the result is passed through a non-linear function, called the activation function. The most popular activation function is the rectified linear unit (ReLU), the one considered in the present paper. A composed map that is not the input or output layer is conventionally called a hidden layer. The hidden layers play the essential role of distorting the input space in a non-linear way so that the categories become linearly separable by the last layer; this is the point toward which all the arguments in the main body of this paper are directed.
In this paper we investigate deep neural networks for binary classification of datasets from a geometric perspective. By establishing a geometrical result on the injectivity of a finite set under a projection from Euclidean space to the real line, and by introducing the notions of alternative points and alternative number, we prove the existence of binary classification DNNs whose hidden layers have width two and whose number of hidden layers is not larger than the cardinality of the finite labelled set. This is a noticeable fact, to the knowledge of the author.
As shown in the sequel, in the proof we also demonstrate geometrically how the dataset is transformed across every hidden layer of a DNN for the binary classification task of finite labeled points in any Euclidean space. Moreover, we provide some models illustrating how a narrow DNN with width-two hidden layers can carry out a binary classification task for finite datasets.
We hope that the results in the present paper may shed light on the mechanism of how DNNs work and contribute to the interpretability of DNNs, and that they may inspire people to find more economical approaches to the design and development of new architectures of DNNs, despite not being practical for real datasets.
Now we first recall in this section some preliminaries for deep neural networks (DNNs) with ReLU activations. The typical nonlinear activation function is the rectified linear unit defined as \(\sigma_{R}(x) = \max \{x, 0\}\), \(x \in R\).
In vector form, we have the map \(\sigma_{R}:R^{n} \to R^{n}\) defined componentwise as \(\sigma_{R}(x) = (\max \{x_{1},0\},\dots,\max \{x_{n},0\})\) for \(x = (x_{1},\dots,x_{n})\).
In DNNs, the function computed by a hidden layer can be written as \(F_{i}(x) = \sigma_{R}(A_{i}(x)) = \sigma_{R}(W_{i}x + b_{i})\).
A deep neural network with \(\sigma_{R}\) activation is a continuous map \(Net:R^{n} \to R^{m}\), \(n > m\), that is the composition of a finite sequence of functions and is of the form
\(Net = F_{out} \circ F_{k} \circ \cdots \circ F_{1}, \quad F_{i} = \sigma_{R} \circ A_{i},\)  (1)
where the \(A_{i}\) are affine maps \(\begin{array}{*{20}c} {A_{i} \left( x \right) = W_{i} x + b_{i} ,} & {W_{i} \in M_{{m_{i} \times n_{i} }}^{{}} ,} & {b_{i} \in R^{{m_{i} }} } \\ \end{array}\), and \(F_{out}\) is a classifying function as defined in the next section.
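As a concrete illustration of the composition structure just described, the following minimal NumPy sketch (with arbitrarily chosen toy weights, not taken from the paper) builds a net with two width-2 hidden layers and a linear read-out:

```python
import numpy as np

def relu(x):
    """sigma_R: componentwise max(x, 0)."""
    return np.maximum(x, 0.0)

def hidden_layer(W, b):
    """F_i(x) = sigma_R(W x + b), one hidden layer of width len(b)."""
    return lambda x: relu(W @ x + b)

# A toy net R^2 -> R: two width-2 hidden layers and a linear read-out F_out
F1 = hidden_layer(np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([0.0, -1.0]))
F2 = hidden_layer(np.array([[1.0, -1.0], [-1.0, 1.0]]), np.array([0.5, 0.5]))
net = lambda x: float(np.dot([1.0, -1.0], F2(F1(x))))
```

Here the width of each hidden layer is the length of its bias vector, and the depth is the number of `hidden_layer` factors in the composition.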
Definition 1.1 Define the hidden layer space \(H_{R}\) as follows. A map \(f:R^{n} \to R^{m}\) satisfies \(f \in H_{R}\) if and only if \(f = \sigma_{R} \circ A\), with \(A(x) = Wx + b\) an affine map.
The width of the hidden layer map \(f\) is defined to be \(m\), the dimension of \(R^{m}\); that is, the number of neurons in the hidden layer in terms of neural networks. The depth of the net \(Net^{R}\) is defined to be the number of layers taken from \(H_{R}\), i.e., the number of hidden layers.
According to [1], the size \(Si\) of the above DNN is defined as follows.
Definition 1.2
The size of (1) is \(Si(Net^{R}) = \sum_{i=1}^{k} w_{i}\), where \(w_{i}\) denotes the width of the hidden layer \(F_{i}\).
Definition 1.3
The parameter capacity \(Cp\left( {Net^{R} } \right)\) of \(Net^{R}\) is defined to be the number of parameters of (1) that are adjustable.
Before developing the main theory we set up some notations.
Let \(P_{v} :R^{2} \to L_{v}\) be the projection map, \(L_{v} \subset R^{2}\) is the linear space spanned by \(v\), \(v = \left( {1,1} \right)\).
\(R_{ - }^{2} = \left\{ {x = \left( {x_{1} ,x_{2} } \right) \in R^{2} :x_{1} \le 0,x_{2} \le 0} \right\}\).
Consider two points \(x = \left( {x_{1} ,x_{2} } \right) \in R^{2}\) and \(y = \left( {y_{1} ,y_{2} } \right) \in R^{2}\); we say that \(x < y\) if \(x_{1} < y_{1}\) and \(x_{2} < y_{2}\).
For later arguments, we recall some basic properties of Relu maps.
Consider a two dimensional Relu map \(\sigma_{R} :R^{2} \to R^{2}\). We have the following observations.
(1) If \(x \notin {\text{int}}\,R_{ + }^{2}\), then \(\sigma_{R}(x) \in \partial R_{ + }^{2}\); in particular, \(\sigma_{R}\) maps any set avoiding the open positive quadrant into \(\partial R_{ + }^{2}\).

(2) If \(L \subset R^{2}\) is a one dimensional linear subspace satisfying \(L \cap R_{ + }^{2} = \{ 0\}\), then \(\sigma_{R}:L \to \partial R_{ + }^{2}\) is a homeomorphism.

(3) Every nonempty set \(S \subset R_{ - }^{2}\) satisfies \(\sigma_{R} \left( S \right) = \left\{ 0 \right\}\).
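These three observations can be checked numerically; the sketch below (an illustrative verification on random samples, not part of the construction) tests each property in turn:

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)
rng = np.random.default_rng(1)

# (1) any point outside the open positive quadrant lands on the boundary
for x in rng.normal(size=(1000, 2)):
    if min(x) <= 0:                  # x not in the interior of R_+^2
        assert min(relu(x)) == 0.0   # relu(x) has a zero coordinate

# (2) on the line spanned by (1, -1), which meets R_+^2 only at 0,
#     sigma_R is injective, mapping into the boundary of R_+^2
ts = np.linspace(-3.0, 3.0, 13)
images = {tuple(relu(np.array([t, -t]))) for t in ts}
assert len(images) == len(ts)

# (3) the closed negative quadrant collapses to the origin
for x in rng.uniform(-5.0, 0.0, size=(100, 2)):
    assert np.all(relu(x) == 0.0)
```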
The main result obtained in this paper can be stated as follows.
1.1 The main results
We first prove a geometric fact concerning linear maps on finite sets.
Result 1 Let \(D \subset R^{n}\) be a finite set. There exists a linear map \(L:R^{n} \to R\) such that \(L\) restricted to \(D\) is injective.
Based on this fact we can prove the following facts about DNN nets for the binary classification problem.
Result 2 Let \(D_{q} \subset R^{n}\) be a finite data set containing \(q\) points with two labels. Then there exists a DNN \(Net\) with the following properties.
(1) The hidden layers of \(Net\) are of width 2.

(2) The number of hidden layers of \(Net\) is not larger than \(q - 1\).

(3) \(Net\) can classify the data set \(D_{q}\).
Result 3 For the above labelled set, there exists a class of DNN classifiers, each member \(Net^{R}\) of which satisfies the inequality \(Si(Net^{R}) \le 2(q - 1)\).
The rest of this paper is organized as follows. We first give some preliminaries and introduce the notions of alternative points and alternative number for labeled sets on the real line \(R\) in Sect. 2. Then in Sect. 3 we prove a geometric fact connecting finite sets in a general Euclidean space with finite sets on the real line via a linear dimensionality reduction map. In Sect. 4, based on the alternative number and this geometric fact, we set up a DNN binary classification framework for labeled sets on the real line in the setting of ReLU activation. With this preparatory work we easily come to the conclusions of the main theorems in Sects. 5 and 6.
2 Alternative Points and Alternative Number of Labeled Set
As in the first section, consider the data set \(D_{q} \subset R^{n}\) containing \(q\) points with two labels, such that \(D_{q} = C_{1} \cup C_{2}\).
Definition 2.1
If there exists a continuous function \(\xi :R^{n} \to R\) such that
\(\xi (x) > 0\) for \(x \in C_{1}\),
\(\xi (x) < 0\) for \(x \in C_{2}\),
then \(\xi\) is called a classifying function. In particular, \(\xi\) is called a linear classifying function if it is a linear function.
Definition 2.2
Consider a compact set \(D \subset R^{n}\) with two labels, such that \(D = C_{1} \cup C_{2}\). A continuous map \(f:R^{n} \to R^{m}\) is said to be label preserving with respect to \(D\) if \(f(C_{1}) \cap f(C_{2}) = \emptyset\), so that the labels remain well defined on the image \(f(D)\).
Now consider a finite set \(A_{q} = \left\{ {a_{1} ,...,a_{q} } \right\} \subset R\) with \(a_{1} < a_{2} < \cdots < a_{q}\).
Definition 2.3
Suppose \(A_{q}\) can be divided into two labelled subsets \(C_{1}\) and \(C_{2}\). If \(a_{i}\) and \(a_{i - 1}\) belong to different labelled subsets, then \(a_{i}\) is called an alternative point of \(A_{q}\). The number of alternative points of \(A_{q}\) is called the alternative number of \(A_{q}\) and denoted by \(alt\;A_{q}\).
Example
Consider \(A = \left\{ {a_{1} ,a_{2} ,a_{3} } \right\} \subset R\). If \(a_{1} ,a_{2} \in C_{1}\) and \(a_{3} \in C_{2}\), then \(A\) has one alternative point \(a_{3}\), and \(alt\;A = 1\). If \(a_{1} ,a_{3} \in C_{1}\) and \(a_{2} \in C_{2}\), then \(A\) has two alternative points \(a_{2}\) and \(a_{3}\), and \(alt\;A = 2\).
The following fact is obvious.
Proposition 2.1 For any \(A_{q} = \left\{ {a_{1} ,...,a_{q} } \right\} \subset R\), we have \(alt\;A_{q} \le q - 1\).
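The alternative number is straightforward to compute; a minimal sketch (the function name is ours) for a labelled sequence already sorted along the real line:

```python
def alternative_number(labels):
    """alt A_q: number of indices whose label differs from the previous one,
    for a labelled sequence sorted along the real line."""
    return sum(1 for a, b in zip(labels, labels[1:]) if a != b)

# the two cases of the Example above
assert alternative_number([1, 1, 2]) == 1
assert alternative_number([1, 2, 1]) == 2
```

Since there are only \(q - 1\) consecutive pairs, the bound of Proposition 2.1 is immediate.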
3 A Geometric Fact Concerning Linear Map
First we give a geometric result concerning a linear map defined on a finite set.
Theorem 3.1
Let \(D_{q} = \left\{ {x_{1} ,...,x_{q} } \right\} \subset R^{n}\) be a finite set. There exists a linear map \(L:R^{n} \to R\) such that \(L\) restricted to \(D_{q}\) is injective.
Proof
Consider \(D_{q} \times D_{q}\) and let \(D_{\Delta } = \{ (x_{i} ,x_{j} ) \in D_{q} \times D_{q} :i \ne j\}\).
For each pair \((x_{i} ,x_{j} ) \in D_{\Delta }\), clearly \(v_{ij} = x_{i} - x_{j} \ne 0\). Now consider the \(n - 1\) dimensional linear subspace \(S_{ij}\) defined by \(v_{ij}\):
\(S_{ij} = \left\{ {x \in R^{n} : < x,v_{ij} > = 0} \right\}\), where \(< , >\) is the Euclidean inner product.
Let \(S = \bigcup_{i \ne j} S_{ij}\).
Then, being a finite union of proper subspaces, \(S\) has Lebesgue measure zero in \(R^{n}\), so \(R^{n} - S\) is nonempty. Take any vector \(u \in R^{n} - S\); then \(< u,x_{i} - x_{j} > \ne 0\) for all \(i \ne j\).
It follows that the linear map defined by \(L\left( x \right) = < u,x >\) is injective on \(D_{q}\). □
Remark
The argument in the above proof shows that if \(D \subset R^{n}\) consists of countably infinitely many points, the same statement still holds true, since \(S\) is then a countable union of measure-zero sets.
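Numerically, almost every direction \(u\) works; the sketch below (random sample data, assumed purely for illustration) checks injectivity of \(L(x) = \langle u, x\rangle\) on a finite set:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=(50, 5))      # 50 points in R^5
u = rng.normal(size=5)            # a generic direction: almost surely not in S
y = D @ u                         # L(x) = <u, x>
assert np.unique(y).size == 50    # L is injective on D
```

Since the excluded set \(S\) has measure zero, a random Gaussian direction avoids it with probability one.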
4 A Binary Classification Theory for Labeled Sets in Real Line
For the reader's convenience in grasping the main idea of the subsequent arguments, we first consider the simplest case, where \(D_{3} = \left\{ {x_{1} ,x_{2} ,x_{3} } \right\} \subset R^{1}\).
Proposition 4.1
For \(D_{3} = \{ x_{1} ,x_{2} ,x_{3} \} \subset R^{1}\), we have the following observations.
If \(alt\;D_{3} = 1\), then trivially a linear classifier exists.
If \(alt\;D_{3} = 2\), then we can construct a net with one width-two hidden layer in \(H_{R}\). Without loss of generality, assume that \(\left\{ {x_{1} ,x_{3} } \right\} \subset C_{1}\) and \(x_{2} \in C_{2}\).
Proof
Note that \(x_{1} < x_{2} < x_{3}\); let \(c = \tfrac{{x_{1} + x_{3} }}{2}\). Consider the affine map \(A_{1} :R \to R^{2}\) such that
\(A_{1} \left( R \right) \bot v\), \(A_{1} \left( c \right) = 0 \in R^{2}\).
Now define \(\sigma_{R} \circ A_{1} :R \to R^{2}\); then \(\sigma_{R} \circ A_{1} \left( R \right) \subseteq \partial R_{ + }^{2}\). It is easy to see that on the linear space \(L_{v}\) the map \(f_{1} = P_{v} \circ \sigma_{R} \circ A_{1}\) is label preserving and has the property
\(f_{1} \left( {x_{2} } \right) < f_{1} \left( {x_{1} } \right) = f_{1} \left( {x_{3} } \right)\).
It is obvious that we can define a linear map \(\overline{\xi }:R^{2} \to R\) to classify \(\left\{ {x_{1} ,x_{3} } \right\}\) and \(x_{2}\); for instance,
\(\overline{\xi } = v \cdot x - p\), with \(f_{1} \left( {x_{2} } \right) < p < f_{1} \left( {x_{1} } \right)\), identifying points of \(L_{v}\) with their scalar coordinates.
Defining \(\xi = \overline{\xi } \circ P_{v}\), the net \(Net = \xi \circ \sigma_{R} \circ A_{1} = \overline{\xi } \circ P_{v} \circ \sigma_{R} \circ A_{1}\) is the desired DNN. □
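The proof is constructive; the sketch below (our own parametrization of \(A_{1}\), with \(v = (1,1)\), given as one admissible realization) builds the width-two net for three labelled points on the line:

```python
import numpy as np

def width2_net(x1, x2, x3):
    """One width-2 hidden layer separating {x1, x3} (label 1) from x2
    (label 2), in the spirit of Proposition 4.1; requires x1 < x2 < x3."""
    c = (x1 + x3) / 2.0                           # A_1 sends c to the origin
    def f1(t):
        # sigma_R o A_1: the image lies on the boundary of R_+^2,
        # then project onto L_v with v = (1, 1) (scalar coordinate)
        hidden = np.maximum([t - c, c - t], 0.0)
        return float(np.dot(hidden, [1.0, 1.0]))  # equals |t - c|
    p = (abs(x2 - c) + (x3 - x1) / 2.0) / 2.0     # threshold between the levels
    return lambda t: 1 if f1(t) > p else 2

net = width2_net(-1.0, 0.5, 3.0)
assert [net(-1.0), net(0.5), net(3.0)] == [1, 2, 1]
```

Geometrically, the hidden layer folds the line at \(c\), so that \(x_{1}\) and \(x_{3}\) land on the same side of the threshold while \(x_{2}\) stays below it.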
Now let us turn to more general case.
Lemma 4.1
Suppose \(A_{q} = \left\{ {x_{1} ,...,x_{q} } \right\} \subset R\) can be divided into two labelled subsets \(C_{1}\) and \(C_{2}\). If \(alt\;A_{q} = 2\), then there exists a DNN classifier with no more than two hidden layers.
Proof
Without loss of generality, suppose \(x_{i}\) and \(x_{j}\), \(i < j\), are the two alternative points.
Case 1 \(i = 2,j = 3\). In this case, as shown in Proposition 4.1, we can define the map \(f_{1} = P_{v} \circ \sigma_{R} \circ A_{1}\), which is label preserving and has the property established there.
Case 2 \(i > 2,j = i + 1\). First define an affine map \(A_{1} :R \to R^{2}\) such that \(A_{1} \left( R \right) \subseteq L_{v}\) and
\(A_{1} \left( {x_{k} } \right) \in R_{ - }^{2}\), \(k = 1,2,...,i - 2\), \(A_{1} \left( {x_{i - 1} } \right) = 0 \in R^{2}\), and \(A_{1} \left( {x_{k} } \right) \in R_{ + }^{2}\) for \(k \ge i\). Then the map \(f_{1} = \sigma_{R} \circ A_{1}\) is label preserving, with \(f_{1} \left( {x_{k} } \right) = 0\) for \(k = 1,2,...,i - 1\) and \(f_{1} \left( {x_{k} } \right) = A_{1} \left( {x_{k} } \right) \in R_{ + }^{2}\) for \(k \ge i\). Moreover, the image points lie on \(L_{v}\) in the order \(0 = f_{1} (x_{1} ) = \cdots = f_{1} (x_{i - 1} ) < f_{1} (x_{i} ) < \cdots < f_{1} (x_{q} )\).
For this sequence of points on \(L_{v}\), we can apply the procedure of Case 1 to define a label preserving map \(f_{2} = P_{v} \circ \sigma_{R} \circ A_{2}\).
As in Proposition 4.1, we can define a map of the form
\(\overline{\xi } = v \cdot x - p\), with \(f_{2} \circ f_{1} \left( {x_{i} } \right) < p < f_{2} \circ f_{1} \left( {x_{j} } \right)\), identifying points of \(L_{v}\) with their scalar coordinates.
So \(Net = \xi \circ \sigma_{R} \circ A_{2} \circ \sigma_{R} \circ A_{1} = \overline{\xi } \circ P_{v} \circ \sigma_{R} \circ A_{2} \circ \sigma_{R} \circ A_{1}\) is a desired classifier.
Case 3 \(i > 2,j > i + 1\). In this general case we show that two hidden layers are enough to classify the data points. As in Case 2, we can construct \(f_{1} = \sigma_{R} \circ A_{1}\) so that \(0 = f_{1} (x_{1} ) = \cdots = f_{1} (x_{i - 1} ) < f_{1} (x_{i} ) < \cdots < f_{1} (x_{q} )\) on \(L_{v}\).
Let \(c = \frac{{f_{1} \left( {x_{i - 1} } \right) + f_{1} \left( {x_{j} } \right)}}{2}\); then, similarly to the arguments for Proposition 4.1, we consider the affine map \(A_{2} :R \to R^{2}\) such that
\(A_{2} \left( R \right) \bot v\), \(A_{2} (c) = 0 \in R^{2}\).
Now define the label preserving map \(f_{2} = P_{v} \circ \sigma_{R} \circ A_{2}\); on the linear space \(L_{v}\) it has the property
\(f_{2} \circ f_{1} \left( {x_{i - 1} } \right) = f_{2} \circ f_{1} \left( {x_{j} } \right) < f_{2} \circ f_{1} \left( {x_{j + 1} } \right) < \cdots < f_{2} \circ f_{1} \left( {x_{q} } \right)\).
In addition, \(f_{2} \circ f_{1} \left( {x_{k} } \right) < f_{2} \circ f_{1} \left( {x_{i - 1} } \right)\) for \(k = i,...,j - 1\).
Thus we have only one alternative point \(b_{j} = f_{2} \circ f_{1} \left( {x_{i - 1} } \right) = f_{2} \circ f_{1} \left( {x_{j} } \right)\) in the new sequence of points on \(L_{v}\), satisfying \(b_{j} < f_{2} \circ f_{1} \left( {x_{j + 1} } \right) < \cdots < f_{2} \circ f_{1} \left( {x_{q} } \right)\), and the map of the form
\(\overline{\xi } = v \cdot x - p\), with \(\mathop {\max }\limits_{k = i,...,j - 1} f_{2} \circ f_{1} (x_{k} ) < p < b_{j}\),
classifies the new sequence of points. Finally, let \(Net = \overline{\xi } \circ f_{2} \circ f_{1} = \overline{\xi } \circ P_{v} \circ \sigma_{R} \circ A_{2} \circ \sigma_{R} \circ A_{1}\),
which is a desired classifier for \(A_{q}\) and has two hidden layers from \(H_{R}\). □
Now we can obtain a more general statement.
Proposition 4.2
Suppose \(A_{q} \subset R\) can be divided into two labelled subsets \(C_{1}\) and \(C_{2}\). If \(alt\;A_{q} = h\), then there exists a DNN classifier with no more than \(2(h - 1)\) hidden layers from \(H_{R}\), each of width 2.
Proof
It is enough to show that we can construct a DNN net mapping the set \(A_{q}\) to a finite set on \(L_{v}\) having only one alternative point. Let \(x_{i_{1}} ,...,x_{i_{h}} \in A_{q}\) be the alternative points, satisfying \(x_{i_{1}} < \cdots < x_{i_{h}}\). From the previous arguments, it is easy to see that there is a DNN \(Net_{1}\) with no more than two hidden layers from \(H_{R}\) such that
\(Net_{1} \left( {x_{k} } \right) = Net_{1} \left( {x_{i_{2}} } \right),\) \(k = 1,2,...,i_{1} - 1\),
\(Net_{1} \left( {x_{k} } \right) < Net_{1} \left( {x_{i_{2}} } \right),\) \(k = i_{1} ,i_{1} + 1,...,i_{2} - 1\),
and
\(Net_{1} (x_{i_{2}} ) < Net_{1} (x_{i_{2} + 1} ) < \cdots < Net_{1} (x_{q} )\).
Therefore we have a new image sequence of points on \(L_{v}\) with \(h - 1\) alternative points. Continuing in this way by constructing a new \(Net_{2}\) with at most two hidden layers from \(H_{R}\), we obtain again a new sequence of points on \(L_{v}\) with \(h - 2\) alternative points. After a series of constructions of nets \(Net_{3} ,...,Net_{h - 1}\), each having at most two hidden layers from \(H_{R}\), we finally have a sequence of points with only one alternative point. Now it is easy to see that the composition map \(Net = Net_{h - 1} \circ \cdots \circ Net_{2} \circ Net_{1}\)
has the property that \(Net\left( {A_{q} } \right)\) is a sequence of points on \(L_{v}\) having only one alternative point. □
Remark
In view of the definition of DNNs in (1) with \(F_{i} (x) = \sigma_{R} \circ A_{i} (x)\), we can rewrite every hidden layer as \(A_{i} (x) = W_{i} \circ P_{v} x + b_{i}\), so that the net constructed above can be expressed in the form (1) with \(A_{i} (x) = \overline{W}_{i} x + b_{i}\), \(\overline{W}_{i} = W_{i} \circ P_{v}\).
Note that \(A_{q}\) has at most \(q - 1\) alternative points. If \(A_{q}\) has exactly \(q - 1\) alternative points, then no point of \(A_{q}\) lies between two successive alternative points. In view of Proposition 4.1, one hidden layer from \(H_{R}\) is enough to eliminate one alternative point at a time from the left. If there are \(h\) points between two successive alternative points, then \(A_{q}\) has at most \(q - 1 - h\) alternative points. Taking these observations into account, we have the following fact.
Theorem 4.2
Suppose \(A_{q}\) can be divided into two labelled subsets \(C_{1}\) and \(C_{2}\). Then there exists a DNN classifier with no more than \(q - 1\) width-two hidden layers from \(H_{R}\).
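The reduction behind this theorem can be simulated numerically. The sketch below is an illustrative reimplementation under our own parametrization, not the paper's literal maps: each round applies a ramp \(t \mapsto \max(t-a, 0)\), collapsing the leftmost same-label block to 0, followed by a fold \(t \mapsto |t-c|\), which merges that block with the next block of the same label; every round removes one alternative point, and a single threshold then finishes the job. Both the ramp and the fold are realizable with the width-2 ReLU layers used above.

```python
import numpy as np

def classify_line(points, labels):
    """Build a classifier for distinct labelled points on the real line by
    iteratively reducing the alternative number (illustrative sketch)."""
    t = np.asarray(points, dtype=float)
    lab = list(labels)

    def changes(vals):
        idx = np.argsort(vals, kind="stable")
        ls = [lab[i] for i in idx]
        return [k for k in range(1, len(ls)) if ls[k] != ls[k - 1]], idx

    folds = []
    while True:
        pos, idx = changes(t)
        if len(pos) <= 1:
            break
        vs = t[idx]
        i1, i2 = pos[0], pos[1]       # first two alternative points
        a = vs[i1 - 1]                # ramp offset: left plateau collapses to 0
        c = (vs[i2] - a) / 2.0        # fold centre: x_{i1-1} and x_{i2} coincide
        t = np.abs(np.maximum(t - a, 0.0) - c)
        folds.append((a, c))

    pos, idx = changes(t)
    vs, ls = t[idx], [lab[i] for i in idx]
    if not pos:                       # only one label present
        return lambda x: ls[0]
    k = pos[0]
    thr = (vs[k - 1] + vs[k]) / 2.0   # one alternative point: threshold suffices
    lo, hi = ls[k - 1], ls[k]

    def predict(x):
        for a, c in folds:            # replay the hidden layers, then threshold
            x = abs(max(x - a, 0.0) - c)
        return hi if x > thr else lo
    return predict

pred = classify_line([0.0, 1.0, 2.0, 3.0], [1, 2, 1, 2])
assert [pred(x) for x in [0.0, 1.0, 2.0, 3.0]] == [1, 2, 1, 2]
```

A fully alternating sequence of \(q\) points needs \(q - 2\) rounds here, in line with the \(q - 1\) bound of the theorem.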
5 The Main Result and Examples
Now we can state the main result of this paper.
Theorem 5.1
Consider the finite set \(D_{q} = \{ x_{1} ,...,x_{q} \} \subset R^{n}\) that can be divided into two labelled subsets \(C_{1}\) and \(C_{2}\). Then there exists a DNN classifier \(Net\) with no more than \(q - 1\) width-two hidden layers.
Proof
From Theorem 3.1, it is easy to see that there is a linear map \(L:R^{n} \to R\) such that \(y_{i} = L(x_{i} ) \ne y_{j} = L(x_{j} )\) if \(i \ne j\). Now consider the points \(y_{1} ,y_{2} ,...,y_{q}\) on \(R\) and apply Theorem 4.2. □
To illustrate the above theorem, we consider some toy examples, hoping to provide insight for practitioners designing DNNs for practical problems.
Examples
To illustrate the above theory we provide two toy examples.
Example 1
Let \(D_{4} = \{ x_{1} ,x_{2} ,x_{3} ,x_{4} \} \subset R^{2}\), where
\(x_{1} = (1,1)^{T}\), \(x_{2} = ( - 1,1)^{T}\), \(x_{3} = ( - 1, - 1)^{T}\), \(x_{4} = (1, - 1)^{T}\),
as shown in Fig. 1
The binary classification for the set in Fig. 1 is formulated as follows: \(x_{i}\) is labeled 1 if and only if exactly one of its coordinate components is 1; otherwise \(x_{i}\) is labeled 2. Now we want to design a DNN net to classify this set. Note that any projection \(P:R^{2} \to L\) onto a line \(L\) that is not a coordinate axis is label preserving and has the property:
\(P(x_{2} )\) and \(P(x_{4} )\) lie between \(P(x_{1} )\) and \(P(x_{3} )\).
Or.
\(P(x_{1} )\) and \(P(x_{3} )\) lie between \(P(x_{2} )\) and \(P(x_{4} )\).
This shows that the sequence of points \(P(x_{1} )\), \(P(x_{2} )\), \(P(x_{3} )\), \(P(x_{4} )\) has just two alternative points. Choosing the projection \(P_{u}\) below, under which \(P_{u} (x_{2} ) = P_{u} (x_{4} )\), the image consists of three points with two alternative points, so a DNN net with one hidden layer from \(H_{R}\) can classify \(D_{4}\), in view of Proposition 4.1.
There are numerous narrow DNNs able to implement this toy classification problem, as long as they are designed as described in the arguments of Proposition 4.1 and Lemma 4.1. For instance, one such neural net can be designed as follows.
Let \(P_{u} :R^{2} \to L_{u}\) be the projection map, where \(L_{u} \subset R^{2}\) is the linear space spanned by \(u\) with \(u = \frac{1}{\sqrt 2 }(1,1)\); then \(P_{u} (x) = < x,u > u\).
Now let \(A(x) = Wx + b\), for instance with \(W = \left[ {\begin{array}{*{20}c} 1 & 1 \\ { - 1} & { - 1} \\ \end{array} } \right]\), \(b = ( - 1, - 1)^{T}\).
Then let \(Net(x) = P_{u} \circ \sigma_{R} \circ A \circ P_{u} (x) = P_{u} \circ F_{1} (x)\); it is easy to verify that
\(Net(x_{1} ) = Net(x_{3} ) = \frac{1}{\sqrt 2 }u\) and \(Net(x_{2} ) = Net(x_{4} ) = 0\).
We can construct a linear function to separate them: because \(\sigma_{R} \circ A \circ P_{u} (D_{4} ) = F_{1} (D_{4} )\) is separable by a linear function, say \(\eta (y) = y_{1} + y_{2} - \tfrac{1}{2}\) defined on \(R^{2}\), the net \(Net = \eta \circ F_{1}\) classifies \(D_{4}\). It has only one hidden layer, \(F_{1} = \sigma_{R} \circ A \circ P_{u}\).
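The one-hidden-layer net of Example 1 can be verified numerically. In the sketch below, the affine map \(W\), \(b\) is one admissible choice (an assumption on our part) consistent with \(Net(x_{1}) = Net(x_{3}) = u/\sqrt{2}\) and \(Net(x_{2}) = Net(x_{4}) = 0\), and the linear read-out \(\eta(y) = y_{1} + y_{2} - \tfrac{1}{2}\) is likewise one valid choice:

```python
import numpy as np

u = np.array([1.0, 1.0]) / np.sqrt(2.0)
P_u = np.outer(u, u)                       # projection onto the line L_u
W = np.array([[1.0, 1.0], [-1.0, -1.0]])   # assumed choice of the affine map A
b = np.array([-1.0, -1.0])

def F1(x):                                 # hidden layer sigma_R o A o P_u
    return np.maximum(W @ (P_u @ x) + b, 0.0)

X = [np.array(p, float) for p in [(1, 1), (-1, 1), (-1, -1), (1, -1)]]
assert np.allclose(P_u @ F1(X[0]), u / np.sqrt(2.0))   # Net(x1) = u / sqrt(2)
assert np.allclose(P_u @ F1(X[1]), 0.0)                # Net(x2) = 0
labels = [2 if F1(x).sum() - 0.5 > 0 else 1 for x in X]  # read-out eta
assert labels == [2, 1, 2, 1]              # x1, x3 label 2; x2, x4 label 1
```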
In the theorems above, the proofs rest on the existence, in the first layer of a DNN net, of a linear dimension reduction map that is injective on the considered data set. The number of alternative points of the image set under this first-layer linear map depends on which linear map we select; consequently, different linear dimension reduction maps give rise to DNN nets of different depths for the binary classification. We demonstrate this point in the following example.
Example 2
Consider the set \(D_{4} = \{ x_{1} ,x_{2} ,x_{3} ,x_{4} \} \subset R^{2}\), with
\(x_{1} = (1,1)^{T}\), \(x_{2} = ( - 1,1)^{T}\), \(x_{3} = ( - 2, - 1)^{T}\), \(x_{4} = (2, - 1)^{T}\),
as illustrated in Fig. 2.
In this example, the binary classification for this set is formulated as follows: \(x_{i}\) is labeled 1 if and only if exactly one of its coordinate components is positive; otherwise \(x_{i}\) is labeled 2.
It is clear that the projection map from \(R^{2}\) to its \(X\)-axis is label preserving. Denote it by \(P_{X}\); then we have \(P_{X} (x_{3} ) < P_{X} (x_{2} ) < P_{X} (x_{1} ) < P_{X} (x_{4} )\). This sequence has 3 alternative points. It follows from the arguments of Sect. 4 that we can have a classifier with two width-2 hidden layers from \(H_{R}\). Now we provide a concrete classifier.
Let \(P_{X} :R^{2} \to L_{X}\) be the projection map from \(R^{2}\) to the \(X\)-axis. Then
\(y_{1} = P_{X} (x_{1} ) = (1,0)^{T}\), \(y_{2} = P_{X} (x_{2} ) = ( - 1,0)^{T}\),
\(y_{3} = P_{X} (x_{3} ) = ( - 2,0)^{T}\), \(y_{4} = P_{X} (x_{4} ) = (2,0)^{T}\).
Define the affine map \(A_{1} (x) = W_{1} x + b_{1}\) with
\(W_{1} = \left[ {\begin{array}{*{20}c} 1 & 1 \\ { - 1} & 1 \\ \end{array} } \right]\), \(b_{1} = \left[ {\begin{array}{*{20}c} { - 0.5} \\ {0.5} \\ \end{array} } \right]\).
Then it is easy to verify the following.
\(\sigma_{R} \circ A_{1} (y_{1} ) = (0.5,0)^{T}\), \(\sigma_{R} \circ A_{1} (y_{2} ) = (0,1.5)^{T}\),
\(\sigma_{R} \circ A_{1} (y_{3} ) = (0,2.5)^{T}\), \(\sigma_{R} \circ A_{1} (y_{4} ) = (1.5,0)^{T}\).
Let \(P_{v} :R^{2} \to L_{v}\) be the linear mapping defined as \(P_{v} (x) = < x,v > v\) with \(L_{v}\) being the linear space spanned by \(v\), \(v = (1,1)\). Then.
\(z_{1} = P_{v} \circ \sigma_{R} \circ A_{1} (y_{1} ) = (0.5,0.5)^{T}\), \(z_{2} = P_{v} \circ \sigma_{R} \circ A_{1} (y_{2} ) = (1.5,1.5)^{T}\).
\(z_{3} = P_{v} \circ \sigma_{R} \circ A_{1} (y_{3} ) = (2.5,2.5)^{T}\), \(z_{4} = P_{v} \circ \sigma_{R} \circ A_{1} (y_{4} ) = (1.5,1.5)^{T}\).
Let \(A_{2} (x) = W_{2} x + b_{2}\), where
\(W_{2} = \left[ {\begin{array}{*{20}c} { - 1} & 0 \\ 0 & 1 \\ \end{array} } \right]\), \(b_{2} = \left[ {\begin{array}{*{20}c} {1.5} \\ { - 1.5} \\ \end{array} } \right]\).
It is easy to see that.
\(\sigma_{R} \circ A_{2} (z_{1} ) = (1,0)^{T}\), \(\sigma_{R} \circ A_{2} (z_{2} ) = \sigma_{R} \circ A_{2} (z_{4} ) = (0,0)^{T}\), \(\sigma_{R} \circ A_{2} (z_{3} ) = (0,1)^{T}\).
Now it is obvious that the above three labeled image points are separable by a linear function, say \(\eta\) defined by
\(\eta (y) = y_{1} + y_{2} - \tfrac{1}{2}\),
which is positive at \((1,0)^{T}\) and \((0,1)^{T}\) (label 2) and negative at \((0,0)^{T}\) (label 1).
Then the net defined by
\(Net = \eta \circ \sigma_{R} \circ A_{2} \circ P_{v} \circ \sigma_{R} \circ A_{1} \circ P_{X}\)
is a desired classification DNN net with two hidden layers.
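Example 2 can be verified end to end; the sketch below uses exactly the matrices \(W_{1}\), \(b_{1}\), \(W_{2}\), \(b_{2}\) and projections from the text, with the read-out \(\eta(y) = y_{1} + y_{2} - \tfrac{1}{2}\) as one valid choice of linear separator:

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)
v = np.array([1.0, 1.0])
P_v = lambda x: np.dot(x, v) * v            # P_v(x) = <x, v> v, as in the text
P_X = lambda x: np.array([x[0], 0.0])       # projection onto the X-axis

W1 = np.array([[1.0, 1.0], [-1.0, 1.0]]); b1 = np.array([-0.5, 0.5])
W2 = np.array([[-1.0, 0.0], [0.0, 1.0]]); b2 = np.array([1.5, -1.5])

def hidden(x):
    # two hidden layers: sigma_R(A2 .) o P_v o sigma_R(A1 .) o P_X
    return relu(W2 @ P_v(relu(W1 @ P_X(x) + b1)) + b2)

X = [np.array(p, float) for p in [(1, 1), (-1, 1), (-2, -1), (2, -1)]]
imgs = [hidden(x) for x in X]
assert np.allclose(imgs[0], [1.0, 0.0]) and np.allclose(imgs[2], [0.0, 1.0])
labels = [2 if im.sum() - 0.5 > 0 else 1 for im in imgs]  # eta = y1 + y2 - 1/2
assert labels == [2, 1, 2, 1]
```

The intermediate images reproduce the values computed step by step in the text.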
However, consider another projection map. For instance, take the vector \(v = x_{2} - x_{4}\) and let \(L^{v} = \{ x \in R^{2} : < x,v > = 0\}\); then the projection \(P^{v} :R^{2} \to L^{v}\) is also label preserving and satisfies \(P^{v} (x_{2} ) = P^{v} (x_{4} )\).
The new sequence of points thus obtained has 2 alternative points, so we can have a classifier with one width-2 hidden layer from \(H_{R}\). The reader can construct a specific classifier by following the procedure above.
6 The Size and Capacity of DNNs in Finite Set Binary Classification
First let us set up some notions and notation. Consider the net
\(Net^{R} = F_{out} \circ F_{k} \circ \cdots \circ F_{1}, \quad F_{i} \in H_{R}.\)  (6.1)
Let \(w_{i}\) be the width of the hidden layer \(F_{i}\), \(i = 1,2,...,k\). According to [1], the size \(Si\) of the above DNN is defined as follows.
Definition 6.1
The size of (6.1) is \(Si(Net^{R} ) = \sum_{i = 1}^{k} w_{i}\).
Definition 6.2
The parameter capacity \(Cp(Net^{R} )\) of \(Net^{R}\) is defined to be the number of parameters of (6.1) that are adjustable, that is, the dimension of the space of adjustable parameters.
Now in view of the arguments in Sects. 4 and 5, we have the following observations.
Theorem 6.1
Consider the finite set \(D_{q} = \{ x_{1} ,...,x_{q} \} \subset R^{n}\) that can be divided into two labelled subsets \(C_{1}\) and \(C_{2}\). Then there exists a class of DNN classifiers \(Net^{R}\) satisfying the following inequalities:
\(Si(Net^{R} ) \le 2(q - 1)\)
and \(Cp(Net^{R} ) \le (n - 1) + (q - 1)\).
Proof
The affine map \(A_{1} (x)\) in the first hidden layer \(F_{1} (x)\) is in fact a projection and can be written as
\(A_{1} (x) = < x,v > v = P_{v} x\),
with \(v \in S^{n - 1}\) being the adjustable vector. Note that every weight matrix in \(A_{i} (x) = W_{i} x + b_{i}\) for \(i = 2,...,k\) can be written as
\(W_{i} = \left[ {\begin{array}{*{20}c} { - 1} & 0 \\ 0 & 1 \\ \end{array} } \right]P_{v}\),
and the bias \(b_{i}\) can be written as
\(b_{i} = \zeta_{i} ( - 1,1)^{T}\), \(\zeta_{i} \in R\), \(i = 1,2,...,q - 1\).
Then it is easy to see that the statement holds.□
7 Summary
In this paper we have proved that for a labelled finite set in a Euclidean space, it is always possible to construct a width-2 DNN net that accomplishes the binary classification task. The remarkable fact obtained in this paper is that the constructed neural net can have a number of hidden layers not larger than the cardinality of the finite labelled set, with every hidden layer of width two. We hope that this fact sheds new light on the relation of “width” and “depth” to the design of DNN nets for classification problems.
In addition, in designing the above classifying nets, we have seen how the geometric images of the data set change in the internal representations across the depth of the network, thus making the “black box” of this kind of DNN net transparent.
Examples provided in Sect. 5 show that the projection mapping in the first layer affects the depth of the designed DNN net. This fact reminds us that one should carefully choose the initial hidden layer when designing practical DNN nets.
Moreover, from the learning perspective, one usually divides a dataset into a training set and a testing set; whether a DNN designed on the training set can generalize to the testing set remains a hard practical problem to be investigated. We will touch on this problem in a forthcoming paper.
Because the purpose of the present paper is to shed new light on what size of DNN net, in terms of “width” and “depth”, is capable of classifying a finite dataset with two labels, we hope the existence results and examples in this paper may be of help to practitioners coping with practical problems.
It should be stressed that the approach proposed in this paper is not practical for real-world classification problems, because real datasets are usually contaminated; besides, two data points cannot be discriminated by DNNs in practice if they are too close to each other. However, the results of the present paper can shed light on the mechanism of how DNNs work and contribute to the interpretability of DNNs, which may inspire people to find more economical approaches to the design and development of new architectures of DNNs for real-world problems.
References
Arora R, Basu A, Mianjy P, Mukherjee A, Understanding deep neural networks with rectified linear units. arXiv preprint arXiv:1611.01491, 2016.
Beneventano P, Cheridito P, Graeber R, Jentzen A, Kuckuck B , Deep neural network approximation theory for high-dimensional functions, arXiv:2112.14523 [math.NA], 2021.
Cybenko G (1989) Approximation by superpositions of a sigmoidal function. Math Control, Signals Syst 2(4):303–314
Hanin B, 2019 Universal Function Approximation by Deep Neural Nets with Bounded Width and ReLU Activations. Mathematics, 7(10):992
Hanin B, Sellke M, Approximating continuous functions by ReLU nets of minimal width. arXiv preprint arXiv:1710.11278, 2017.
Hornik K (1991) Approximation capabilities of multilayer feedforward networks. Neural Netw 4(2):251–257
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444
Kawaguchi K, Deep learning without poor local minima. arXiv preprint arXiv:1605.07110, 2016.
Mhaskar HN, Poggio T (2016) Deep versus shallow networks: an approximation theory perspective. Anal Appl 14(06):829–848
Montufar GF, Pascanu R, Cho K., and Bengio Y, 2014. On the number of linear regions of deep neural networks. In Advances in neural information processing systems, pages 2924–2932
Rolnick D, Tegmark M, The power of deeper networks for expressing natural functions. arXiv preprint arXiv:1705.05502, 2017.
Serra T, Tjandraatmadja C, Ramalingam S, Bounding and Counting Linear Regions of Deep Neural Networks, arXiv:1711.02114 [cs.LG], 2018
Telgarsky M, Benefits of depth in neural networks, JMLR: Workshop and Conference Proceedings vol 49:1–23, 2016
Telgarsky M, Representation benefits of deep feedforward networks. arXiv preprint, arXiv:1509.08101, 2015.
Author information
Contributions
I am the single author.
Ethics declarations
Competing interests
The authors declare no competing interests.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Yang, XS. A Geometric Theory for Binary Classification of Finite Datasets by DNNs with Relu Activations. Neural Process Lett 56, 155 (2024). https://doi.org/10.1007/s11063-024-11612-1