1 Introduction

Despite the striking success of deep neural networks in applications [7], they remain to some extent a collection of techniques, a “Jack of all trades” played as black boxes [1, 7, 8]. An outstanding challenge is therefore to understand the working mechanism of deep neural networks (DNNs) and to develop a solid mathematical framework that explains both the effectiveness and the failures of these networks, which are, in general, sequential compositions of functions. For a given DNN, knowing how a dataset is transformed by each layer is of great significance for understanding how the network works. This remains a major challenge, and in the present paper we take a step forward along this line of research in the setting of binary classification, with the motivation that such research could shed light on interpreting unexpected failures such as shortcut learning in DNNs.

Up to now, the related work has mainly focused on the so-called expressivity [2,3,4,5,6, 9, 10, 12,13,14], that is, the ability of DNNs to produce accurate function approximations, studied from the perspective of approximation theory. Some research also investigates the complexity of DNNs [10, 12] in terms of decision regions and decision boundaries, giving new insight into the advantage of depth for neural networks with piecewise linear activation functions. Recently, more attention has been paid to the connection of width and depth with expressivity [5, 6, 12], i.e., the ability of DNNs to approximate continuous functions. In particular, upper and lower bounds on the widths needed for a DNN to approximate continuous functions were obtained in [5, 6].

Unlike the work above, our motivation is to understand the working mechanism of DNNs geometrically. We are therefore more concerned with the geometric properties of the data in the internal representations across the depth of the network and with the dynamics of the data transformation through the layers, and we try to answer general questions such as: how are width and depth related to a neural net performing a classification task, and how is the dataset from the input space changed, layer by layer, in the hidden layers?

Because of the sequential composition structure of deep neural networks, it is helpful to investigate them from a mapping perspective. The purpose of this paper is to provide a model demonstrating, from this perspective, how a narrow DNN can carry out a binary classification task for a finite dataset. The model gives a geometric demonstration of how a dataset is transformed across the layers of a DNN during classification, which may be helpful in understanding phenomena such as shortcut learning.

In this paper we are interested in the following question.

Let \(D_{q} \subset R^{n}\) be a dataset containing \(q\) points with two labels, such that \(D_{q} = C_{1} \cup C_{2}\), where \(C_{1}\) is the subset of \(D_{q}\) labelled 1 and \(C_{2}\) is the subset of \(D_{q}\) labelled 2.

Question Can we design a deep neural network, with size as small as possible, that classifies the points in the above data set?

This question is of practical significance and is also a theoretical challenge from a mathematical point of view.

In many applications, deep learning uses a feedforward neural network architecture, a composition of a sequence of maps, to learn a mapping from an input dataset to an output dataset, with the final labelling implemented by a simple classifying function. The basic idea is as follows. To go from one layer to the next, a weighted sum of the inputs from the previous layer is computed and the result is passed through a non-linear function, called the activation function. The most popular activation function is the rectified linear unit (ReLU), which is the one considered in the present paper. A composing map that is not the input or output layer is conventionally called a hidden layer. The hidden layers play the essential role of distorting the input space in a non-linear way so that the categories become linearly separable by the last layer; this is the point toward which all the arguments in the main body of this paper are directed.

In this paper we investigate deep neural networks for binary classification of datasets from a geometric perspective. By establishing a geometric result on the injectivity of a projection from Euclidean space to the real line restricted to a finite set, and by introducing the notions of alternative points and alternative number, we prove the existence of binary classification DNNs whose hidden layers have width two and whose number of hidden layers is not larger than the cardinality of the finite labelled set. To the knowledge of the author, this is a noticeable fact.

As shown in the sequel, the proof also demonstrates geometrically how the dataset is transformed across every hidden layer of a DNN performing binary classification of finitely many labeled points in a Euclidean space. Moreover, we provide models illustrating how a narrow DNN with width-two hidden layers can carry out the binary classification task for finite datasets.

We hope that the results in the present paper shed light on the mechanism of how DNNs work and contribute to the interpretability of DNNs, and that they may inspire more economical approaches to the design and development of new DNN architectures, despite not being practical for real-world datasets.

We first recall in this section some preliminaries for deep neural networks (DNNs) with ReLU activations. A typical nonlinear activation function is the rectified linear unit, defined as

$$ \sigma_{R} (x) = {\text{ReLU}}(x) = \max \{ x,0\} , \qquad x \in R $$

In vector form, we have the map \(\sigma_{R} :R^{n} \to R^{n}\), defined componentwise as

$$ \sigma_{R} \left( x \right) = \left( \sigma_{R} \left( x_{1} \right), \ldots ,\sigma_{R} \left( x_{n} \right) \right)^{T} , \qquad x = \left( x_{1} , \ldots ,x_{n} \right)^{T} \in R^{n} $$

In DNNs, a function defined in a hidden layer can be written as

$$ F:R^{n} \to R^{m} $$
$$ F\left( x \right) = \sigma_{R} \circ A\left( x \right) $$

A deep neural network with \(\sigma_{R}\) activation is a continuous map \(Net:R^{n} \to R^{m}\), \(n > m\), that is the composition of a finite sequence of functions and has the form

$$ Net^{R} (x) = F_{out} \circ F_{k} \circ F_{k - 1} \circ \cdots \circ F_{1} \left( x \right) = F_{out} \circ \sigma_{R} \circ A_{k} \circ \sigma_{R} \circ A_{k - 1} \circ \cdots \circ \sigma_{R} \circ A_{1} \left( x \right) $$
(1)
$$ F_{i} :R^{{n_{i} }} \to R^{{n_{i + 1} }} , \qquad i = 1,...,k, \qquad {\text{with}}\;n_{1} = n, \qquad F_{out} :R^{{n_{k + 1} }} \to R^{m} $$
$$ F_{i} (x) = \sigma_{R} \circ A_{i} \left( x \right) $$

\(A_{i} \left( x \right) = W_{i} x + b_{i}\), \(W_{i} \in M_{{m_{i} \times n_{i} }}\), \(b_{i} \in R^{{m_{i} }}\), where \(F_{out}\) is a classifying function as defined in the next section.
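As a concrete illustration of the composition in (1), the following NumPy sketch evaluates a ReLU network of the form \(F_{out} \circ \sigma_{R} \circ A_{k} \circ \cdots \circ \sigma_{R} \circ A_{1}\). The layer shapes, the random weights and the particular \(F_{out}\) are arbitrary placeholders, not values taken from the paper.

```python
import numpy as np

def relu(x):
    # sigma_R applied componentwise
    return np.maximum(x, 0.0)

def relu_net(layers, f_out):
    """layers: list of (W_i, b_i); implements F_out o sigma_R o A_k o ... o sigma_R o A_1."""
    def net(x):
        for W, b in layers:
            x = relu(W @ x + b)   # sigma_R o A_i
        return f_out(x)
    return net

# toy instance: two width-2 hidden layers on R^3, a simple thresholding F_out
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((2, 3)), rng.standard_normal(2)),
          (rng.standard_normal((2, 2)), rng.standard_normal(2))]
net = relu_net(layers, f_out=lambda z: np.sign(z.sum() - 0.5))
print(net(np.array([1.0, -2.0, 0.5])))
```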

Definition 1.1 Define the hidden layer space \(H_{R}\) as follows. A map \(f:R^{n} \to R^{m}\) satisfies \(f \in H_{R}\) if and only if

$$ f\left( x \right) = \sigma_{R} \circ A\left( x \right) $$
$$ A\left( x \right) = Wx + b, \qquad W \in M_{m \times n} , \quad x \in R^{n} , \quad b \in R^{m} , \quad n > 0 $$
(2)

The width of the hidden layer map \(f\) is defined to be \(m\), the dimension of \(R^{m}\); in terms of neural networks, this is the number of neurons in the hidden layer. The depth of the net \(Net^{R}\) is defined to be the number of layers taken from \(H_{R}\), i.e., the number of hidden layers.

According to [1], the size \(Si\) of the above DNN is defined as follows.

Definition 1.2

The size of (1) is

$$ Si\left( {Net^{R} } \right) = w_{1} + w_{2} + \cdots + w_{k} , $$

where \(w_{i}\) denotes the width of the hidden layer \(F_{i}\), \(i = 1, \ldots ,k\).

Definition 1.3

The parameter capacity \(Cp\left( {Net^{R} } \right)\) of \(Net^{R}\) is defined to be the number of parameters of (1) that are adjustable.

Before developing the main theory we set up some notations.

$$ X_{1}^{ + } = \left\{ {x = \left( {x_{1} ,x_{2} } \right) \in R^{2} :x_{1} \ge 0,x_{2} = 0} \right\} $$
$$ X_{2}^{ + } = \left\{ {x = \left( {x_{1} ,x_{2} } \right) \in R^{2} :x_{1} = 0,x_{2} \ge 0} \right\} $$

Let \(P_{v} :R^{2} \to L_{v}\) be the projection map, where \(L_{v} \subset R^{2}\) is the linear space spanned by \(v = \left( {1,1} \right)\).

$$ R_{ + }^{2} = \left\{ {x = \left( {x_{1} ,x_{2} } \right) \in R^{2} :x_{1} \ge 0,x_{2} \ge 0} \right\} $$

\(R_{ - }^{2} = \left\{ {x = \left( {x_{1} ,x_{2} } \right) \in R^{2} :x_{1} \le 0,x_{2} \le 0} \right\}\).

Consider two points \(x = \left( {x_{1} ,x_{2} } \right) \in R^{2}\) and \(y = \left( {y_{1} ,y_{2} } \right) \in R^{2}\); we say that \(x < y\) if \(x_{1} < y_{1}\) and \(x_{2} < y_{2}\).

For the later arguments, we recall some basic properties of ReLU maps.

Consider the two-dimensional ReLU map \(\sigma_{R} :R^{2} \to R^{2}\). We have the following observations.

(1) \(\sigma_{R} \left( {R^{2} } \right) \subseteq R_{ + }^{2}\).

(2) If \(L \subset R^{2}\) is a one-dimensional linear subspace satisfying \(L \cap R_{ + }^{2} = \left\{ 0 \right\}\), then \(\sigma_{R} :L \to \partial R_{ + }^{2}\) is a homeomorphism.

(3) Every nonempty set \(S \subset R_{ - }^{2}\) satisfies \(\sigma_{R} \left( S \right) = \left\{ 0 \right\}\).
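These observations are easy to check numerically. The following sketch samples random points and a line \(L\) spanned by \((1, -2)\) (an arbitrary choice meeting \(R_{+}^{2}\) only at the origin); the helper names are our own.

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)   # sigma_R applied componentwise
rng = np.random.default_rng(1)

# (1) sigma_R maps R^2 into the closed positive quadrant
pts = rng.standard_normal((1000, 2))
assert (relu(pts) >= 0).all()

# (2) on a line L meeting R_+^2 only at 0, sigma_R is injective and maps into the boundary of R_+^2
t = np.linspace(-3.0, 3.0, 13)
line = np.outer(t, np.array([1.0, -2.0]))              # L spanned by (1, -2)
img = relu(line)
assert (np.minimum(img[:, 0], img[:, 1]) == 0).all()   # image lies on the boundary
assert len({tuple(p) for p in img}) == len(t)          # injective on the sampled points

# (3) sigma_R collapses R_-^2 to the origin
assert (relu(-np.abs(pts)) == 0).all()
print("observations verified on samples")
```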

The main results obtained in this paper can be stated as follows.

1.1 The main results

We first prove a geometric fact concerning linear maps on finite sets.

Result 1 Let \(D \subset R^{n}\) be a finite set. There exists a linear map \(L:R^{n} \to R\) such that \(L\) restricted to \(D\) is injective.

Based on this fact we can prove the following facts about DNN nets for the binary classification problem.

Result 2 Let \(D_{q} \subset R^{n}\) be a compact data set containing \(q\) points with two labels. Then there exists a DNN \(Net\) with the following properties.

(1) The hidden layers of \(Net\) are of width 2.

(2) The number of hidden layers of \(Net\) is not larger than \(q - 1\).

(3) \(Net\) classifies the data set \(D_{q}\).

Result 3 For the above labelled set, there exists a class of DNN classifiers, each member \(Net^{R}\) of which satisfies the following inequalities

$$ Si\left( {Net^{R} } \right) \le 2\left( {q - 1} \right)\;{\text{and}}\;Cp\left( {Net^{R} } \right) \le n + q - 1 $$

The rest of this paper is organized as follows. In Sect. 2 we give some preliminaries and introduce the notions of alternative points and alternative number for labeled sets in the real line \(R\). In Sect. 3 we prove a geometric fact connecting finite sets in a general Euclidean space to finite sets in the real line under a linear dimensionality-reduction map. In Sect. 4, based on the alternative number and this geometric fact, we set up a DNN binary classification framework for labeled sets in the real line in the setting of ReLU activation. With these preparations we arrive at the main theorems in Sects. 5 and 6.

2 Alternative Points and Alternative Number of Labeled Set

As in the first section, consider the data set \(D_{q} \subset R^{n}\) containing \(q\) points with two labels, such that \(D_{q} = C_{1} \cup C_{2}\).

Definition 2.1

If there exists a continuous function \(\xi :R^{n} \to R\) such that

\(\xi (x) > 0\) for \(x \in C_{1}\),

\(\xi (x) < 0\) for \(x \in C_{2}\),

then \(\xi\) is called a classifying function. In particular, \(\xi\) is called a linear classifying function if it is a linear function.

Definition 2.2

Consider a compact set \(D \subset R^{n}\) with two labels, such that \(D = C_{1} \cup C_{2}\). A continuous map \(f:R^{n} \to R^{m}\) is said to be label preserving with respect to \(D\) if it satisfies

$$ f\left( {C_{1} } \right) \cap f\left( {C_{2} } \right) = \emptyset $$

Now consider a finite set \(A_{q} = \left\{ {a_{1} ,...,a_{q} } \right\} \subset R\) with \(a_{1} < a_{2} < \cdots < a_{q}\).

Definition 2.3

Suppose \(A_{q}\) can be divided into two labelled subsets \(C_{1}\) and \(C_{2}\). If \(a_{i}\) and \(a_{i - 1}\) belong to different labelled subsets (one to \(C_{1}\) and the other to \(C_{2}\)), then \(a_{i}\) is called an alternative point of \(A_{q}\). The number of alternative points of \(A_{q}\) is called the alternative number of \(A_{q}\) and is denoted by \(alt\;A_{q}\).

Example

Consider \(A = \left\{ {a_{1} ,a_{2} ,a_{3} } \right\} \subset R\). If \(a_{1} ,a_{2} \in C_{1}\) and \(a_{3} \in C_{2}\), then \(A\) has one alternative point \(a_{3}\), and \(alt\;A = 1\). If \(a_{1} ,a_{3} \in C_{1}\) and \(a_{2} \in C_{2}\), then \(A\) has two alternative points \(a_{2}\) and \(a_{3}\), and \(alt\;A = 2\).
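The alternative number is straightforward to compute: sort the points and count the label changes between consecutive points. A minimal sketch (the function name `alternative_number` is our own):

```python
import numpy as np

def alternative_number(points, labels):
    """Count label changes between consecutive points of a labelled finite subset of R."""
    order = np.argsort(points)
    lab = np.asarray(labels)[order]
    return int(np.sum(lab[1:] != lab[:-1]))

# the two configurations of the example above
print(alternative_number([1.0, 2.0, 3.0], [1, 1, 2]))  # -> 1
print(alternative_number([1.0, 2.0, 3.0], [1, 2, 1]))  # -> 2
```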

The following fact is obvious.

Proposition 2.1 For any \(A_{q} = \left\{ {a_{1} ,...,a_{q} } \right\} \subset R\), we have \(alt\;A_{q} \le q - 1\).

3 A Geometric Fact Concerning Linear Map

First we give a geometric result concerning a linear map defined on a finite set.

Theorem 3.1

Let \(D_{q} = \left\{ {x_{1} ,...,x_{q} } \right\} \subset R^{n}\) be a finite set. There exists a linear map \(L:R^{n} \to R\) such that \(L\) restricted to \(D_{q}\) is injective.

Proof

Consider \(D_{q} \times D_{q}\) and let

$$ D_{\Delta } = D_{q} \times D_{q} - \Delta ,\qquad \Delta = \left\{ {\left( {x,x} \right): x \in D_{q} } \right\} $$
(3)

For each pair \(\left( {x_{i} ,x_{j} } \right) \in D_{\Delta }\), it is clear that \(v_{ij} = x_{i} - x_{j} \ne 0\). Now consider the \(n - 1\) dimensional linear subspace \(S_{ij}\) defined by \(v_{ij}\):

\(S_{ij} = \left\{ {x \in R^{n} : \left\langle x, v_{ij} \right\rangle = 0} \right\}\), where \(\left\langle \cdot , \cdot \right\rangle\) is the Euclidean inner product.

Let

$$ S = \mathop \cup \limits_{i < j} S_{ij} $$

Then the Lebesgue measure of \(S\) in \(R^{n}\) is zero, since \(S\) is a finite union of hyperplanes, so \(R^{n} - S\) is nonempty. Taking any vector \(u \in R^{n} - S\), we have

$$ \left\langle u, v_{ij} \right\rangle = \left\langle u, x_{i} - x_{j} \right\rangle \ne 0 $$

It follows that the linear map defined by \(L\left( x \right) = \left\langle u, x \right\rangle\) is injective on \(D_{q}\). □

Remark

The argument in the above proof shows that if \(D \subset R^{n}\) consists of countably many points, the same statement still holds.
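Theorem 3.1 says that a generic direction already separates the finitely many points, since the union \(S\) of hyperplanes has measure zero. A quick numerical sketch (the sample set and the random direction are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
D = rng.standard_normal((50, 5))         # a finite set of 50 points in R^5
u = rng.standard_normal(5)
u /= np.linalg.norm(u)                   # a generic direction u, almost surely not in S

values = D @ u                           # L(x) = <u, x>
# with probability one all values are distinct, i.e. L restricted to D is injective
print(len(np.unique(values)) == len(D))  # True
```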

4 A Binary Classification Theory for Labeled Sets in Real Line

For the reader’s convenience in grasping the main idea of the subsequent arguments, we first consider the simplest case, where \(D_{3} = \left\{ {x_{1} ,x_{2} ,x_{3} } \right\} \subset R^{1}\).

Proposition 4.1

For \(D_{3} = \{ x_{1} ,x_{2} ,x_{3} \} \subset R^{1}\), we have the following observations.

If \(alt\;D_{3} = 1\), then it is trivial to have a linear classifier.

If \(alt\;D_{3} = 2\), then we can have a classifying net with one width-two hidden layer from \(H_{R}\). Without loss of generality, assume that \(\left\{ {x_{1} ,x_{3} } \right\} \subset C_{1}\) and \(x_{2} \in C_{2}\).

Proof

Since \(x_{1} < x_{2} < x_{3}\), we may let \(c = \tfrac{{x_{1} + x_{3} }}{2}\). Then we consider an affine map \(A_{1} :R \to R^{2}\) such that

\(A_{1} \left( R \right) \bot v\) and \(A_{1} \left( c \right) = 0 \in R^{2}\).

Now define \(\sigma_{R} \circ A_{1} :R \to R^{2}\); then \(\sigma_{R} \circ A_{1} \left( R \right) \subseteq \partial R_{ + }^{2}\). It is easy to see that, on the linear space \(L_{v}\), the map \(f_{1} = P_{v} \circ \sigma_{R} \circ A_{1}\) is label preserving and has the property

$$ f_{1} \left( {x_{2} } \right) < f_{1} \left( {x_{1} } \right) = f_{1} \left( {x_{3} } \right) $$

It is obvious that we can define a linear map \(\overline{\xi }:R^{2} \to R\) that classifies \(\left\{ {x_{1} ,x_{3} } \right\}\) and \(x_{2}\). For instance, define

\(\overline{\xi }\left( x \right) = \left\langle v, x \right\rangle - p\), with \(\left\langle v, f_{1} \left( {x_{2} } \right) \right\rangle < p < \left\langle v, f_{1} \left( {x_{1} } \right) \right\rangle\).

Define the map \(\xi = \overline{\xi } \circ P_{v}\); then the net \(Net = \xi \circ \sigma_{R} \circ A_{1} = \overline{\xi } \circ P_{v} \circ \sigma_{R} \circ A_{1}\) is the desired DNN. □
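A minimal numerical sketch of this construction (the sample values \(x_{1}, x_{2}, x_{3}\), the embedding of \(R\) into \(R^{2}\) along the direction \((1, -1)\), and the helper names are our own choices):

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)
v = np.array([1.0, 1.0]) / np.sqrt(2.0)         # direction spanning L_v
P_v = lambda z: (z @ v) * v                     # projection onto L_v

x1, x2, x3 = -1.0, 0.5, 2.0                     # x1, x3 in C_1 and x2 in C_2
c = 0.5 * (x1 + x3)

A1 = lambda t: np.array([t - c, -(t - c)])      # affine map R -> R^2 with A1(R) perpendicular to v, A1(c) = 0
f1 = lambda t: P_v(relu(A1(t)))                 # P_v o sigma_R o A1: the single width-2 hidden layer

s = lambda t: f1(t) @ v                          # scalar coordinate along L_v
p = 0.5 * (s(x2) + s(x1))                        # threshold between the two classes
net = lambda t: s(t) - p                         # classifying function: > 0 on {x1, x3}, < 0 on {x2}

print([round(net(t), 3) for t in (x1, x2, x3)])  # [0.53, -0.53, 0.53]
```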

Now let us turn to the more general case.

Lemma 4.1

Suppose \(A_{q} = \left\{ {x_{1} ,...,x_{q} } \right\} \subset R\) can be divided into two labelled subsets \(C_{1}\) and \(C_{2}\). If \(alt\;A_{q} = 2\), then there exists a DNN classifier with no more than two hidden layers.

Proof

Without loss of generality, suppose \(x_{i}\) and \(x_{j}\),\(i < j\), are two alternative points.

Case 1 \(i = 2,j = 3\). In this case, as shown in Proposition 4.1, we can define a label preserving map \(f_{1} = P_{v} \circ \sigma_{R} \circ A_{1}\) with the property

$$ f_{1} \left( {x_{2} } \right) < f_{1} \left( {x_{1} } \right) = f_{1} \left( {x_{3} } \right) < \cdots < f_{1} \left( {x_{q} } \right) $$

Case 2 \(i > 2,j = i + 1\). First define an affine map \(A_{1} :R \to R^{2}\) such that \(A_{1} \left( R \right) \subseteq L_{v}\) and

\(A_{1} \left( {x_{k} } \right) \in R_{ - }^{2}\) for \(k = 1,2,...,i - 2\), \(A_{1} \left( {x_{i - 1} } \right) = 0 \in R^{2}\), and \(A_{1} \left( {x_{k} } \right) \in R_{ + }^{2}\) for \(k \ge i\). Then the map \(f_{1} = \sigma_{R} \circ A_{1}\) is label preserving and has the property that \(f_{1} \left( {x_{k} } \right) = 0\) for \(k = 1,2,...,i - 1\) and \(f_{1} \left( {x_{k} } \right) = A_{1} \left( {x_{k} } \right) \in R_{ + }^{2}\) for \(k \ge i\). Moreover,

$$ f_{1} \left( {x_{i - 1} } \right) < f_{1} \left( {x_{i} } \right) < f_{1} \left( {x_{j} } \right) < \cdots < f_{1} \left( {x_{q} } \right) $$

For this sequence of points on \(L_{v}\), we can apply the procedure of Case 1 to define a label preserving map \(f_{2} = P_{v} \circ \sigma_{R} \circ A_{2}\), so that

$$ f_{2} \circ f_{1} \left( {x_{i} } \right) < f_{2} \circ f_{1} \left( {x_{i - 1} } \right) = f_{2} \circ f_{1} \left( {x_{j} } \right) < f_{2} \circ f_{1} \left( {x_{j + 1} } \right) < \cdots < f_{2} \circ f_{1} \left( {x_{q} } \right) $$

As in Proposition 4.1, we can define a map of the form

\(\overline{\xi }\left( x \right) = \left\langle v, x \right\rangle - p\), with \(\left\langle v, f_{2} \circ f_{1} \left( {x_{i} } \right) \right\rangle < p < \left\langle v, f_{2} \circ f_{1} \left( {x_{j} } \right) \right\rangle\).

Then \(Net = \xi \circ \sigma_{R} \circ A_{2} \circ \sigma_{R} \circ A_{1} = \overline{\xi } \circ P_{v} \circ \sigma_{R} \circ A_{2} \circ \sigma_{R} \circ A_{1}\), with \(\xi = \overline{\xi } \circ P_{v}\), is a desired classifier.

Case 3 \(i > 2,j > i + 1\). In this general case we show that two hidden layers are enough to classify the data points. As in Case 2, we can construct \(f_{1} = \sigma_{R} \circ A_{1}\) so that

$$ f_{1} \left( {x_{i - 1} } \right) < f_{1} \left( {x_{i} } \right) < f_{1} \left( {x_{i + 1} } \right) < \cdots < f_{1} \left( {x_{j} } \right) < \cdots < f_{1} \left( {x_{q} } \right) $$

Let \(c = \frac{{f_{1} \left( {x_{i - 1} } \right) + f_{1} \left( {x_{j} } \right)}}{2}\). Then, similarly to the arguments for Proposition 4.1, we consider an affine map \(A_{2} :R \to R^{2}\) such that

\(A_{2} \left( R \right) \bot v\) and \(A_{2} \left( c \right) = 0 \in R^{2}\).

Now define the label preserving map \(f_{2} = P_{v} \circ \sigma_{R} \circ A_{2} :R \to R^{2}\); then on the linear space \(L_{v}\) the map \(f_{2}\) has the property

\(f_{2} \circ f_{1} \left( {x_{i - 1} } \right) = f_{2} \circ f_{1} \left( {x_{j} } \right) < f_{2} \circ f_{1} \left( {x_{j + 1} } \right) < \cdots < f_{2} \circ f_{1} \left( {x_{q} } \right)\).

In addition, \(f_{2} \circ f_{1} \left( {x_{k} } \right) < f_{2} \circ f_{1} \left( {x_{i - 1} } \right)\) for \(k = i,...,j - 1\).

Thus, in the new sequence of points on \(L_{v}\), we have only one alternative point \(b_{j} = f_{2} \circ f_{1} \left( {x_{i - 1} } \right) = f_{2} \circ f_{1} \left( {x_{j} } \right)\), satisfying \(b_{j} < f_{2} \circ f_{1} \left( {x_{j + 1} } \right) < \cdots < f_{2} \circ f_{1} \left( {x_{q} } \right)\). A map of the form

\(\overline{\xi }\left( x \right) = \left\langle v, x \right\rangle - p\), with \(\mathop {\max }\limits_{k = i,...,j - 1} \left\langle v, f_{2} \circ f_{1} (x_{k} ) \right\rangle < p < \left\langle v, b_{j} \right\rangle\),

classifies the new sequence of points. Finally, let

$$ Net = \xi \circ \sigma_{R} \circ A_{2} \circ \sigma_{R} \circ A_{1} = \overline{\xi } \circ P_{v} \circ \sigma_{R} \circ A_{2} \circ \sigma_{R} \circ A_{1} $$

which is a desired classifier for \(A_{q}\) and has two hidden layers from \(H_{R}\). □

Now we can obtain a more general statement.

Proposition 4.2

Suppose \(A_{q} \subset R\) can be divided into two labelled subsets \(C_{1}\) and \(C_{2}\). If \(alt\;A_{q} = h\), then there exists a DNN classifier with no more than \(2(h - 1)\) width-two hidden layers from \(H_{R}\).

Proof

It is enough to show that we can construct a DNN net that maps the set \(A_{q}\) to a finite set on \(L_{v}\) having only one alternative point. Let \(x_{i_{1}} ,...,x_{i_{h}} \in A_{q}\) be the alternative points, satisfying \(x_{i_{1}} < \cdots < x_{i_{h}}\). From the previous arguments, it is easy to see that there is a DNN \(Net_{1}\) with no more than two hidden layers from \(H_{R}\) such that

\(Net_{1} \left( {x_{k} } \right) = Net_{1} \left( {x_{i_{2}} } \right)\) for \(k = 1,2,...,i_{1} - 1\),

\(Net_{1} \left( {x_{k} } \right) < Net_{1} \left( {x_{i_{2}} } \right)\) for \(k = i_{1} ,i_{1} + 1,...,i_{2} - 1\),

and

$$ Net_{1} \left( {x_{k} } \right) < Net_{1} \left( {x_{k + 1} } \right), \qquad k = i_{2} ,...,q - 1 $$

Therefore we obtain a new image sequence of points on \(L_{v}\) with \(h - 1\) alternative points. Continuing in this way, we construct a new \(Net_{2}\) with at most two hidden layers from \(H_{R}\) and obtain a new sequence of points on \(L_{v}\) with \(h - 2\) alternative points. After a series of constructions of nets \(Net_{3} ,...,Net_{h - 1}\), each having at most two hidden layers from \(H_{R}\), we finally have a sequence of points with only one alternative point. Now it is easy to see that the composition map

$$ Net = Net_{h - 1} \circ Net_{h - 2} \circ \cdots \circ Net_{1} $$
(4)

has the property that \(Net\left( {A_{q} } \right)\) is a sequence of points on \(L_{v}\) having only one alternative point.
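Along the scalar coordinate of \(L_{v}\), each block \(Net_{t}\) of this construction acts as a shift-and-clamp (the effect of \(\sigma_{R} \circ A_{1}\)) followed by a fold about a chosen centre (the effect of \(P_{v} \circ \sigma_{R} \circ A_{2}\)). The following sketch tracks only that scalar coordinate and iterates the reduction until at most one alternative point remains, as in (4); the sample data and helper names are our own.

```python
import numpy as np

def alt_positions(y, labels):
    """Sorted order of the points and the positions where the label changes."""
    order = np.argsort(y)
    lab = labels[order]
    return order, [k for k in range(1, len(y)) if lab[k] != lab[k - 1]]

def reduction_step(y, labels):
    """Scalar effect of one block Net_t: collapse everything left of the first
    alternative point, then fold so that the collapsed block merges with the second one."""
    order, alts = alt_positions(y, labels)
    i, j = alts[0], alts[1]
    base = y[order[i - 1]]                 # point onto which the left block is collapsed
    z = np.maximum(y - base, 0.0)          # effect of sigma_R o A_1 along L_v
    c = 0.5 * z[order[j]]                  # fold centre
    return np.abs(z - c)                   # effect of P_v o sigma_R o A_2 along L_v

# a labelled set on the line with three alternative points
y = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
labels = np.array([1, 1, 2, 1, 2, 2])

while len(alt_positions(y, labels)[1]) > 1:
    y = reduction_step(y, labels)

print(y, alt_positions(y, labels)[1])   # one alternative point left; a linear threshold now classifies
```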

Remark

In view of the definition of DNNs in (1) with \(F_{i} (x) = \sigma_{R} \circ A_{i} (x)\) and

$$ A_{i} (x) = W_{i} x + b_{i} , \qquad W_{i} \in M_{{m_{i} \times n_{i} }} , \qquad b_{i} \in R^{{m_{i} }} , $$

we can rewrite the affine map of every hidden layer as \(A_{i} (x) = W_{i} \circ P_{v} x + b_{i}\), so that (4) can be expressed in the form (1) with \(A_{i} (x) = \overline{W}_{i} x + b_{i}\), where \(\overline{W}_{i} = W_{i} \circ P_{v}\).

Note that \(A_{q}\) has at most \(q - 1\) alternative points. If \(A_{q}\) has exactly \(q - 1\) alternative points, then no point of \(A_{q}\) lies between two successive alternative points. In this situation, it is easy to see that removing one alternative point at a time from the left requires only one hidden layer from \(H_{R}\), in view of Proposition 4.1. If there are \(h\) points between two successive alternative points, then \(A_{q}\) has at most \(q - 1 - h\) alternative points. Taking these observations into account, we have the following fact.

Theorem 4.2

Suppose \(A_{q}\) can be divided into two labelled subsets \(C_{1}\) and \(C_{2}\). Then there exists a DNN classifier with no more than \(q - 1\) width-two hidden layers from \(H_{R}\).

5 The Main Result and Examples

Now we can state the main result of this paper.

Theorem 5.1

Consider the finite set \(D_{q} = \{ x_{1} ,...,x_{q} \} \subset R^{n}\) that can be divided into two labelled subsets \(C_{1}\) and \(C_{2}\). Then there exists a DNN classifier \(Net\) with no more than \(q - 1\) width-two hidden layers.

Proof

From Theorem 3.1, there is a linear map \(L:R^{n} \to R\) such that \(y_{i} = L(x_{i} ) \ne y_{j} = L(x_{j} )\) whenever \(i \ne j\). Now consider the points \(y_{1} ,y_{2} ,...,y_{q}\) on \(R\) and apply Theorem 4.2. □

To illustrate the above theorem, we provide two toy examples, hoping to give practitioners some insight into designing DNNs for practical problems.

Examples

Example 1

Let \(D_{4} = \{ x_{1} ,x_{2} ,x_{3} ,x_{4} \} \subset R^{2}\), where

\(x_{1} = (1,1)^{T}\), \(x_{2} = ( - 1,1)^{T}\), \(x_{3} = ( - 1, - 1)^{T}\), \(x_{4} = (1, - 1)^{T}\),

as shown in Fig. 1

Fig. 1. Four points located in the unit square with two labels

The binary classification for the set in Fig. 1 is formulated as follows: \(x_{i}\) is labeled 1 if and only if exactly one of its coordinate components equals 1; otherwise \(x_{i}\) is labeled 2. Now we want to design a DNN net to classify this set. Note that any projection \(P:R^{2} \to L\) onto a one-dimensional subspace \(L\) that is not a coordinate axis is label preserving and has the property that either

\(P(x_{2} )\) and \(P(x_{4} )\) lie between \(P(x_{1} )\) and \(P(x_{3} )\),

or

\(P(x_{1} )\) and \(P(x_{3} )\) lie between \(P(x_{2} )\) and \(P(x_{4} )\).

This shows that the sequence of points \(P(x_{1} )\), \(P(x_{2} )\), \(P(x_{3} )\), \(P(x_{4} )\) has two alternative points. Therefore, by the same argument as in Proposition 4.1, a DNN net with one hidden layer from \(H_{R}\) can classify \(D_{4}\).

There are numerous narrow DNNs able to solve this toy classification problem, as long as they are designed as described in the arguments of Proposition 4.1 and Lemma 4.1. For instance, one such neural net can be designed as follows.

Let \(P_{u} :R^{2} \to L_{u}\) be the projection map, where \(L_{u} \subset R^{2}\) is the linear space spanned by \(u = \frac{1}{\sqrt 2 }(1,1)\), so that \(P_{u} (x) = \left\langle x, u \right\rangle u\).

Now let

$$ A = \left[ {\begin{array}{*{20}c} { - 1} & 0 \\ 0 & 1 \\ \end{array} } \right] $$

Then let \(Net(x) = P_{u} \circ \sigma_{R} \circ A \circ P_{u} (x) = P_{u} \circ F_{1} (x)\). It is easy to verify that

\(Net(x_{1} ) = Net(x_{3} ) = \frac{1}{\sqrt 2 }u\) and \(Net(x_{2} ) = Net(x_{4} ) = 0\).

We can construct a linear function to separate them. Alternatively, since the points of \(\sigma_{R} \circ A \circ P_{u} (D_{4} ) = F_{1} (D_{4} )\) are separable by a linear function, say \(\eta (X,Y) = \tfrac{1}{2} - X - Y\) defined on \(R^{2}\), the net \(Net = \eta \circ F_{1}\) classifies \(D_{4}\). It has only one hidden layer \(F_{1} = \sigma_{R} \circ A \circ P_{u}\).
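A sketch verifying these computations numerically (the linear function \(\eta\) is the one chosen above):

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)
u = np.array([1.0, 1.0]) / np.sqrt(2.0)
P_u = lambda x: (x @ u) * u                     # projection onto L_u
A = np.array([[-1.0, 0.0], [0.0, 1.0]])

F1 = lambda x: relu(A @ P_u(x))                 # the single width-2 hidden layer
eta = lambda z: 0.5 - z[0] - z[1]               # linear classifying function
net = lambda x: eta(F1(x))

D4 = {"x1": np.array([1.0, 1.0]),    # label 2
      "x2": np.array([-1.0, 1.0]),   # label 1
      "x3": np.array([-1.0, -1.0]),  # label 2
      "x4": np.array([1.0, -1.0])}   # label 1
for name, x in D4.items():
    print(name, round(net(x), 3))    # positive for label 1, negative for label 2
```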

In the theorems above, the arguments in the proofs rest on the existence of a linear dimensionality-reduction map in the first layer of the DNN net that is injective on the considered data set. This implies that the number of alternative points of the image set under the first-layer linear map depends on how that linear map is selected; consequently, different linear dimensionality-reduction maps lead to different depths of the DNN nets that complete the binary classification. We demonstrate this point in the following example.

Example 2

Consider the set \(D_{4} = \{ x_{1} ,x_{2} ,x_{3} ,x_{4} \} \subset R^{2}\), with

\(x_{1} = (1,1)^{T}\), \(x_{2} = ( - 1,1)^{T}\), \(x_{3} = ( - 2, - 1)^{T}\), \(x_{4} = (2, - 1)^{T}\),

As illustrated in Fig. 2.

Fig. 2. Four points with two labels

In this example, the binary classification of the set is formulated as follows: \(x_{i}\) is labeled 1 if and only if exactly one of its coordinate components is positive; otherwise \(x_{i}\) is labeled 2.

It is clear that the projection map from \(R^{2}\) onto its \(X\)-axis is label preserving. Denote it by \(P_{X}\); then we have \(P_{X} (x_{3} ) < P_{X} (x_{2} ) < P_{X} (x_{1} ) < P_{X} (x_{4} )\). This sequence has 3 alternative points. Following the procedure of Sect. 4, we can therefore construct a classifier with two width-two hidden layers from \(H_{R}\). Now we provide a concrete classifier.

Let \(P_{X} :R^{2} \to L_{X}\) be the projection map from \(R^{2}\) onto the \(X\)-axis. Then we have

\(y_{1} = P_{X} (x_{1} ) = (1,0)^{T}\), \(y_{2} = P_{X} (x_{2} ) = ( - 1,0)^{T}\),

\(y_{3} = P_{X} (x_{3} ) = ( - 2,0)^{T}\), \(y_{4} = P_{X} (x_{4} ) = (2,0)^{T}\).

Define the affine map \(A_{1} (x) = W_{1} x + b_{1}\) with

\(W_{1} = \left[ {\begin{array}{*{20}c} 1 & 1 \\ { - 1} & 1 \\ \end{array} } \right]\), \(b_{1} = \left[ {\begin{array}{*{20}c} { - 0.5} \\ {0.5} \\ \end{array} } \right]\).

Then it is easy to verify the following.

\(\sigma_{R} \circ A_{1} (y_{1} ) = (0.5,0)^{T}\), \(\sigma_{R} \circ A_{1} (y_{2} ) = (0,1.5)^{T}\),

\(\sigma_{R} \circ A_{1} (y_{3} ) = (0,2.5)^{T}\), \(\sigma_{R} \circ A_{1} (y_{4} ) = (1.5,0)^{T}\).

Let \(P_{v} :R^{2} \to L_{v}\) be the linear mapping defined as \(P_{v} (x) = < x,v > v\) with \(L_{v}\) being the linear space spanned by \(v\), \(v = (1,1)\). Then.

\(z_{1} = P_{v} \circ \sigma_{R} \circ A_{1} (y_{1} ) = (0.5,0.5)^{T}\), \(z_{2} = P_{v} \circ \sigma_{R} \circ A_{1} (y_{2} ) = (1.5,1.5)^{T}\).

\(z_{3} = P_{v} \circ \sigma_{R} \circ A_{1} (y_{3} ) = (2.5,2.5)^{T}\), \(z_{4} = P_{v} \circ \sigma_{R} \circ A_{1} (y_{4} ) = (1.5,1.5)^{T}\).

Let \(A_{2} (x) = W_{2} x + b_{2}\), where.

\(W_{2} = \left[ {\begin{array}{*{20}c} { - 1} & 0 \\ 0 & 1 \\ \end{array} } \right]\), \(b_{2} = \left[ {\begin{array}{*{20}c} {1.5} \\ { - 1.5} \\ \end{array} } \right]\).

It is easy to see that.

\(\sigma_{R} \circ A_{2} (z_{1} ) = (1,0)^{T}\), \(\sigma_{R} \circ A_{2} (z_{2} ) = \sigma_{R} \circ A_{2} (z_{4} ) = (0,0)^{T}\), \(\sigma_{R} \circ A_{2} (z_{3} ) = (0,1)^{T}\).

Now it is obvious that the above three labeled image points are separable by a linear function, say \(\eta\) defined by

\(\eta (X,Y) = \tfrac{1}{2} - X - Y\).

Then the net defined by

$$ Net = \eta \circ \sigma_{R} \circ A_{2} \circ P_{v} \circ \sigma_{R} \circ A_{1} \circ P_{X} = \eta \circ F_{2} \circ F_{1} (x) $$

is a desired classification DNN net with two hidden layers.
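The following sketch reproduces the computations of Example 2 end to end, using the matrices \(W_{1}, b_{1}, W_{2}, b_{2}\) given above and the linear function \(\eta\) chosen above:

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)
v = np.array([1.0, 1.0])
P_X = lambda x: np.array([x[0], 0.0])            # projection onto the X-axis
P_v = lambda x: (x @ v) * v                      # P_v(x) = <x, v> v with v = (1, 1)

W1, b1 = np.array([[1.0, 1.0], [-1.0, 1.0]]), np.array([-0.5, 0.5])
W2, b2 = np.array([[-1.0, 0.0], [0.0, 1.0]]), np.array([1.5, -1.5])

F1 = lambda x: relu(W1 @ P_X(x) + b1)            # first width-2 hidden layer: sigma_R o A_1 o P_X
F2 = lambda x: relu(W2 @ P_v(x) + b2)            # second width-2 hidden layer: sigma_R o A_2 o P_v
eta = lambda z: 0.5 - z[0] - z[1]                # linear classifying function
net = lambda x: eta(F2(F1(x)))

D4 = {"x1": (np.array([1.0, 1.0]), 2), "x2": (np.array([-1.0, 1.0]), 1),
      "x3": (np.array([-2.0, -1.0]), 2), "x4": (np.array([2.0, -1.0]), 1)}
for name, (x, label) in D4.items():
    print(name, label, round(net(x), 3))         # positive for label 1, negative for label 2
```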

By contrast, consider another projection map. For instance, take the vector \(v = x_{2} - x_{4}\) and let \(L^{v} = \{ x \in R^{2} : \left\langle x, v \right\rangle = 0\}\); then the projection \(P^{v} :R^{2} \to L^{v}\) is also label preserving and satisfies

$$ P^{v} (x_{3} ) < P^{v} (x_{2} ) = P^{v} (x_{4} ) < P^{v} (x_{1} ) $$

The new sequence of points thus obtained has 2 alternative points. Thus we can have a classifier with one width-two hidden layer from \(H_{R}\). The reader can construct a specific classifier by following the procedure above.

6 The Size and Capacity of DNNs in Finite Set Binary Classification

First let us set up some notions and notations. Consider the net

$$ Net^{R} (x) = F_{out} \circ F_{k} \circ F_{k - 1} \circ \cdots \circ F_{1} (x) = F_{out} \circ \sigma_{R} \circ A_{k} \circ \sigma_{R} \circ A_{k - 1} \circ \cdots \circ \sigma_{R} \circ A_{1} (x) $$
(6.1)

Let \(w_{i}\) be the width of the hidden layer \(F_{i}\), \(i = 1,2,...,k\). According to [1], the size \(Si\) of the above DNN is defined as follows.

Definition 6.1

The size of (6.1) is

$$ Si(Net^{R} ) = w_{1} + w_{2} + ... + w_{k} $$

Definition 6.2

The parameter capacity \(Cp(Net^{R} )\) of \(Net^{R}\) is defined to be the number of parameters of (6.1) that are adjustable, that is, the dimension of the space of adjustable parameters.

Now in view of the arguments in Sects. 4 and 5, we have the following observations.

Theorem 6.1

Consider the finite set \(D_{q} = \{ x_{1} ,...,x_{q} \} \subset R^{n}\) that can be divided into two labelled subsets \(C_{1}\) and \(C_{2}\). Then there exists a class of DNN classifiers \(Net^{R}\) satisfying the following inequalities

\(Si(Net^{R} ) \le 2(q - 1)\)

and

$$ Cp(Net^{R} ) \le n + q - 1 $$

Proof

The affine map \(A_{1} (x)\) in the first hidden layer \(F_{1} (x)\) is in fact a projection followed by a translation and can be written as

$$ A_{1} (x) = < x,v > v + b_{1} $$

with \(v \in S^{n - 1}\) being the adjustable vector. Note that every weight matrix in \(A_{i} (x) = W_{i} x + b_{i}\) for \(i = 2,...,k\) can be written as

\(W_{i} = \left[ {\begin{array}{*{20}c} { - 1} & 0 \\ 0 & 1 \\ \end{array} } \right]P_{v}\),

and the bias \(b_{i}\) can be written as

\(b_{i} = \zeta_{i} ( - 1,1)^{T}\), \(\zeta_{i} \in R\), \(i = 1,2,...,q - 1\).

Then it is easy to see that the statement holds.□
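A trivial sketch of the resulting bounds, counting one direction \(v \in R^{n}\) for the first layer and one scalar \(\zeta_{i}\) per hidden layer as in the proof (the function is our own illustration of Theorem 6.1, not part of the construction):

```python
def size_and_capacity_bounds(n, q):
    """Upper bounds from Theorem 6.1 for the constructed classifiers of a q-point set in R^n."""
    k = q - 1              # at most q - 1 hidden layers (Theorem 5.1)
    Si = 2 * k             # each hidden layer has width 2
    Cp = n + q - 1         # one direction v in R^n plus one scalar zeta_i per hidden layer
    return Si, Cp

print(size_and_capacity_bounds(n=5, q=7))   # (12, 11)
```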

7 Summary

In this paper we have proved that, for a labelled finite set in a Euclidean space, it is always possible to construct a width-two DNN net that carries out the binary classification task. The remarkable fact obtained in this paper is that the constructed neural net needs no more hidden layers than the cardinality of the finite labelled set, and every hidden layer is of width two. We hope that this fact sheds new light on the relation of “width” and “depth” to the design of DNN nets for classification problems.

In addition, in designing the above classifying nets, we have seen how the geometric images of the data set change in the internal representations across the depth of the neural network, thus making the “black box” of this kind of DNN net transparent.

The examples provided in Sect. 5 show that the projection mapping in the first layer affects the depth of the designed DNN net. This fact reminds us that the initial hidden layer should be chosen carefully when designing practical DNN nets.

Moreover, from the learning perspective, one usually divides a dataset into a training set and a testing set, so whether a DNN designed on the training set generalizes to the testing set remains a hard practical problem to be investigated. We will touch on this problem in a forthcoming paper.

Because the purpose of the present paper is to shed new light on what size of DNN net, in terms of “width” and “depth”, is capable of classifying a finite dataset with two labels, we hope the existence results and examples in this paper may help practitioners in coping with practical problems.

It should be stressed that the approach proposed in this paper is not practical for real-world classification problems, because real datasets are usually contaminated. Besides, two data points cannot be discriminated by DNNs in practice if they are too close to each other. However, the results of the present paper can shed light on the mechanism of how DNNs work and contribute to the interpretability of DNNs, which may inspire more economical approaches to the design and development of new DNN architectures for real-world problems.