Topology-based representative datasets to reduce neural network training resources

One of the main drawbacks of the practical use of neural networks is the long time required by the training process. Such a training process consists of an iterative change of parameters trying to minimize a loss function. These changes are driven by a dataset, which can be seen as a set of labeled points in an n-dimensional space. In this paper, we explore the concept of a representative dataset: a dataset smaller than the original one that satisfies a nearness condition independent of isometric transformations. Representativeness is measured using persistence diagrams (a computational topology tool) due to their computational efficiency. We theoretically prove that, when the neural network architecture is a perceptron, the loss function is the mean squared error and certain conditions on the representativeness of the dataset are imposed, the accuracy of the perceptron evaluated on the original dataset coincides with its accuracy evaluated on the representative dataset. These theoretical results, accompanied by experimentation, open the door to reducing the size of the dataset in order to save time in the training process of any neural network.


Introduction
The success of the different architectures used in the framework of neural networks is doubtless Goodfellow et al. [2016]. The achievements made in areas such as video imaging Mohammadi et al. [2021], recognition Biswas and Blanco-Medina [2021], or language models Brown et al. [2020] show the surprising potential of such architectures. In spite of such success, they still have some shortcomings. One of their main drawbacks is the long time needed in the training process. Such a long training time is usually associated with two factors: first, the large number of weights to be adjusted in current architectures and, second, the huge datasets used to train neural networks. In general, the time needed to train a complex neural network from scratch is so long that many researchers use pretrained neural networks such as, for example, the Oxford VGG models Simonyan and Zisserman [2015], the Google Inception model Szegedy et al. [2015], or the Microsoft ResNet model He et al. [2016]. Other attempts to reduce the training time are, for example, to partition the training task into multiple training subtasks with submodels, which can be performed independently and in parallel Miranda and Zuben [2015], to use asynchronous averaged stochastic gradient descent You and Xu [2014], and to reduce data transmission through a sampling-based approach Xiao et al. [2017]. Besides, in Wang et al. [2018], the authors studied how the elimination of "unfavourable" samples improves generalization accuracy for convolutional neural networks. Finally, in Wang et al. [2020], an unweighted influence data subsampling method is proposed.
Roughly speaking, a training process consists of searching for a local minimum of a loss function in an abstract space whose states are sets of weights. Each of the training sample batches provides an extremely small change in the weights according to the training rules. The aim of such changes is to find the "best" set of weights, that is, the one that minimizes the loss function. Under a "geometrical" interpretation of the learning process, such changes can be seen as tiny steps in a multidimensional metric space of parameters, following the direction set by the gradient of the loss function. In some sense, one may think that two "close" points of the dataset with the same label provide "similar" information to the learning process, since the gradient of the loss function at such points is similar.
Such a viewpoint leads us to look for a representative dataset: a dataset "close" to the original one and with fewer points, but keeping its "topological information", which allows the neural network to perform the learning process in less time without losing accuracy. This way, in this paper, we show how to reduce the training time by choosing a representative dataset of the original dataset.
Besides, we formally prove, for the perceptron case, that the accuracy of a neural network trained with the representative dataset is similar to the accuracy of the neural network trained with the original dataset. Experimental evidence indicates that the same property holds for representative datasets in the general case. Moreover, in order to "keep the shape" of the original dataset, the concept of representative dataset is associated with a notion of nearness independent of isometric transformations. As a first approach, the Gromov-Hausdorff distance is used to measure the representativeness of the dataset. Nonetheless, since the complexity of computing the Gromov-Hausdorff distance is an open problem2, the bottleneck distance between persistence diagrams Edelsbrunner and Harer [2010] is used instead as a lower bound of the Gromov-Hausdorff distance, since its time complexity is cubic in the size of the dataset (see Chazal et al. [2009]).
The paper is organized as follows. In Section 2, basic definitions and results from neural networks and computational topology are given. The notion of representative dataset is introduced in Section 3. Persistence diagrams are used in Section 4 to measure the representativeness of a dataset. In Section 5, the perceptron architecture is studied to provide bounds for comparing the training performance on the original dataset and on its representative dataset. Specifically, in Subsection 5.1, experimental results are provided for the perceptron case showing the good performance of representative datasets compared with random datasets. In Section 6, we illustrate experimentally the same fact for several multi-layer neural networks. Finally, some conclusions and future work are provided in Section 7.

Background
Next we recall some basic definitions and notations used throughout the paper.

Neural Networks
The research field of neural networks is extremely active and new architectures are continuously being presented (see, e.g., CapsNets He et al. [2016], bidirectional feature pyramid networks Tan et al. [2020], or new variants of the Gated Recurrent Units Dey and Salem [2017], Heck and Salem [2017]), so the current notion of neural network is far from the classic multilayer perceptron or radial basis function networks Haykin [2009].
As a general setting, a neural network is a mapping N_{w,Φ} : R^n → R^m that depends on a set of weights w and a set of parameters Φ, which comprises the description of the synapses between neurons, the layers, the activation functions and any other architectural consideration. To train the neural network N_{w,Φ}, we use a dataset, which is a finite set of pairs D = {(x, c_x) : x ∈ X ⊂ R^n and c_x ∈ {0, 1, ..., k}} for a fixed integer k ∈ N. The sets X and {0, 1, ..., k} are called, respectively, the set of points and the set of labels of D.
To perform the learning process, we use: (1) a loss function which measures the difference between the output of the network (obtained with the current weights) and the desired output; and (2) a loss-driven training method to iteratively update the weights.

Persistent Homology
In this paper, the representativeness of a dataset will be measured using methods from the recently developed area called computational topology, whose main tool is persistent homology. A detailed presentation of this field can be found in Edelsbrunner and Harer [2010].
Homology provides a mathematical formalism to study how a space is connected, the q-dimensional homology group being the mathematical representation of the q-dimensional "holes" in a given space. This way, the 0-dimensional homology group counts the connected components of the space, the 1-dimensional homology group its tunnels, and the 2-dimensional homology group its cavities. For higher dimensions, the intuition of holes is lost.
Persistent homology is usually computed when the homology groups cannot be determined directly; an example of the latter appears when a surface is sampled by a point cloud. Persistent homology is based on the concept of filtration, which is an increasing sequence of simplicial complexes that "changes" over time. The building blocks of a simplicial complex are q-simplices.
From the notions of q-simplex and face, the concepts of simplicial complex, subcomplex and filtration arise in a natural way.
Definition 2 A simplicial complex K is a finite set of simplices satisfying the following properties: (1) every face of a simplex of K belongs to K; and (2) the intersection of any two simplices of K is either empty or a common face of both. A filtration of a simplicial complex K is a nested sequence of subcomplexes K_0 ⊆ K_1 ⊆ ... ⊆ K_m = K.

An example of a filtration is the Vietoris-Rips filtration (see Hausmann [2016]). Roughly speaking, a Vietoris-Rips filtration is obtained by "growing" open balls centered at every point of a given set in an n-dimensional space. Specifically, the filtration is built by increasing the radius of the balls with time and joining those vertices whose balls intersect, forming new simplices.
This way, we say that a q-dimensional hole is born along the filtration when it appears, and we say that a q-dimensional hole dies when it merges with another q-dimensional hole at a certain time during the construction of the filtration. The rule used to decide which hole dies when two of them merge is the elder rule, which establishes that the younger hole dies. So, the births and deaths of all the holes are tracked over time along the filtration. One of the common graphical representations of the births and deaths of the q-dimensional holes over time is the so-called (q-dimensional) persistence diagram, which consists of a set of points in the Cartesian plane. A point of a persistence diagram represents the birth and the death of a q-dimensional hole. Since deaths happen only after births, all the points in a persistence diagram lie above the diagonal axis. Furthermore, those points in a persistence diagram that are far from the diagonal axis are candidates to be "topologically significant", since they represent holes that survive for a long time. The so-called bottleneck distance can be used to compare two persistence diagrams.
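Before formalizing the bottleneck distance, the following minimal sketch (our own illustration, not code from the paper) shows how a Vietoris-Rips filtration and its persistence diagrams can be computed in practice with the GUDHI library; the noisy-circle sample and all parameter values are assumptions made for the example.

```python
# Minimal sketch (not taken from the paper's code) of computing Vietoris-Rips
# persistence diagrams with GUDHI; sample and parameters are illustrative.
import numpy as np
import gudhi

# Sample a noisy circle: one connected component and one 1-dimensional hole.
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
points = np.c_[np.cos(angles), np.sin(angles)] + rng.normal(0, 0.05, (200, 2))

# Build the Vietoris-Rips filtration (balls grow as the radius increases).
rips = gudhi.RipsComplex(points=points, max_edge_length=2.0)
simplex_tree = rips.create_simplex_tree(max_dimension=2)
simplex_tree.persistence()  # computes births and deaths of the holes

# Birth/death pairs of the q-dimensional holes: q = 0 (components), q = 1 (tunnels).
dgm0 = simplex_tree.persistence_intervals_in_dimension(0)
dgm1 = simplex_tree.persistence_intervals_in_dimension(1)
print(len(dgm0), "0-dimensional classes,", len(dgm1), "1-dimensional classes")
```

For such a sample, the 1-dimensional diagram typically shows a single point far from the diagonal, reflecting the tunnel of the circle.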
Definition 3 The (q-dimensional) bottleneck distance between two (q-dimensional) persistence diagrams Dgm and Dgm′ is: d_B(Dgm, Dgm′) = inf_φ sup_α ‖α − φ(α)‖_∞, where α ∈ Dgm ∪ ∆, φ is any possible bijection between Dgm ∪ ∆ and Dgm′ ∪ ∆, and ∆ is the set of points in the diagonal axis.
A useful result used in this paper is the following one, which connects the Gromov-Hausdorff distance between two metric spaces and the bottleneck distance between the persistence diagrams obtained from their corresponding Vietoris-Rips filtrations. For the sake of brevity, the (q-dimensional) persistence diagram obtained from the Vietoris-Rips filtration computed from a subset X of R^n, with q ≤ n, will be simply called the (q-dimensional) persistence diagram of X and denoted by Dgm_q(X).
Theorem 1 [Chazal et al., 2014, Theorem 5.2] For any two subsets X and Y of R^n, and for any dimension q ≤ n, the bottleneck distance between the persistence diagrams of X and Y, Dgm_q(X) and Dgm_q(Y), is bounded by the Gromov-Hausdorff distance between X and Y: d_B(Dgm_q(X), Dgm_q(Y)) ≤ 2·d_GH(X, Y).

Let us recall that the Hausdorff distance between X and Y is d_H(X, Y) = max{ sup_x inf_y ‖x − y‖, sup_y inf_x ‖x − y‖ }, where x ∈ X and y ∈ Y, and that the Gromov-Hausdorff distance between X and Y is defined as the infimum of the Hausdorff distance taken over all possible isometric transformations Chazal et al. [2014]. That is, d_GH(X, Y) = inf_{Z,f,g} d_H(f(X), g(Y)), where f : X → Z denotes an isometric transformation of X into some metric space Z and g : Y → Z denotes an isometric transformation of Y into Z.
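Theorem 1 can be exercised numerically: in the sketch below (again our own illustration; the helper persistence_diagram, the subsampling and every parameter value are assumptions), the Hausdorff distance is computed with SciPy and the 0-dimensional bottleneck distance with GUDHI, so that the chain d_B/2 ≤ d_GH ≤ d_H can be checked on a toy example.

```python
# Hedged sketch comparing a dataset X with a candidate subset Xs via the
# Hausdorff distance and the 0-dimensional bottleneck distance of Theorem 1.
import numpy as np
import gudhi
from scipy.spatial.distance import directed_hausdorff

def persistence_diagram(points, dim=0, max_edge=5.0):
    """Vietoris-Rips persistence diagram of `points` as (birth, death) pairs."""
    st = gudhi.RipsComplex(points=points, max_edge_length=max_edge)
    st = st.create_simplex_tree(max_dimension=dim + 1)
    st.persistence()
    dgm = st.persistence_intervals_in_dimension(dim)
    return dgm[np.isfinite(dgm[:, 1])]  # drop the essential (infinite) class

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
Xs = X[rng.choice(len(X), size=60, replace=False)]  # a candidate representative subset

d_H = max(directed_hausdorff(X, Xs)[0], directed_hausdorff(Xs, X)[0])
d_B = gudhi.bottleneck_distance(persistence_diagram(X), persistence_diagram(Xs))

# By Theorem 1 and d_GH <= d_H:  d_B / 2 <= d_GH(X, Xs) <= d_H(X, Xs).
print(f"bottleneck/2 = {d_B / 2:.3f}  <=  Hausdorff = {d_H:.3f}")
```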

Representative Datasets
As mentioned above, the key idea in this paper is the formal definition of representative datasets and the proof of their usefulness to reduce the resources needed to train a neural network without losing accuracy. Proving such a general result for any neural network architecture is beyond the scope of this paper and probably not possible, since the definition of neural network is continuously evolving, as mentioned before. Due to such difficulties, we begin in this paper by proving the usefulness of representative datasets in the perceptron case, and by experimentally showing their effectiveness in the case of multilayer neural networks.
To start with, in this section, we provide the definition of representative dataset, which is independent of the neural network architecture considered. The intuition behind this definition is to keep the "shape" of the original dataset while reducing its number of points. Firstly, let us introduce the notion of ε-representative point.
Definition 4 Given (x, c_x) ∈ D, a labeled point (x̃, c_x̃) is an ε-representative point of (x, c_x) if c_x̃ = c_x and there exists δ ∈ R^n such that x = x̃ + δ with ‖δ‖ ≤ ε, where ε ∈ R is the representation error. We denote it by x̃ ≈_ε x.
The next step is to define the concept of ε-representative dataset. Notice that if a dataset can be correctly classified by a neural network, any isometric transformation of such a dataset can also be correctly classified by the neural network (after adjusting the weights). Therefore, the definition of ε-representative dataset should be independent of such transformations. Moreover, the concept of λ-balanced ε-representative dataset is also introduced; it will be used in Section 5 to ensure that similar results are obtained when training a perceptron with a representative dataset instead of training it with the original one.
Finally, we will say that ε is optimal if ε = min{ξ : D̃ is a ξ-representative dataset of D}.
Proposition 1 Let D̃ be an ε-representative dataset (with set of points X̃) of a dataset D (with set of points X). Then d_GH(X, X̃) ≤ ε.
Proof: By definition of ε-representative dataset, there exists an isometric transformation of X̃ into R^n such that, for all x ∈ X, there exists x̃ in the transformed X̃ with x̃ ≈_ε x. Hence the Hausdorff distance between X and the transformed X̃ is at most ε and, taking the infimum over all isometric transformations, d_GH(X, X̃) ≤ ε.

The definition of ε-representative dataset is not useful when ε is "big". The following result, which is a consequence of Proposition 1, provides the optimal value for ε.
Therefore, one way to discern whether a dataset D̃ is "representative enough" of D is to compute the Gromov-Hausdorff distance between X and X̃. If the Gromov-Hausdorff distance is "big", we could say that the dataset D̃ is not representative of D. However, the Gromov-Hausdorff distance is not useful in practice because of its high computational cost. An alternative approach to this problem is given in Section 4.
Finally, the definition of dominating set is introduced, which will be used in the next subsection.
Definition 6 Given a graph G = (X, E), a set Y ⊆ X is a dominating set of G if, for any x ∈ X, either x ∈ Y or there exists y ∈ Y adjacent to x.

Proximity Graph Algorithm
In this subsection, for a given ε > 0, we propose a variant of the Proximity Graph Algorithm Gonzalez-Diaz et al. to compute an ε-representative dataset of a given dataset D.
Firstly, an ε-proximity graph is built over X, establishing adjacency relations between the points of X, represented by arcs: two points of the same class are adjacent if they are at distance at most ε.
See Fig. 1, in which the proximity graph of one of the two interlaced solid tori is drawn for a fixed ε.
Secondly, from the ε-proximity graph of X, a dominating set X̃ ⊆ X is computed, obtaining an ε-representative dataset D̃ = {(x, c_x) : x ∈ X̃ and (x, c_x) ∈ D}, also called the dominating dataset of D. Algorithm 1 shows the pseudo-code used in this paper to compute the dominating dataset of D.

Algorithm 1 Dominating Dataset Algorithm
Here, DominatingSet(G_ε(X_c)) refers to a dominating set obtained from the proximity graph G_ε(X_c) of the points of class c. Among the existing algorithms in the literature to obtain a dominating set, we will use, in our experiments in Section 5.1, the algorithm proposed in Matula [1987], which runs in O(|X| · |E|). Therefore, the complexity of Algorithm 1 is O(|X|² + |X| · |E|), accounting for the computation of the matrix of distances between points and for the algorithm used to obtain the dominating set.
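A possible implementation of Algorithm 1 is sketched below. It is only an approximation of the procedure described above: the dominating set of each per-class ε-proximity graph is obtained with the greedy nx.dominating_set routine from NetworkX instead of the algorithm of Matula [1987], and all function and variable names are our own.

```python
# Hedged sketch of Algorithm 1: per-class eps-proximity graphs followed by a
# (greedy) dominating set; not the paper's exact implementation.
import numpy as np
import networkx as nx
from scipy.spatial.distance import cdist

def dominating_dataset(X, labels, eps):
    """Return the indices of an eps-representative (dominating) subset of (X, labels)."""
    selected = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        dist = cdist(X[idx], X[idx])                 # pairwise distances within class c
        G = nx.Graph()
        G.add_nodes_from(range(len(idx)))
        rows, cols = np.nonzero(np.triu(dist <= eps, k=1))
        G.add_edges_from(zip(rows, cols))            # eps-proximity graph G_eps(X_c)
        selected.extend(idx[list(nx.dominating_set(G))])
    return np.array(sorted(selected))

# Toy usage with assumed data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
labels = (X[:, 0] + X[:, 1] > 0).astype(int)
keep = dominating_dataset(X, labels, eps=0.4)
print(f"{len(keep)} representative points out of {len(X)}")
```

Every point of each class is then either selected or adjacent (at distance at most ε) to a selected point of the same class, which is exactly the property used in Lemma 1 below.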
Lemma 1 The dominating dataset D̃ obtained by running Algorithm 1 is an ε-representative dataset of D.
Proof: Let us prove that for any (x, c_x) ∈ D there exists (x̃, c_x̃) ∈ D̃ such that x̃ ≈_ε x. Two possibilities arise: either x belongs to the dominating set (and then x ≈_ε x trivially), or x is adjacent in the ε-proximity graph to some point x̃ of the dominating set and, since adjacent points share the same label and are at distance at most ε, x̃ ≈_ε x.

Persistent homology to infer the representativeness of a dataset
In this section, we recall the role of persistent homology as a tool to infer the representativeness of a dataset.
Firstly, from Theorem 1, we can establish that the bottleneck distance between persistence diagrams is a lower bound of the representativeness of the dataset.

Lemma 2 Let D̃ be an ε-representative dataset (with set of points X̃ ⊂ R^n) of a dataset D (with set of points X ⊂ R^n). Let Dgm_q(X) and Dgm_q(X̃) be the q-dimensional persistence diagrams of X and X̃, respectively. Then, for q ≤ n, d_B(Dgm_q(X), Dgm_q(X̃)) ≤ 2ε.
Proof: Since D̃ is an ε-representative dataset of D, then d_GH(X, X̃) ≤ ε by Proposition 1. Now, by Theorem 1, d_B(Dgm_q(X), Dgm_q(X̃)) ≤ 2·d_GH(X, X̃) ≤ 2ε.

As a direct consequence of Lemma 2 and the fact that the Hausdorff distance is an upper bound of the Gromov-Hausdorff distance, we have the following.
Corollary 2 Let D̃ be an ε-representative dataset of D where the parameter ε is optimal. Let Dgm_q(X) and Dgm_q(X̃) be the q-dimensional persistence diagrams of X and X̃, respectively. Then, (1/2)·d_B(Dgm_q(X), Dgm_q(X̃)) ≤ ε ≤ d_H(X, X̃).

In order to illustrate the usefulness of this last result, we will discuss a simple example. In Fig. 2a, we can see a subsample of a circumference (the original dataset) together with two classes corresponding, respectively, to the upper and the lower part of the circumference. In Fig. 2c, we can see a subset of the original dataset and a decision boundary "very" different from the one given in Fig. 2a. Then we could say that the dataset shown in Fig. 2c does not "represent" the same classification problem as the original dataset. However, the dataset shown in Fig. 2b could be considered a representative dataset of the original one, since both decision boundaries are "similar". This can be determined by computing the Hausdorff distance between the original and the other datasets, and the bottleneck distance between the persistence diagrams of the corresponding datasets (see the values shown in Table 1). Using Corollary 2, we can infer that 0.08 ≤ ε_1 ≤ 0.18 for the dataset given in Fig. 2b and 0.13 ≤ ε_2 ≤ 0.3 for the dataset given in Fig. 2c. Therefore, the dataset given in Fig. 2b can be considered "more" representative of the dataset shown in Fig. 2a than the dataset given in Fig. 2c, as expected.
(a) A binary classification problem given by a sampled circumference. In this case, the classification problem tries to distinguish between the upper and the lower part of the circumference.
(b) (ε_1-Representative dataset) A subset of the sampled circumference given in Fig. 2a. Let us observe that the decision boundary obtained is similar to the one shown in Fig. 2a.
(c) (ε_2-Representative dataset) A subset of the sampled circumference given in Fig. 2a. Let us observe that the decision boundary obtained is quite different from the one shown in Fig. 2a.

The Perceptron Case
For the sake of simplicity, we will restrict our interest to a binary classification problem, although our approach is valid for any classification problem. Therefore, our input is a binary dataset D = {(x, c_x) : x ∈ X ⊂ R^n and c_x ∈ {0, 1}}.
A useful property of the sigmoid function is the easy expression of its derivative. Let σ_m denote its composition with the m-th power, σ_m(z) = (σ(z))^m. Secondly, let us find the local extrema of (σ_m)' by computing the roots of its derivative. In the following lemma, we prove that the difference between the outputs of the function y_w^m evaluated at a point x and at its ε-representative point x̃ depends on the weights w and on the parameter ε.
Lemma 4 Let w ∈ R^{n+1} and x, x̃ ∈ R^n with x̃ ≈_ε x. Then |y_w^m(x) − y_w^m(x̃)| ≤ ρ_m ‖w‖_* ε, where ρ_m is the maximum of (σ_m)' on [z, z̄], with z = min{wx, wx̃} and z̄ = max{wx, wx̃}.
Proof: Let us assume, without loss of generality, that wx ≤ wx̃. Then, using the Mean Value Theorem, there exists β ∈ (wx, wx̃) such that σ_m(wx̃) − σ_m(wx) = (σ_m)'(β)·(wx̃ − wx). The maximum of (σ_m)' on [z, z̄] is attained at z̄ if z̄ ≤ log(m) and at z if z ≥ log(m), with z = min{wx, wx̃} and z̄ = max{wx, wx̃}. Consequently, applying the Hölder inequality and replacing w(x − x̃) by ‖w‖_* ε in Eq. (1), we obtain the desired result.
The following result is a direct consequence of Lemma 3 and Lemma 4.
As previously pointed out, we will use the stochastic gradient descent algorithm to train the perceptron, trying to minimize an error function E(w, X) given by the mean squared error over D, where, for (x, c_x) ∈ D and w ∈ R^{n+1}, the squared error between y_w(x) and c_x is the loss function considered in this paper.
Following the stochastic gradient descent algorithm, each iteration takes a random point of the dataset, possibly repeating the same point in different iterations. Let us assume that the point (x_i, c_i) ∈ D is considered at the i-th iteration. Then, the weights w^i = (w^i_0, w^i_1, ..., w^i_n) of the perceptron are updated according to Rule (2), where η_i > 0 is the learning rate, and the update of each coordinate w_j, for j ∈ {1, ..., n} and x = (x_1, ..., x_n), involves the partial derivative of the loss with respect to w_j. By abuse of notation, the point x will be considered as a point (x_0, x_1, ..., x_n) ∈ R^{n+1} with x_0 = 1, so that the bias weight w_0 is covered as well. Now, using Eq. (3), Rule (2) can be written coordinate-wise. To assure the convergence of the training process to a local minimum, it is enough that, for each iteration i, the learning rate η_i satisfies the following conditions (see [Goodfellow et al., 2016, Section 8.3.1]): Σ_i η_i = ∞ and Σ_i η_i² < ∞.
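As an illustration of the training procedure just described, the sketch below (our own code, not the paper's) trains a sigmoid perceptron with per-sample stochastic gradient descent on the squared error; the learning-rate schedule η_i = η_0/(1 + i) and the constant factors in the gradient are assumptions consistent with the conditions above.

```python
# Hedged sketch of perceptron training with sigmoid output, squared error and
# per-sample stochastic gradient descent (assumed schedule eta_i = eta0 / (1 + i)).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_perceptron(X, y, epochs=20, eta0=0.5, seed=0):
    rng = np.random.default_rng(seed)
    Xb = np.c_[np.ones(len(X)), X]      # x_0 = 1 absorbs the bias weight w_0
    w = rng.normal(size=Xb.shape[1])
    it = 0
    for _ in range(epochs):
        for i in rng.permutation(len(Xb)):
            it += 1
            eta = eta0 / (1.0 + it)     # sum of eta_i diverges, sum of eta_i^2 converges
            out = sigmoid(w @ Xb[i])
            # gradient of (out - y_i)^2, using sigma'(z) = sigma(z) * (1 - sigma(z))
            w -= eta * 2.0 * (out - y[i]) * out * (1.0 - out) * Xb[i]
    return w

def accuracy(w, X, y):
    Xb = np.c_[np.ones(len(X)), X]
    return float(np.mean((sigmoid(Xb @ w) >= 0.5).astype(int) == y))
```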
The next result establishes under which conditions ε-representative points are classified with the same label as the points they represent.

Lemma 5 Let D̃ be an ε-representative dataset of the binary dataset D. Let N_w be a perceptron with weights w ∈ R^{n+1}. If ε ≤ min{ |wx| / ‖w‖ : (x, c_x) ∈ D }, then N_w(x) = N_w(x̃) for every (x, c_x) ∈ D and x̃ ≈_ε x.

Proof: First, if wx = 0 then ε = 0; therefore, x̃ = x and then N_w(x) = N_w(x̃). Now, let us suppose that wx < 0. Then y_w(x) < 1/2 and N_w(x) = 0 by definition of the perceptron. Since ε < |wx| / ‖w‖, the points x and x̃ belong to the same half-space determined by the hyperplane wx = 0. Therefore, wx̃ < 0, hence y_w(x̃) < 1/2 and, finally, N_w(x̃) = 0. Similarly, if wx > 0, then N_w(x̃) = 1 = N_w(x), concluding the proof.
By Lemma 5, we can state that if ε is "small enough", then the accuracy of the perceptron N_w evaluated on D and its accuracy evaluated on D̃ will coincide. Let us formally introduce the concept of accuracy.
Definition 9 The accuracy of the perceptron N_w evaluated on the binary dataset D is defined as: A(D, N_w) = (1/|D|) Σ_{(x, c_x) ∈ D} I_w(x), where, for any (x, c_x) ∈ D, I_w(x) = 1 if N_w(x) = c_x and I_w(x) = 0 otherwise.

The following result holds.
Finally, I_w(x) = I_w(x̃) for all x̃ ≈_ε x with (x̃, c_x̃) ∈ D̃, by Lemma 5, since ε < |wx| / ‖w‖ for all (x, c_x) ∈ D.
Next, let us compare the two errors E(w, X) and E(w, X̃) obtained when considering the binary dataset D and its λ-balanced ε-representative dataset D̃.
Theorem 3 Let D̃ be a λ-balanced ε-representative dataset of the binary dataset D. Then: where ρ_m (with m = 1, 2) was defined in Lemma 4 and, for each addend, x̃ ≈_ε x.

Proof: where, for each addend, x̃ ≈_ε x. Applying Lemma 4 with m = 1, 2 to the last expression, we get the stated bound.

From this last result we can infer the following: we can always fix the parameter ε "small enough" so that the difference between the error obtained when considering the dataset D and the error obtained with its ε-representative dataset D̃ is "close" to zero.
Theorem 4 Let δ > 0. Let D̃ be a λ-balanced ε-representative dataset of the binary dataset D and let N_w be a perceptron with weights w ∈ R^{n+1}. If ε ≤ (54 / (43 ‖w‖_*)) δ, then E(w, X) − E(w, X̃) < δ.

Proof: First, ρ_1 ≤ 1/4 and ρ_2 ≤ 8/27 by Corollary 3. Second, since c_x ∈ {0, 1}, we have a bound on the error difference to which the Hölder inequality can be applied. Therefore, by Lemma 3, if ε ≤ (54 / (43 ‖w‖_*)) δ, then E(w, X) − E(w, X̃) < δ, as stated.

Summing up, in the case of stochastic gradient descent, we have proved that it is equivalent to train a perceptron with the binary dataset D or with its λ-balanced ε-representative dataset D̃. This fact will be highlighted in Section 5.1 for the perceptron case and in Section 6 for neural networks with more complex architectures.

Experimental Results
In this section, two experiments are provided to support our theoretical results for the perceptron case and to illustrate the usefulness of our method. In the first experiment (Subsection 5.1.1), several synthetic datasets are presented, covering different possible scenarios. In the second experiment (Subsection 5.1.2), the Iris dataset is considered.
Besides, in the first experiment, random weight initialization is considered and the holdout procedure is applied (i.e., the datasets were split into a training set and a test set) to test the generalization capabilities. In the second experiment, the perceptron is initialized with random weights and trained with three different datasets: the original dataset, a representative dataset (the output of Algorithm 1), and a random dataset of the same size as the representative dataset; in all cases, the perceptrons are evaluated on the original dataset. These experiments support that a perceptron trained with a representative dataset achieves accuracy similar to that of a perceptron trained with the original dataset. Besides, we show that the training time, in the case of gradient descent training, is lower when using a representative dataset, and that representative datasets ensure good performance while the random dataset provides no guarantees. The implementation of the methodology presented here can be consulted online at https://github.com/Cimagroup/Experiments-Representative-datasets.

Synthetic datasets
In this experiment, different datasets were generated using a Scikit-learn python package implementation3. Roughly speaking, it creates clusters of normally distributed points in a hypercube and adds some noise. Specifically, we considered three different situations: (1) distributions without overlapping; (2) distributions with overlapping; and (3) a dataset with a "thin" class and a high ε. With the last case, we wanted to show that the choice of ε is important and that there are cases where representative datasets are not so useful. In all three cases, the perceptron was trained using the stochastic gradient descent algorithm and the mean squared error as the loss function.
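The exact generator call is not specified in the text; a plausible reconstruction with scikit-learn's make_classification, in which every parameter value is an illustrative assumption, is the following.

```python
# Plausible (assumed) reconstruction of the synthetic data generation and the
# 80%/20% holdout split used in the perceptron experiments.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=5000, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, class_sep=2.0,   # large class_sep: no overlapping
    flip_y=0.01, random_state=0,             # a small amount of label noise
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```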
In the first case (see Fig. 3a), 5000 points were taken with two well-differentiated clusters, i.e., without overlapping between classes. 20% of the points were selected for the test set and the rest of the points constituted the training dataset. Then, an ε-representative dataset of the training dataset was computed using Algorithm 1 with ε = 0.8, obtaining a dominating dataset with just 17 points. Similarly, 17 random points were chosen from the training dataset (see Fig. 3c). Later, a perceptron was trained on each dataset for 20 epochs and evaluated on the test set. The mean accuracy results after 5 repetitions were: 0.96 for the dominating dataset, 0.82 for the random dataset, and 0.98 for the training dataset. Besides, the random dataset reached very low accuracy in general.

In the second case (see Fig. 3d), a dataset composed of 5000 points with overlapping classes was generated. As in the first case, the dataset was split into a training and a test set. Then, the ε-representative dataset of the training dataset was computed using Algorithm 1 with ε = 0.5, resulting in a dominating dataset of size 22. After training a perceptron for 20 epochs and repeating the experiment 5 times, the mean accuracy values were: 0.73 for the dominating set, 0.67 for the random set, and 0.86 for the training set.

Finally, in the third case (see Fig. 3g), one of the classes was very "thin", in the sense that its points were very close to each other, displaying a thin line. Therefore, if a "big" ε is chosen, that class is represented by a dotted line, as shown in Fig. 3h, where ε = 0.8 reduces the dominating dataset to 15 points. With this example, we wanted to show a case where representative datasets are not so useful. The perceptron was trained for 20 epochs and the mean accuracy values over 5 repetitions were: 0.72 for the dominating set, 0.76 for the random set, and 0.99 for the training set. In terms of time, training for 20 epochs with the training set took around 20 seconds, while training with the dominating dataset took half a second; the computation of the dominating dataset took around 7 seconds. In Table 2, evaluation metrics on the test set are provided for a perceptron trained with the training, dominating, and random datasets computed from the synthetic datasets shown in Figure 3.

The Iris Dataset
In this experiment we used the Iris dataset4, which corresponds to a classification problem with three classes and is composed of 150 four-dimensional instances. We limited our experiment to two of the three classes, keeping a balanced dataset of 100 points. Algorithm 1 was applied to obtain an ε-representative dataset of 16 points with ε ≤ 0.5 (called the dominating dataset). A random dataset extracted from the original dataset with the same number of points as the dominating dataset was also computed for testing. These datasets are represented in R^3 in Fig. 4a, Fig. 4b and Fig. 4c, respectively. Besides, the associated persistence diagrams are shown in Fig. 5a, 5b and 5c. The Hausdorff and the 0-dimensional bottleneck distances between the original dataset and the dominating and random datasets are given in Table 3.
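A sketch of this setup (our own illustration, not the script from the repository) is shown below; only the use of two classes and the threshold ε ≤ 0.5, here approximated by ε = 0.5, are taken from the text.

```python
# Hedged sketch of the Iris setup: classes 0 and 1 are kept and an
# eps-representative (dominating) dataset is extracted per class.
import numpy as np
import networkx as nx
from scipy.spatial.distance import cdist
from sklearn.datasets import load_iris

iris = load_iris()
mask = iris.target < 2                        # binary problem: 100 points
X, y = iris.data[mask], iris.target[mask]

eps, keep = 0.5, []
for c in (0, 1):
    idx = np.flatnonzero(y == c)
    close = cdist(X[idx], X[idx]) <= eps      # eps-proximity graph of class c
    G = nx.from_numpy_array(np.triu(close, k=1))
    keep.extend(idx[list(nx.dominating_set(G))])

rng = np.random.default_rng(0)
rand = rng.choice(len(X), size=len(keep), replace=False)   # random baseline of equal size
print(f"dominating dataset: {len(keep)} points; random dataset: {len(rand)} points")
```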
We trained the perceptron with different initial weights and observed that the perceptrons trained with the dominating and the original datasets converged to similar errors. In Table 4, the difference between the errors using a fixed set of weights for the dominating and the random dataset is provided; the values correspond to the mean of the exact error differences and the bound obtained using 100 different random weights. In Table 5, different metrics are evaluated on the original dataset when training with the original dataset, the dominating set, and the random set; the table shows how the dominating dataset provides better metrics than the random dataset.

The Multi-layer Neural Network Case
In this section, we will check experimentally the usefulness of representative datasets for more complex neural network architectures. Two different experiments were carried out, using synthetic datasets and digit images. The implementation of the methodology can be found online at https://github.com/Cimagroup/Experiments-Representative-datasets.

Synthetic Datasets
This experiment consists of two different binary classification problems on synthetic datasets with 5000 points each. The datasets were split into a training set and a test set with proportions of 80% and 20%, respectively. Then, a dominating dataset and a random dataset of the same size were computed and used to train a 3 × 12 × 6 × 1 multi-layer neural network. The neural network used the ReLU activation function in the inner layers and the sigmoid function in the output layer, and was trained using stochastic gradient descent and the mean squared error as loss function for 20 epochs. Table 6 reports the time (in seconds) required to compute the dominating datasets using Algorithm 1 and the time (in seconds) required for the training process on the different datasets; the training method is gradient descent, except for the digits dataset (Section 6.2), where a multi-layer neural network was trained for 1000 epochs using the Adam training algorithm.
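The text does not state which deep-learning framework was used for this experiment, so the following sketch of the 3 × 12 × 6 × 1 network is written in Keras purely for illustration; only the architecture, the activations, the optimizer, the loss and the number of epochs are taken from the description above.

```python
# Illustrative Keras sketch (framework choice is an assumption) of the
# 3 x 12 x 6 x 1 network: ReLU in the inner layers, sigmoid output, SGD + MSE.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(3,)),
    tf.keras.layers.Dense(12, activation="relu"),
    tf.keras.layers.Dense(6, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.SGD(), loss="mse", metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=20)   # train on the original, dominating or random dataset
```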
In the first case (see Fig. 6a), an unbalanced dataset with overlapping was considered, and an ε-dominating dataset composed of 67 points was computed with ε = 0.8 (see Fig. 6b). Then, a random dataset with the same size as the dominating dataset was considered. The neural network was trained and the mean accuracy values after 5 repetitions were: 0.85 for the dominating set, 0.74 for the random set, and 0.86 for the training set. In the second case, a balanced dataset with overlapping was considered (see Fig. 6d) and the same process as in the first case was followed, but with ε = 0.3, obtaining a dominating set of size 319. Then, the mean accuracy values after 5 repetitions were: 0.92 for the dominating set, 0.91 for the random set, and 0.93 for the training set. In Table 7, different evaluation metrics on the test set for the two cases are shown.

The Digits Dataset
The dataset5 used in this experiment consists of images of digits classified into 10 different classes corresponding to the digits from 0 to 9. An example of an image of each class can be seen in Fig. 7. The dataset is composed of 1797 64-dimensional instances. Algorithm 1 was applied with ε = 0.2 to obtain a dominating dataset of size 173. The corresponding persistence diagrams can be seen in Fig. 5d, Fig. 5e and Fig. 5f, and the Hausdorff and bottleneck distances are shown in Table 3. In this case, we used a neural network with 64 × 400 × 300 × 800 × 300 × 10 neurons, with the sigmoid activation function in the hidden layers and the softmax activation function in the output layer. The neural network was trained for 1000 epochs using the Adam training algorithm; the results are given in Table 5, where the dominating dataset outperforms the random dataset. Finally, Table 8 reports, for different values of ε, the size of the dominating dataset and the mean accuracy (after 5 repetitions) of the neural network trained on the dominating dataset and evaluated on both the dominating dataset and the original dataset.
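Analogously, a hedged Keras sketch of the digits network follows; the text specifies the architecture, the activations, the Adam optimizer and the 1000 epochs, while the loss function below (sparse categorical cross-entropy) is our assumption.

```python
# Illustrative Keras sketch of the 64 x 400 x 300 x 800 x 300 x 10 digits network;
# the loss is an assumption, since the text does not specify it.
import tensorflow as tf

digits_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64,)),
    tf.keras.layers.Dense(400, activation="sigmoid"),
    tf.keras.layers.Dense(300, activation="sigmoid"),
    tf.keras.layers.Dense(800, activation="sigmoid"),
    tf.keras.layers.Dense(300, activation="sigmoid"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
digits_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                     metrics=["accuracy"])
# digits_model.fit(X_dominating, y_dominating, epochs=1000)   # e.g., the dominating dataset
```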

Conclusions and future work
The success of practical applications and the availability of new hardware (e.g., GPUs Gu et al. [2017] and TPUs Jouppi et al. [2017]) have led neural network research to focus on the development of new architectures rather than on theoretical issues. Such new architectures are one of the pillars of future research in neural networks, but a deeper understanding of the structure of the data is also necessary for the development of the field, as shown by new discoveries on adversarial examples Yuan et al. [2019] or by Toneva et al. [2019], where the redundancy of several datasets is empirically demonstrated.
In this paper, we propose the use of representative datasets as a new approach to reduce the learning time of neural networks based on the topological structure of the dataset. Specifically, we have defined representative datasets using a notion of nearness for which the Gromov-Hausdorff distance is a lower bound. Nevertheless, the bottleneck distance between persistence diagrams (which is, in turn, a lower bound of the Gromov-Hausdorff distance) is used to measure the representativeness of the dataset, since it is computationally less expensive. Furthermore, the agreement between the provided theoretical results and the experiments supports that representative datasets can be a good approach to reach an efficient "summarization" of a dataset to train a neural network. First, in the case of gradient descent, the training process is significantly faster when using a representative dataset.

Figure 1 :
Figure 1: A point cloud sampling two interlaced solid tori and the ε-proximity graph of one of them for a fixed ε.

Theorem 2
Let D̃ be a λ-balanced ε-representative dataset of the binary dataset D. Let N_w be a perceptron with weights w ∈ R^{n+1}. If ε ≤ min{ |wx| / ‖w‖ : (x, c_x) ∈ D }, then A(D, N_w) = A(D̃, N_w). Proof: Since D̃ is a λ-balanced ε-representative dataset of D, then |D| = λ · |D̃| and we have:

Figure 3 :
Figure 3: Different synthetic datasets generated using the Scikit-learn python package implementation. The first column corresponds to the original datasets, the second column to dominating datasets of the original datasets, and the third column to random subsets of the original datasets of the same size as the corresponding dominating sets.

Figure 4 :
Figure 4: Visualization of the Iris dataset: the original dataset composed of 100 points, the dominating dataset, and the random dataset composed of 16 points.
(a) (Iris dataset) Persistence diagram of the dataset given in Fig. 4a. (b) (Iris dataset) Persistence diagram of the dataset given in Fig. 4b. (c) (Iris dataset) Persistence diagram of the random dataset given in Fig. 4c. (d) (Digits dataset) Persistence diagram of the dataset used in Section 6.2. (e) (Digits dataset) Persistence diagram of the dominating dataset used in Section 6.2. (f) (Digits dataset) Persistence diagram of the random dataset used in Section 6.2.

Figure 5 :
Figure 5: Persistence diagrams of the set of points in the datasets used/computed in the experiments.

Figure 6 :
Figure 6: Different synthetic datasets generated using the Scikit-learn python package implementation. The first column corresponds to the original datasets, the second column to dominating datasets computed from the original datasets, and the third column to random subsets of the original data of the same size as the corresponding dominating sets.

Figure 7 :
Figure 7: (Digits dataset) Example of an image of each class of the digits dataset. They are 32 × 32 arrays in gray scale.

Table 3 :
The Hausdorff and the 0-dimensional bottleneck distances of two different datasets (named Iris and Digits) and their corresponding dominating and random datasets.

Table 5 :
The different metrics for the Iris and the Digits dataset experiments. The values are the means over 5 repetitions of the experiments and are evaluated on the full original dataset.

Table 7 :
Evaluation metrics on the test set for the training of a multi-layer neural network with the training set, the dominating set and the random set of the synthetic dataset experiment.