Random-based networks with dropout for embedded systems

Random-based learning paradigms exhibit efficient training algorithms and remarkable generalization performances. However, the computational cost of the training procedure scales with the cube of the number of hidden neurons. The paper presents a novel training procedure for random-based neural networks, which combines ensemble techniques and dropout regularization. This limits the computational complexity of the training phase without affecting classification performance significantly; the method best fits Internet of Things (IoT) applications. In the training algorithm, one first generates a pool of random neurons; then, an ensemble of independent sub-networks (each including a fraction of the original pool) is trained; finally, the sub-networks are integrated into one classifier. The experimental validation compared the proposed approach with state-of-the-art solutions, by taking into account both generalization performance and computational complexity. To verify the effectiveness in IoT applications, the training procedures were deployed on a pair of commercially available embedded devices. The results showed that the proposed approach overall improved accuracy, with a minor degradation in performance in a few cases. When considering embedded implementations as compared with conventional architectures, the speedup of the proposed method scored up to 20× in IoT devices.


Introduction
Edge computing and Internet of Things (IoT) are crucial areas in modern electronics [26,42], involving important domains such as healthcare [39,41], intelligent transportation [40], and multimedia communications [38]. Deep learning paradigms [14] prove effective in those applications, but resource-constrained devices cannot support the training process [19], and even deploying trained models in embedded systems still remains a challenging task.
Traditional approaches such as single-layer feed-forward neural networks (SLFNNs) and support vector machines (SVMs) can be trained with a relatively small amount of computational resources. Random-based networks (RBNs) such as random radial basis functions [28], random vector functional links (RVFLs) [31], extreme learning machines (ELMs) [17,18], and weighted sums of random kitchen sinks [36] offer interesting opportunities. The major advantage of the latter paradigms is that the training process only requires solving a system of linear equations, and can therefore be supported by resource-limited devices. In addition, the small number of hyper-parameters that characterize those models reduces the complexity of model fitting. As a result, this approach might provide a viable option for custom ad hoc applications, featuring the capability of automatic tuning in compliance with the users' needs.
Several effective solutions based on these models have been proposed in the literature for a variety of applications [5,22,52], yet the deployment of stand-alone solutions on inexpensive, resource-constrained devices remains difficult [20], for two reasons.
First, the proposed designs mostly rely on reconfigurable platforms such as field programmable gate arrays (FPGAs) [9,10,37,50], which may prove quite expensive. By contrast, implementations on micro-controllers or microcomputers have drawn limited attention, in spite of the fact that these devices best fit IoT applications and remarkably shrink the time-to-market of commercial products [1,17].
Second, the existing approaches in the literature aimed to improve the generalization capabilities of RBNs by including some strategy to select effective neurons in the eventual predictors. This often went with a parallel increase in computational costs. For instance, an optimization problem favored sparse solutions and removed ineffective neurons [29]; multiple sparse regression and leave-one-out mechanisms took out the least informative neurons. The pruning process proved computationally demanding and added some hyper-parameters to the underlying optimization problem. Likewise, recent attempts [11,34] to reduce the number of inactive neurons by a light model selection scheme brought about some increase in the computational cost of training. Biologically inspired optimization stimulated self-adaptive evolutionary ELM [4], dolphin swarm ELM [45], genetic ensemble of ELM [47], particle swarm optimization-based ELM [46], and artificial immune system-based ELM [44]. These approaches all adopted nature-inspired strategies to enhance the classifiers' generalization abilities, but overall proved computationally demanding, especially because they mostly involved non-convex optimization problems.
Recent advances in deep learning models [43] introduced novel regularization techniques that improved over traditional methods and boosted specific applications (e.g., smart IoT devices). Within that context, dropout regularization is a popular technique for deep network training [43]: the underlying idea is that a network should represent an input sample in several ways, thus yielding a robust representation of the sample itself. This is attained by switching off a varying subset of neurons during each iteration of the gradient descent optimization algorithm. This mechanism has been applied successfully to ELMs [21] by adding a regularization term to the basic cost function, again with a consequent increase in the computational complexity of training. In [51], an ensemble implemented a dropout mechanism and applied fuzzy logic to combine the outputs of the individual classifiers. Similar approaches [23,25,27] combined ensemble mechanisms with random-based networks. These works privileged the predictors' generalization performances over the computational costs of training.
This paper describes a hardware-friendly dropout training strategy for RBNs, whose targets are resource-constrained devices with non-varying hardware architectures, such as micro-controllers. As compared with the existing literature, the paper presents a procedure that limits the computational cost of the training phase on these devices. The proposed approach determines the eventual linear predictor for a network with $N$ hidden neurons by merging an ensemble of $Q$ linear predictors; each element of the ensemble is a network holding $\tilde{N} < N$ neurons. Figure 1 outlines the underlying architecture. The contribution of each neuron (i.e., its weight) in the eventual network results from the summation of the non-dropped corresponding weights in each sub-network.
As compared with the methods cited above [23,25,27,51], this schema exhibits a hybrid ensemble/dropout training procedure, which best exploits the regularization properties of the dropout process, while limiting the computational cost of the training process.
The empirical validation on a set of eight well-known benchmarks and three real-world datasets recently used to test IoT algorithms confirmed that the proposed method could yield a satisfactory trade-off between generalization performances and hardware requirements for training. To prove the effectiveness of the electronic design, the training algorithm was implemented on a pair of low-power, resource-constrained devices, namely the Broadcom BCM2837B0 Quad-core Cortex-A53, and an Allwinner H3, Quad-core Cortex-A7.

Contribution
The major contributions of the approach described in this paper can be summarized as follows:
• A novel training algorithm for RBN classifiers, featuring a low-cost optimization procedure that yet preserves the generalization ability of the eventual predictors.
• A design strategy to support the training of RBNs in embedded devices.
• For a given training set and a specified network size, an analytical formulation of the cost function yields a preliminary assessment of the required number of operations when implemented in commercial devices.
• The demonstration of the method's effectiveness in two mainstream devices for IoT applications and 11 real-world benchmarks.

Fig. 1 Schematic description of the low-cost training procedure
The paper is organized as follows: Sect. 2 reviews both the ELM model and the dropout regularization scheme. Section 3 illustrates the novel training strategy, also in comparison with related works. Sections 4 and 5 report on the experimental results, whereas some concluding remarks are made in Sect. 6.

Extreme learning machine
The ELM model features a vast, long-standing literature within the existing RBN approaches. Let $\mathcal{X}$ be the input domain (typically, $\mathcal{X} \in \mathbb{R}^D$, $D \in \mathbb{N}^+$), while $\mathcal{T} = \{(x, y)_i;\ x \in \mathcal{X},\ y \in \{-1, 1\},\ i = 1, \ldots, Z\}$ denotes a labeled training set drawn i.i.d. from a fixed, unknown distribution, $P$.
The parameters of the hidden layer $(\beta_j \in \mathbb{R}^D, b_j \in \mathbb{R})$, $j = 1, \ldots, N$, are random; hence, the layer implements a fixed mapping of the input space, $\mathcal{X}$, into $\mathbb{R}^N$. The ELM training process optimizes the output weights $\omega \in \mathbb{R}^N$ by solving a regularized least squares (RLS) problem in the remapped space [17]. Let $H$ denote the activation of the hidden layer as a $Z \times N$ matrix, where $h_{ij} = h_j(x_i; \beta_j, b_j)$ is the activation of the $j$th neuron for the $i$th input sample and $Z$ is the number of samples in the training set. Then, the associated learning problem can be expressed as

$$\min_{\omega} \left\{ \|y - H\omega\|^2 + \lambda \|\omega\|^2 \right\} \qquad (1)$$

When $Z \le N$, one has:

$$\omega = H^T \left(\lambda I + H H^T\right)^{-1} y \qquad (2)$$

Conversely, when $Z > N$, one has:

$$\omega = \left(\lambda I + H^T H\right)^{-1} H^T y \qquad (3)$$

In the following, it will be assumed $Z > N$ without loss of generality. The eventual classifier can be written as

$$f(x) = \mathrm{sign}\left( \sum_{j=1}^{N} \omega_j\, h_j(x; \beta_j, b_j) \right) \qquad (4)$$
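The training step described above can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions that the text does not state: a tanh activation, Gaussian random hidden parameters, synthetic data, and the hypothetical helper names `train_elm` and `predict_elm`. It covers the case $Z > N$ by solving the regularized normal equations.

```python
import numpy as np

def train_elm(X, y, n_hidden, lam, rng):
    """Basic ELM: random hidden layer, ridge-regression output weights."""
    D = X.shape[1]
    beta = rng.standard_normal((D, n_hidden))  # random input weights (assumed Gaussian)
    b = rng.standard_normal(n_hidden)          # random biases (assumed Gaussian)
    H = np.tanh(X @ beta + b)                  # Z x N activations (tanh is an assumption)
    # Output weights: solve (H^T H + lam I) omega = H^T y
    A = H.T @ H + lam * np.eye(n_hidden)
    omega = np.linalg.solve(A, H.T @ y)
    return beta, b, omega

def predict_elm(X, beta, b, omega):
    """Sign of the weighted sum of hidden activations."""
    return np.sign(np.tanh(X @ beta + b) @ omega)

# Synthetic two-class problem with labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
beta, b, omega = train_elm(X, y, n_hidden=20, lam=1e-2, rng=rng)
train_acc = np.mean(predict_elm(X, beta, b, omega) == y)
```

Only the output weights are learned; the hidden parameters stay fixed after the random draw, which is why training reduces to one linear solve.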

Dropout regularization
The dropout technique was first introduced in [43] to tackle the overfitting problem, which affects the training of fully connected networks when adopting iterative algorithms. For a single-layer, fully connected network trained with a gradient descent algorithm, the quantity $H_t$ will denote the activation of the hidden layer at the $t$th iteration of the training process. The dropout procedure ignores a subset of neurons at each training iteration; that subset varies from one iteration to another. A vector $m \in \{0, 1\}^N$ of Bernoulli random variables keeps track of this mechanism: at each iteration, the variable $m_j$, $j = 1, \ldots, N$, takes on the values $\{1, 0\}$ with probabilities $p$ and $1 - p$, respectively. At iteration $t$, the vector $m_t$ is drawn at random and multiplies element-wise the columns of matrix $H_t$. Thus, the contributions of all neurons whose multipliers in $m_t$ are zero vanish for that iteration.
A major advantage of dropout is that the resulting network actually represents the (sparse) average of an ensemble of smaller networks; this notably simplifies the overfitting problem. Moreover, the dynamic exclusion mechanism makes it less likely that an input sample strictly corresponds to one specific neuron of the network. As a consequence, the solution of the optimization problem gets more robust even when a specific neuron is removed.
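The masking step of the dropout procedure can be illustrated as follows. The sketch assumes illustrative sizes, a keep-probability of 0.5, and a random matrix standing in for the hidden activations; none of these values comes from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
Z, N, p = 6, 8, 0.5                  # illustrative sizes and keep-probability
H_t = rng.standard_normal((Z, N))    # stand-in for the activations at iteration t

m_t = rng.binomial(1, p, size=N)     # m_j = 1 with probability p, 0 with probability 1 - p
H_dropped = H_t * m_t                # broadcasting multiplies column j by m_j
```

Columns whose mask entry is 0 become null, so the corresponding neurons do not contribute to that iteration; a new mask is drawn at the next iteration.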

Dropout extreme learning machine
Iosifidis et al. [21] recently applied the concept of dropout regularization to extreme learning machines and proposed an augmented loss function, expression (5), to set the learning process. The first term in the expression (5) aims to regularize the output weights $\omega$ and takes into account the geometrical information of the input space through the use of matrix $S$; this term is not related to the dropout mechanism. The conventional mean square error characterizes the second term. The third term takes into account the $N_T$ sub-networks; it includes the matrix $\hat{H} = H - H'$, where $H'$ is obtained from matrix $H$ by setting a set of rows to 0 and is expected to introduce the benefits of dropout mechanisms into the eventual classifier. The weighting hyper-parameters $\gamma$ and $\lambda$ balance the contributions of the corresponding terms. The minimum of (5) involves a summation over the $N_T$ sub-networks. The summation becomes an expectation when $N_T \to \infty$; the resulting expression involves the matrix $P = [(p^T p) \circ (1^T 1 - I)] + [(1^T p) \circ I]$; here $p$ is a row vector with the dropout probability of each neuron, $1$ is a row vector whose elements are all equal to 1, and $\circ$ denotes the element-wise product. The formulation (5) leads to a closed-form solution. It is worth noting that, in terms of asymptotic computational complexity, this solution exhibits the same complexity as the expression (3). Section 3.2 analyzes the associated computational costs in detail.

Neural Computing and Applications (2021) 33:6511-6526

Constrained ensemble-based training procedure

Dropout and local ensemble for efficient training
The first step of the training procedure complies with the basic ELM model: a set of $N$ neurons remaps the training set, $\mathcal{T}$, into a space $H \in \mathbb{R}^{Z \times N}$. The quantities $Z$ and $N$ set the size of the matrix $H$. The constrained ensemble-based approach now applies a sub-sampling of the matrix $H$. At each step $q$ of the sub-sampling process, a subset of $\tilde{Z}$ rows and $\tilde{N}$ columns is drawn at random and considered to evaluate the cost function. The idea of using a subset of $\tilde{N}$ neurons introduces the dropout mechanism. This leads to a reformulated training problem:

$$\min_{\tilde{\omega}_q} \left\{ \|\tilde{y}_q - \tilde{H}_q \tilde{\omega}_q\|^2 + \lambda \|\tilde{\omega}_q\|^2 \right\} \qquad (10)$$

where $\tilde{H}_q \in \mathbb{R}^{\tilde{Z} \times \tilde{N}}$ is the sub-sampled matrix. The reduced vector of target classes, $\tilde{y}_q$, is obtained by removing from $y$ the target values associated with the rows that have been disregarded in $H$. Reducing the size of the input set clearly speeds up training. In addition, by sub-sampling the training set one limits the correlation among predictors, thus enhancing the generalization performance of the overall ensemble.
To clarify the advantages of the sub-sampling process, one might rewrite the optimization problem (10) in terms of the full matrix:

$$\min_{\omega'_q} \left\{ \|y'_q - H'_q \omega'_q\|^2 + \lambda \|\omega'_q\|^2 \right\} \qquad (11)$$

In the matrices $H'_q \in \mathbb{R}^{Z \times N}$ and $y'_q \in \mathbb{R}^{Z}$, the elements corresponding to the rows/columns disregarded in the $q$th sub-sampling iteration (10) are set to zero. Thus, the training process (11), characterized by explicit sparseness, involves a matrix $H'_q$ with size $Z \times N$. By contrast, the formulation (10) just requires training $Q$ predictors, each associated with one of $Q$ sub-problems. To obtain the solution $\omega'_q$ of the complete problem (11), one augments each vector $\tilde{\omega}_q$ by introducing null elements in correspondence with those neurons that had been disregarded in the sub-sampling step. Figure 2 illustrates the computation of $\omega'_q$ from $\tilde{\omega}_q$, in a demo network with $N = 8$ neurons. In the example, the sub-sampling procedure shrinks the number of neurons to $\tilde{N} = 5$, and the training algorithm works out the values $\tilde{\omega}_q$ for the 5 neurons. The overall predictor $\omega'_q$ is then worked out by padding the solution with zeros in correspondence with the indices that have been excluded by the sampling process.
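The zero-padding step of the demo network (8 neurons, 5 of which survive the sub-sampling) can be sketched as follows. The helper name `pad_solution`, the kept indices, and the weight values are hypothetical and serve only as an illustration.

```python
def pad_solution(omega_tilde, kept_indices, n_total):
    """Place the reduced weights at the kept neuron indices; dropped neurons get 0."""
    omega_full = [0.0] * n_total
    for idx, w in zip(kept_indices, omega_tilde):
        omega_full[idx] = w
    return omega_full

# Hypothetical values: 5 of the 8 neurons survive the sub-sampling step.
kept = [0, 2, 3, 5, 7]
omega_tilde = [0.4, -1.2, 0.7, 0.1, -0.3]
omega_full = pad_solution(omega_tilde, kept, n_total=8)
# omega_full == [0.4, 0.0, -1.2, 0.7, 0.0, 0.1, 0.0, -0.3]
```

The padded vector has the size of the full hidden layer, so predictors obtained from different sub-samplings can be added element by element.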
The eventual predictor for the explicit space $H$ is obtained by summing up the $Q$ linear predictors $\omega'_q$:

$$\omega^* = \sum_{q=1}^{Q} \omega'_q \qquad (12)$$

The overall vector $\omega^*$ is assembled at the end of the training process; hence, the computational cost of the network during the inference phase remains unaffected, as compared with the basic ELM model. Although the eventual predictor might appear similar to a local ensemble, as in dropout regularization [43], the proposed solution exhibits a few computational advantages. First, the $Q$ sub-networks all share the same input space; as a result, the $Q$ matrices $\tilde{H}_q$ can all be derived from $H$, which is only computed once. This reduces the load that would be required by the explicit mapping of $\tilde{H}_q$ in each predictor. In addition, the overall prediction results from the activation of one network rather than from adding the predictions of all sub-networks. As a consequence, the eventual prediction inherently embeds a weighting mechanism that characterizes refined ensemble strategies.

Fig. 2 Example showing the sub-sampling mechanism and the conversion from $\tilde{\omega}_q$
Algorithm 1 outlines the overall training procedure. Figure 3 illustrates Algorithm 1 in a graphic form. The three boxes in the graph correspond to the three steps of the algorithm. After remapping the training data (Mapping box in the figure), the matrix $H$ is sub-sampled $Q$ times and the $Q$ optimization problems are solved independently (Learning box). The schema highlights the parallel computations of the simpler $Q$ problems, which can therefore be supported by resource-constrained devices by means of multiprocessing. The final step (Output box) merges the individual linear separators to work out the overall predictor $\omega^*$.
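The three steps (mapping, learning, output) can be sketched in NumPy as follows. This is a sequential illustration under stated assumptions rather than the authors' implementation: the function name `train_constrained_ensemble`, the tanh activation, the Gaussian hidden parameters, and the synthetic data are all hypothetical, and the loop runs serially although the $Q$ sub-problems are independent and could run in parallel.

```python
import numpy as np

def train_constrained_ensemble(H, y, Q, n_sub, z_sub, lam, rng):
    """Sum Q ridge predictors, each trained on a random row/column subset of H."""
    Z, N = H.shape
    omega_star = np.zeros(N)
    for _ in range(Q):
        rows = rng.choice(Z, size=z_sub, replace=False)  # sub-sampled training data
        cols = rng.choice(N, size=n_sub, replace=False)  # neurons kept (the rest are dropped)
        H_q = H[np.ix_(rows, cols)]
        y_q = y[rows]
        # Reduced problem: (lam I + H_q^T H_q) omega_q = H_q^T y_q
        omega_q = np.linalg.solve(lam * np.eye(n_sub) + H_q.T @ H_q, H_q.T @ y_q)
        omega_star[cols] += omega_q                      # zero-padded sum over the Q predictors
    return omega_star

# Mapping step: H is computed once and shared by all sub-networks
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 4))
y = np.where(X[:, 0] - X[:, 2] > 0, 1.0, -1.0)
H = np.tanh(X @ rng.standard_normal((4, 40)) + rng.standard_normal(40))

omega_star = train_constrained_ensemble(H, y, Q=10, n_sub=20, z_sub=210, lam=1e-2, rng=rng)
y_hat = np.sign(H @ omega_star)  # inference uses a single network with N neurons
```

Each linear solve involves a matrix of size `n_sub` rather than the full hidden-layer size, which is where the reduction in training cost comes from.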

Analysis of computational cost
Several issues hinder the training of a SLFNN on resource-constrained devices. The expression (3) sets the computational cost for both the standard L2 regularized ELM and the proposed approach. Nonetheless, solving (9) involves an additional computational overhead; at the same time, memory occupation and latency bring about the major constraints and are affected by multiple factors.

Algorithm 1
1. Mapping
   remap the training set into the matrix $H$ by means of the $N$ random neurons
2. Learning
   for q = 1; q <= Q; q++ do
      extract random sample $\tilde{H} \in H$, $\tilde{y} \in y$
      train a linear predictor: $\tilde{\omega}_q = (\lambda I + \tilde{H}^T \tilde{H})^{-1} \tilde{H}^T \tilde{y}$
      compute equivalent solution $\omega'_q$
   end for
3. Output
   compute linear predictor in the space $H$: $\omega^* = \sum_q \omega'_q$

Input data remapping
The first step in RBN training is the remapping of the input data, which can be formalized as

$$H = h\left(X\beta + \mathrm{repmat}(b)\right) \qquad (13)$$

where $X$ is the matrix of training data, $\beta \in \mathbb{R}^{D \times N}$ and $b \in \mathbb{R}^{N \times 1}$ are the network parameters, and the repmat operator just appends the bias value, $b$, to each training datum. The number of multiplications and additions scales as $O(Z \times N \times D)$. When adopting linear approximations [30], the computation of $N \times N$ nonlinear terms requires a minimum of $2 \times N \times N$ additional operations. The number of operations increases if one involves more accurate approximations.

Optimization
Then, one tackles the actual optimization problem in the remapped space. Two main sub-steps determine the associate computational time: the matrix multiplication H T H and the solution of the associate system of linear equations. Different strategies can apply depending on the hardware resources available: for example, the matrix multiplication can be carried out in parallel.
In the ideal case of unbounded computational resources, the matrix multiplication can complete in two clock cycles: first, a set of $Z$ parallel HW units carry out individual inner products (to compute each element of the result); then, the resulting individual terms feed an adder circuit having $Z$ inputs. By using $N^2$ of such product/adder blocks and assuming that each multiplication/addition completes in 1 clock cycle, the overall process completes in 2 clock cycles. Such an unrealistic solution just sets an upper bound to timing performance. Conversely, in an opposite, worst-case HW configuration including one floating point unit, the best known computational bound is $O(N^{2.37286})$ [24] for a pair of square matrices. Again, such a setup seems unrealistic because the largest term in the computational cost scales as $aN^{2.37286}$, where $a \gg N^{2.37286}$ for reasonable values of $N$. As a consequence, the method proposed in [24] becomes convenient only when the matrices are asymptotically large. The literature offers several practical approaches, based on the number of computational units, memory structures, and memory size. Conventional solutions rely on the Strassen matrix multiplication algorithm, which scales as $O(N^{2.807})$ [16]. A speedup is obtained, for square input matrices, when the matrix size is larger than 100; that algorithm also scales efficiently for the rectangular matrices, $H$, that characterize the training process.
The solution of a linear equation system is less prone to a parallel approach. Existing algorithms scale as the third power of the number of variables (i.e., the number of neurons $N$). The literature shows that Singular Value Decomposition (SVD) [17] yields satisfactory numerical solutions, whose computational cost can be roughly approximated by $12N^3$ [13]. The linear equation system in Eq. (3) just involves a matrix $(H^T H + \lambda I)$ that is Hermitian and reasonably well conditioned. This allows one to adopt Cholesky decomposition as a reference model in terms of memory and computational cost. This procedure scales as $N^3/3$ and proves more efficient than the conventional LU factorization, which scales as $2N^3/3$. Forward and backward substitutions are eventually required to complete the procedure, introducing $2N^2$ additional operations.
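A hedged sketch of the Cholesky-based solution, together with the leading-order operation counts quoted above, is given below. The function names `solve_cholesky` and `flops_estimates` and the random test data are assumptions made for illustration.

```python
import numpy as np

def solve_cholesky(H, y, lam):
    """Solve (H^T H + lam I) omega = H^T y via Cholesky factorization."""
    N = H.shape[1]
    A = H.T @ H + lam * np.eye(N)      # symmetric positive definite for lam > 0
    L = np.linalg.cholesky(A)          # A = L L^T, with L lower triangular
    # np.linalg.solve is a general solver; a dedicated triangular solver
    # would realize the ~2 N^2 substitution cost quoted in the text.
    z = np.linalg.solve(L, H.T @ y)    # forward substitution step
    return np.linalg.solve(L.T, z)     # backward substitution step

def flops_estimates(N):
    """Leading-order operation counts as quoted in the text."""
    return {
        "svd": 12 * N**3,
        "lu": 2 * N**3 / 3,
        "cholesky": N**3 / 3 + 2 * N**2,  # factorization plus substitutions
    }

rng = np.random.default_rng(0)
H = rng.standard_normal((100, 10))
y = rng.standard_normal(100)
omega = solve_cholesky(H, y, lam=0.1)
```

For any $\lambda > 0$ the matrix is positive definite, so the factorization is always defined; the counts are rough estimates and ignore memory traffic.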

Model selection
The model selection strategy heavily affects the overall cost of the training process. A naive approach to model selection would require iterating the training procedure once for each setting of the hyper-parameters.
When considering $\lambda$, the pair of matrix multiplications $H^T H$ and $H^T y$ in the expression (3) need not be recomputed for different values of $\lambda$. Likewise, by using SVD one need not work out again the matrices $U$, $S$, $V$, since the summation with the diagonal matrix $\lambda I$ only affects the elements of $S$ [13]. SVD efficiently supports spectral regularization techniques [12], as well. On the other hand, the Cholesky algorithm requires carrying out the complete procedure for each setting of $\lambda$.
The number of trials required to set the number of neurons, $N$, is critical to determine the computational cost. The eventual procedure benefits from the properties of matrix multiplications, which allow one to avoid the recomputation of the whole matrix $H^T H$ as part of the learning process [13].
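The reuse of a single decomposition across the regularization grid can be sketched as follows. The sketch assumes that the SVD is applied to the activation matrix itself (the text does not specify which matrix is decomposed) and uses the hypothetical helper name `ridge_path_svd` with random data.

```python
import numpy as np

def ridge_path_svd(H, y, lambdas):
    """Output weights for several lam values from a single SVD of H."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)  # computed once
    Uty = U.T @ y                                     # also independent of lam
    # omega(lam) = V diag(s / (s^2 + lam)) U^T y
    return [Vt.T @ ((s / (s**2 + lam)) * Uty) for lam in lambdas]

rng = np.random.default_rng(0)
H = rng.standard_normal((80, 12))
y = rng.standard_normal(80)
lambdas = [10.0**i for i in range(-6, 7)]  # 13 values, 1e-6 to 1e6
omegas = ridge_path_svd(H, y, lambdas)
```

Only the rescaling of the singular values changes with the regularization parameter, so each additional setting costs a few matrix-vector products instead of a new factorization.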
In typical on-board IoT applications, the number of neurons and the size of the training set are not asymptotically large due to memory constraints. As a result, the overall approximation of the cost function needs to be fine grained. In the case of L2-regularized ELMs, the expression (14) gives an adequate approximation of the number of floating point operations required for training. In (14), the linsolve operator embeds the solution of the linear equation system, whose cost is identified by $(a \cdot N^n + c)$, and $k$ is the number of different settings for the hyper-parameter $\lambda$.
For each value of $\lambda$, the predictor is computed over the validation set, marked with the subscript ''val'' in (14). The proposed equation considers all the matrix-matrix and matrix-vector operations involved in the solution of the learning problem (1). One might argue that (14) does not take into account the model selection for the parameter $N$; this seems a reasonable assumption when addressing resource-constrained devices, where $N$ is limited due to hardware constraints. The computation of (9) also requires working out matrices $P$ and $S$. Moreover, the model selection procedure optimizes three hyper-parameters: $\lambda$, $\gamma$, and $p$. As a consequence, the expression (14) actually sets a lower bound to the computational complexity of both (9) and (6).

Overall computational cost
In the novel training procedure proposed in this paper, the mapping phase matches the basic model. Instead, the augmented cost function (11) involves the second phase of the training process after remapping. Thus, the overall computational cost of the approach proposed in this paper can be written as in (15), where $\tilde{N}$ and $\tilde{Z}$ are the sizes of the sub-sampled sets of neurons and training data, respectively.
The expression (15) shows that the proposed approach is most effective when $Q$ is small. The term $Q$ affects the overall cost almost linearly, whereas the impact of the quantities $\tilde{Z} < Z$ and $\tilde{N} < N$ is quadratic and cubic, respectively. In this regard, Fig. 4 shows the behavior of the cost function (14) for different values of the parameters $Z$ and $N$. The graph confirms that the computational cost rapidly decreases as $Z$ and $N$ decrease.
Interestingly, the eventual speedup of the training phase presented in [51] becomes marginal as compared with that obtained by the approach proposed in this paper. Furthermore, the forward phase of the predictors presented in [51] requires time-consuming operations such as sorting. As a result, the predictor [51] proves more computationally demanding than traditional SLFNNs.

Comparison with related works
Ensemble approaches combined with RBNs [23,25,27] usually aim to enhance generalization performance disregarding the associate computational cost. State-of-the-art works do not set constraints on the number of neurons of the overall ensemble. As a consequence, both memory occupation and latency of the inference phase increase. The solution proposed here, instead, aims to balance generalization performances and computational cost: it sets the size of the hidden layer, N, a priori, based on the available memory. Then, individual learners all sample the random neurons from the same pool. Finally, learners merge into a single network, i.e., the eventual classifier, having size N. As a consequence, the inference phase has the same computational cost of standard feed-forward networks with N neurons.
The proposed approach outperforms state-of-the-art dropout-based methods in terms of computational cost. In [51], an additional fuzzy logic affected the computational complexity of the training phase. Furthermore, the forward phase of the predictors required complex operations such as sorting; hence, the classifier proved less efficient than traditional SLFNNs.
In [21], the authors introduced a novel regularization term, and the cost function still involved the solution of a linear equation system. This operation roughly scales as $N^3$. Conversely, in the proposed approach the training procedure involves a set of $Q$ problems, each scaling as $\tilde{N}^3$. The results presented in Sect. 4 will confirm that satisfying results can be achieved with $Q \times \tilde{N}^3 \ll N^3$. In addition, Algorithm 1 highlights a crucial difference with respect to the approach presented in [21], where the authors focused on the minimization of (5) and forced the solution $\omega$ to be valid for any sub-network (obtained by the dropout procedure) and the original complete network. By contrast, the procedure proposed in this work obtains the solution by combining a set of independent learners. Using a subset of training data, one might affect the regularization properties of the dropout method. At the same time, the $Q$ independent learners are expected to be orthogonal to a certain extent. This in turn should increase the generalization ability of the eventual ensemble and limit the risk of overfitting accordingly [33].
When considering electronic implementations, the method proposed in this paper addresses the efficient deployment on inexpensive devices. The related design approaches typically aimed at efficient digital implementations of RBNs in configurable architectures. A very large-scale integration (VLSI) architecture was the main target in [3,6]; the approach presented in [32] envisioned analog implementations and combined a tri-state activation function with an offline pruning procedure to limit the predictor complexity. The models proposed in [9,10,37,50] targeted FPGA implementations of the learning phase, either online or in batch mode. Conversely, Decherchi et al. [7] and Ragusa et al. [35] proposed a minimal implementation of the forward phase of RBNs, while [32,49] introduced an effective scheme to reduce the memory requirements of the eventual predictors.
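The comparison between the cubic terms can be checked with a few lines of arithmetic. The sketch below considers only the linear-solve term and ignores the matrix products, whose cost also depends on the size of the training subset; the function name `cubic_cost_ratio` is hypothetical, and the number of sub-networks is set to 10 purely as an example.

```python
def cubic_cost_ratio(Q, n_sub_fraction):
    """Ratio between Q sub-problems of size n_sub_fraction * N and one problem of size N.

    Only the cubic linear-solve term is considered, so N cancels out.
    """
    return Q * n_sub_fraction**3

# Sub-network sizes expressed as fractions of the full hidden layer
ratios = {frac: cubic_cost_ratio(10, frac) for frac in (0.5, 0.3, 0.1)}
```

For this term alone and 10 sub-networks, the ratio is about 1.25 when the sub-networks hold half of the neurons, about 0.27 for 30% of the neurons, and about 0.01 for 10%, so the saving on the linear solve grows quickly as the sub-networks shrink or as fewer sub-networks are used.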

Generalization performances
To evaluate the generalization effectiveness of the proposed method, an experimental setup simulated a real use case; in particular, the size of the mapping layer was bounded by hardware constraints. Table 1 gives the main features of the 11 involved benchmarks, which were arranged in two sets: standard benchmarks and IoT benchmarks.
The proposed approach was compared with a pair of computationally demanding solutions, namely an ELM with L2 regularization [18] and a dropout ELM [21]. Those algorithms represented the most interesting solutions in terms of trade-off between computational cost, generalization performance, and impact of the hyper-parameters.
In the following, the presentation of the experimental results will always adopt a common format: for each experiment, a pair of sub-figures (a) and (b) will give the results obtained for different settings of the size, $N$, of the hidden layer. In all tables, rows will correspond to the size of the training sub-set $\tilde{Z}$, whereas columns will refer to $\tilde{N}$. The table rows/columns will give the settings with respect to the reference values, $Z$ and $N$, respectively. Therefore, the topmost row will mark a predictor trained with $\tilde{Z} = 0.9Z$ and the leftmost column will indicate $\tilde{N} = 0.5N$. The title of each table will give the classification error, expressed in the range [0, 1], scored by both the dropout regularized solution (Ios) [21] and the $L_2$ regularized network (L2), all holding $N$ neurons. Table cells will give the discrepancy between the test error (averaged over 100 iterations) scored by the proposed method and the error attained by the Ios regularized method:

$$T_{i,j} = \frac{1}{100} \sum_{n=1}^{100} \left( \mathrm{Ios}_n - \mathrm{Proposal}_n(i, j) \right)$$

where $T_{i,j}$ is the table element, $\mathrm{Ios}_n$ is the test error of the baseline [21] method using the $n$th random extraction of the hidden layer, and $\mathrm{Proposal}_n(i, j)$ is the test error scored by the proposed method using the setting $i, j$ and the pool of neurons belonging to the $n$th random hidden layer.
Positive values will be characterized by a green cell background and will indicate that the hardware-friendly dropout strategy scored better results. The cells having a red background, instead, will mark those tests in which the proposed method did not outperform conventional ELMs. Yellow background cells will denote those settings in which the discrepancy between the comparisons was marginal (less than 1%).
The statistical significance of the results was measured considering the weak law of large numbers. All measures marked in green and red were statistically significant because $|T_{i,j}| > |2\sigma_{\mathrm{Ios}}| + |2\sigma_{\mathrm{Proposal}}|$, where $\sigma_{\mathrm{Ios}}$ and $\sigma_{\mathrm{Proposal}}$ are the standard deviations of Ios and Proposal, respectively.
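The computation of a table cell and of the significance criterion can be sketched as follows. The error values are hypothetical, only four random extractions are used instead of 100, and the use of the population standard deviation is an assumption, since the text does not specify the estimator.

```python
from statistics import mean, pstdev

def table_cell(ios_errors, proposal_errors):
    """Average per-extraction gap Ios_n - Proposal_n; positive values favor the proposal."""
    return mean(i - p for i, p in zip(ios_errors, proposal_errors))

def is_significant(ios_errors, proposal_errors):
    """Criterion from the text: |T| > |2 sigma_Ios| + |2 sigma_Proposal|."""
    t = table_cell(ios_errors, proposal_errors)
    return abs(t) > 2 * pstdev(ios_errors) + 2 * pstdev(proposal_errors)

# Hypothetical test errors over four random hidden layers
ios = [0.20, 0.21, 0.19, 0.20]
prop = [0.15, 0.16, 0.15, 0.14]
cell = table_cell(ios, prop)
significant = is_significant(ios, prop)
```

In this toy case the average gap is about 0.05 in favor of the proposed method and exceeds the sum of the two doubled standard deviations, so the cell would be marked as significant.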
Each experiment involved: two settings of the hidden layer size, $N \in \{200, 1000\}$; 13 values of the hyper-parameter $\lambda = 10^i$, $i \in \{-6, -5, \ldots, 6\}$; 3 values of the parameter $\tilde{N} \in \{0.5, 0.3, 0.1\} \cdot N$; and 5 values of the parameter $\tilde{Z} \in \{0.9, 0.7, 0.5, 0.3, 0.1\} \cdot Z$. The datasets were always drawn by using a balanced configuration, in which the number of patterns was normalized to the least numerous class. Generalization performances were measured by using a standard hold-out procedure. The training set was extracted by using 70% of the data, whereas the validation and test sets included 20% and 10% of the data, respectively. The parameter $\lambda$ was set according to the best result scored on the validation set. Generalization performances are reported after measurements on test data, i.e., data that had never been used during either training or model selection. The following subsections illustrate the outcomes of the experiments for standard machine learning benchmarks and for IoT-specific benchmarks. Setting the iteration parameter $Q = 10$ limited the size of the experimental section. Indeed, that setting corresponded to the smallest value that proved sufficient in all the benchmarks to obtain stable results. This choice did not affect the validity of the results obtained on the test sets. As a matter of fact, the implementation analysis always considered a worst-case scenario because, in most cases, the smallest value of $Q$ that reached good performances was smaller than 10.
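The hold-out protocol and the selection of the regularization parameter on the validation set can be sketched as follows. This is an illustration under assumptions: the helper names `holdout_split` and `select_lambda` are hypothetical, a random matrix stands in for the hidden-layer activations, and the class balancing step is omitted.

```python
import numpy as np

def holdout_split(n_samples, rng, fractions=(0.7, 0.2)):
    """Shuffle indices and split them into train / validation / test parts."""
    idx = rng.permutation(n_samples)
    n_train = round(fractions[0] * n_samples)
    n_val = round(fractions[1] * n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def select_lambda(H_tr, y_tr, H_val, y_val, lambdas):
    """Return (validation error, lam, omega) for the best lam on the validation set."""
    G, r = H_tr.T @ H_tr, H_tr.T @ y_tr   # reused for every lam
    best = None
    for lam in lambdas:
        omega = np.linalg.solve(G + lam * np.eye(G.shape[0]), r)
        err = np.mean(np.sign(H_val @ omega) != y_val)
        if best is None or err < best[0]:
            best = (err, lam, omega)
    return best

rng = np.random.default_rng(0)
H = rng.standard_normal((500, 15))        # stand-in for hidden activations
y = np.where(H[:, 0] > 0, 1.0, -1.0)
tr, va, te = holdout_split(500, rng)
grid = [10.0**i for i in range(-6, 7)]    # 13 settings of the regularization parameter
val_err, lam, omega = select_lambda(H[tr], y[tr], H[va], y[va], grid)
test_err = np.mean(np.sign(H[te] @ omega) != y[te])  # test data never used for selection
```

The test indices are touched only once, after the regularization parameter has been fixed, which mirrors the separation between model selection and performance measurement described above.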
The reported results aim to confirm that the proposed algorithm could yield satisfactory accuracy values (as compared with the reference approaches) but at a smaller computational cost of the training process.

Standard machine learning benchmarks
The first set of experiments included 8 popular benchmarks for machine learning, all drawn from the UCI repository [8], mostly to allow fair comparisons with existing approaches.
The results on the QSAR dataset (Fig. 5) scored a limited gap (less than 3%) in performance between the proposed approach and the two baseline comparisons. Remarkably, the hardware-friendly dropout strategy proved equivalent or even more effective in terms of generalization performances.
The data presented in Fig. 6 refer to the Pima-Indians dataset. The results were similar to those shown in Fig. 5 and confirmed that the proposed dropout scheme could limit overfitting significantly in the presence of small subnetworks. Remarkably, this configuration was most convenient in terms of hardware requirements. Figure 7 gives the results obtained on the ''Default of credit card clients'' dataset. Although the proposed strategy seemed to yield lower performances, the gap always kept quite small; it was smaller than 0:1% in six cases and lower than 1% in the majority of the others. The minor degradation in performances was largely compensated by the hardware effectiveness of the supporting architecture.
The results on the Ozone dataset (Fig. 8) highlight the crucial trade-off between the performance of the base classifiers (i.e., the sub-networks trained independently) and that of the overall eventual predictor. Small sub-networks trained on small data chunks usually obtained unsatisfactory results (up to 6% of error increment). Conversely, when the sizes of the sub-networks and of the data chunks increased, the performance of the eventual classifiers became comparable with, or even better than, that of the baseline solutions, with gaps smaller than 1%.
The tests on the Ionosphere dataset (Fig. 9) exhibited a similar trend for $\tilde{Z}$. This behavior was due to the fact that this benchmark held a limited number of samples; conversely, when $\tilde{Z}$ exceeded 30% of $Z$, the proposed approach proved more convenient.
The differences between the comparisons were almost negligible in all the configurations for the HTRU dataset (Fig. 10). The gap was negligible also in profitable hardware configurations, i.e., when both $\tilde{Z}$ and $\tilde{N}$ took on small values.
When tested on the Blood dataset (Fig. 11), the present approach proved to be the best solution in almost all configurations. Finally, the results on the Australian Credit Card dataset (Fig. 12) confirmed that the proposed solution could attain performance that always closely approximated that achieved by the baseline algorithms.

Internet of Things benchmarks
These benchmarks belonged to a corpus of recent machine learning papers for IoT, and covered a wide spectrum of configurations, ranging from small-size to medium-/high-size problems. The MNIST dataset addressed the recognition of handwritten digits; as in previous works [7], the research presented here used a reduced version of that dataset, including 1000 patterns represented by 9 × 9 grey-scale images. The bi-class classification problem involved the (most difficult) discrimination task between digits "3" and "8". A similar setup was recently adopted in [48], presenting an IoT learning algorithm for visual patterns.
Distributed smart space orchestration system (Ds2os) was the second IoT-related benchmark and included a collection of traces captured in a networking domain for IoT. The data had been collected from the application layer; hence, they differed significantly from the conventional feature-based patterns used by network-traffic classifiers. The main dataset included various sources, such as light controllers, thermometers, person detection sensors, washing machines, batteries, thermostats, smart doors, and smart phones. In compliance with the comparative approach proposed in [15], the binary problem set in this paper discriminated normal activity vs. anomalies.
The freezing of gait (Fog) dataset [2] held the annotated readings of 3 acceleration sensors (positioned at the hips and legs) of patients affected by Parkinson's disease, who could experience freezing of gait during walking tests.
The results for the IoT benchmarks are reported in Figs. 13, 14 and 15. For the MNIST database, the experimental outcomes in Fig. 13 confirmed the effectiveness of the method. The baseline comparisons performed better only in the case $\tilde{N} = 0.1N$, whereas in most cases the differences in performance did not exceed 1%. On the other hand, the dropout strategy for IoT devices featured a considerable speedup of the learning procedure.
The results in Fig. 14 for the Ds2os benchmark highlight the role of the pool size, $N$. In the configurations involving $N = 200$ with small values of $\tilde{N}$, the proposed hardware-friendly method exhibited a non-negligible loss in terms of accuracy. This drawback almost vanished in configurations with $N = 1000$, as the loss in accuracy always remained smaller than 1%, with the already remarked advantage of a smaller computational cost.
The results for the Fog dataset (as per Fig. 15) showed a similar trend. The loss in accuracy proved significant when $N = 200$ and $\tilde{N} < 0.5N$. In practical scenarios involving networks with a sizeable set of neurons, however, the accuracy values matched those attained by standard approaches. From this viewpoint, the two basic procedures L2 and Ios scored an improvement in accuracy of about 4% when the pool size increased from 200 to 1000. In the latter setup ($N = 1000$), the proposed approach outperformed the reference comparisons.

A summary of generalization results
Overall, generalization performances proved comparable with the results reported in the literature, although a direct, thorough comparison could not always be accomplished due to differences in the experimental setups.
The outcomes of the experimental session can be summarized as follows:
• In the experiments involving the standard benchmarks, the proposed approach attained generalization performance comparable with that scored by the baseline algorithms (i.e., L2-regularized and dropout ELMs); the latter comparisons, however, proved much less efficient in terms of computational complexity.
• The experiments involving IoT benchmarks confirmed that trend, with the only exception of the Distributed smart space orchestration system dataset. The gap between the configurations with $N = 200$ neurons and those holding $N = 1000$ neurons seems to suggest that this issue might be solved by using a larger hidden layer.

Implementation analysis
The implementations on embedded architectures involved a pair of popular, commercially available devices for IoT applications: the Broadcom BCM2837B0 Quad-Core Cortex-A53 (1.4 GHz clock frequency, 32 kB L1 and 512 kB L2 cache) and the Allwinner H3 Quad-Core Cortex-A7 (clock up to 1.2 GHz), equipped with 1 GB and 512 MB of RAM, respectively. These devices were selected because they power well-known IoT boards, namely the Raspberry Pi 3B+ and the FriendlyARM NanoPi NEO Air.
The experimental setup took into account the impact of the quantities $\tilde{Z}$ and $\tilde{N}$ on the latency of the proposed method in IoT applications. The campaign simulated the training process by using a toy procedure, and input data were generated at random. The tests considered a pair of settings of the training-set size, $Z = \{500, 2000\}$, and two configurations of the hidden layer, $N = \{200, 1000\}$. Generalization performance was not an issue here because the proposed devices adopted the floating-point representation and the focus was on computational efficiency. The algorithms were all implemented in Python by using the NumPy library.
In the following, all the results will be arranged according to the size of the input layer. Each figure will include a pair of graphs, one for each setting of the training-set size, $Z$. The x-axis will group the results based on three settings of the hidden layer of the sub-networks: $\tilde{N} = \{0.5N, 0.3N, 0.1N\}$. Each group involves five configurations of the parameter $\tilde{Z}$. As a consequence, the left-most bars will always refer to the most demanding configuration, whereas the right-most bars will mark the most profitable settings. The y-axis will show the values of $h = T_{\mathrm{proposal}} / T_{L2}$, where $T_{\mathrm{proposal}}$ is the training time of the proposed approach and $T_{L2}$ is the training time of an L2 basic classifier. Thus, values of $h < 1$ and $h > 1$ will indicate that the proposed solution proved faster or slower, respectively. The horizontal red line will mark the case $h = 1$. The L2-regularized network was adopted as a reference comparison because, according to the analysis of the computational cost, it proved more efficient than the solution proposed in [21].
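The ratio $h$ can be measured with a short timing harness such as the one below. It is a hedged sketch under several assumptions that this section does not state: the hidden layer uses a tanh activation, the L2 baseline solves the regularized normal equations, only the solution of the linear systems is timed, and the $Q$ sub-networks are merged by averaging the weights assigned to each pool neuron. The names `train_l2`, `train_subnetworks` and `timed`, the 20 input features, and the value of $k$ are hypothetical.

```python
import time
import numpy as np

def train_l2(H, y, k):
    """Output weights of an L2-regularized network: (H^T H + k I) beta = H^T y."""
    n = H.shape[1]
    return np.linalg.solve(H.T @ H + k * np.eye(n), H.T @ y)

def train_subnetworks(H, y, k, n_sub, z_sub, Q, rng):
    """Train Q sub-networks, each on n_sub random neurons and z_sub random patterns.

    Assumption: the Q solutions are merged by averaging the weights that each
    pool neuron received; the integration rule of the paper may differ.
    """
    Z, N = H.shape
    beta_sum = np.zeros(N)
    hits = np.zeros(N)
    for _ in range(Q):
        cols = rng.choice(N, size=n_sub, replace=False)
        rows = rng.choice(Z, size=z_sub, replace=False)
        beta_sum[cols] += train_l2(H[np.ix_(rows, cols)], y[rows], k)
        hits[cols] += 1
    # Neurons that were never drawn keep a zero weight.
    return beta_sum / np.maximum(hits, 1)

def timed(fn, *args):
    """Return the result of fn(*args) and the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - start

rng = np.random.default_rng(0)
Z, N, Q, k = 500, 200, 10, 1e-2
X = rng.normal(size=(Z, 20))           # random toy inputs, as in the text
y = rng.choice([-1.0, 1.0], size=Z)    # random binary targets
H = np.tanh(X @ rng.normal(size=(20, N)) + rng.normal(size=N))  # pool of N random neurons

_, t_l2 = timed(train_l2, H, y, k)
beta, t_prop = timed(train_subnetworks, H, y, k,
                     round(0.1 * N), round(0.1 * Z), Q, rng)
h = t_prop / t_l2                      # h < 1: the ensemble trained faster
print(f"h = {h:.3f}")
```

A single run is noisy on small problems; repeating each measurement and taking the median would give a more stable estimate of $h$.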
The first experiment involved the Quad-Core Cortex-A7 with 512 MB of RAM. The entire training process ran on a single core; hence, no parallelism was exploited. This experimental setup simulated a heavily constrained scenario; the amount of resources involved was considerably smaller than that available on an average smartphone. Figure 16 gives the results for the configuration with $N = 200$. The proposed approach proved extremely convenient in terms of latency. The cost always remained lower than 10% in all the configurations with $\tilde{N} = 0.1N$. Interestingly, the advantage for the configuration with the highest computational load, $\tilde{N} = 0.5N$ and $\tilde{Z} = 0.9Z$, approximated the best value 1. On the other hand, setting $Q = 10$ implied a worst-case analysis. Figure 17 considers the same hardware setup with a different size of the hidden layer, i.e., $N = 1000$. The experimental results showed a trend similar to that of the experiments reported previously.
The second set of implementation configurations involved the Quad-Core Cortex-A53 with 1 GB of RAM. As in previous tests, only one core of the Cortex-A53 was enabled. The major difference in this setup was the amount of memory (twice as much as in the Cortex-A7 tests). Figures 18 and 19 present the related outcomes. The graphs show that the dropout-based strategy still attained remarkable speedup values for configurations with $\tilde{N} = 0.1N$ and $\tilde{N} = 0.3N$ when $Z = 500$. When $Z = 2000$, the speedup remained comparable with the performance scored in the presence of limited training sets (i.e., less than 1000). Simulations always addressed a worst-case analysis, in which $Q = 10$.
The final experimental setup involved configurations that fully exploited the available hardware resources. The proposed strategy relied on multi-threading; a thread was instantiated for each sub-network. Such an approach clearly implied a larger memory (RAM) consumption ($Q$ times bigger), since the threads were expected to run in parallel. The number of available cores (4) set the corresponding best possible speedup value. For this reason, only the Quad-Core Cortex-A53 with 1 GB of RAM was used for these experiments. Figures 20 and 21 report the results of the tests for $N = 200$ and $N = 1000$, respectively. The reported results point out the advantages in latency featured by the proposed method in the configurations with $0.3N$ and $0.1N$. The configuration with $0.5N$ actually suffered from the limited available RAM; this prevented an efficient execution of multiple tasks in parallel.
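The one-thread-per-sub-network scheme described above can be illustrated with Python's standard `concurrent.futures` module. This is only a sketch under assumptions: the text does not say which threading API was used, and the helper names `train_l2` and `train_one`, the tanh activation, the toy data, and the per-thread seeds are hypothetical. Threads can overlap here because NumPy's linear-algebra routines typically release the interpreter lock while they run.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def train_l2(H, y, k):
    """Output weights of an L2-regularized network: (H^T H + k I) beta = H^T y."""
    n = H.shape[1]
    return np.linalg.solve(H.T @ H + k * np.eye(n), H.T @ y)

def train_one(seed, H, y, k, n_sub, z_sub):
    """Train one sub-network on a random subset of neurons and patterns."""
    rng = np.random.default_rng(seed)  # one generator per thread
    Z, N = H.shape
    cols = rng.choice(N, size=n_sub, replace=False)
    rows = rng.choice(Z, size=z_sub, replace=False)
    return cols, train_l2(H[np.ix_(rows, cols)], y[rows], k)

rng = np.random.default_rng(0)
Z, N, Q, k = 500, 200, 10, 1e-2
n_sub, z_sub = round(0.3 * N), round(0.3 * Z)
X = rng.normal(size=(Z, 20))           # random toy inputs
y = rng.choice([-1.0, 1.0], size=Z)    # random binary targets
H = np.tanh(X @ rng.normal(size=(20, N)) + rng.normal(size=N))

# One worker thread per sub-network, as described in the text; the number
# of physical cores still bounds the achievable speedup.
with ThreadPoolExecutor(max_workers=Q) as pool:
    futures = [pool.submit(train_one, seed, H, y, k, n_sub, z_sub)
               for seed in range(Q)]
    results = [f.result() for f in futures]
```

Each entry of `results` holds the indices of the selected pool neurons and the corresponding output weights, ready to be merged into one classifier.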
The outcomes of the experimental session about HW implementation can be summarized as follows:
• In the presence of tight memory constraints, the proposed solution scored remarkable speedup values in almost all configurations.
• When more relaxed memory constraints were allowed, as per Figs. 18 and 19, the speedup performance still proved significant when the network sizes remained smaller than $0.3N$. In the other cases, the values assumed by $Q$ brought about a significant impact.
• When applying multiprocessing, as per Figs. 20 and 21, the implementations confirmed that the amount of shared memory influenced the speedup performance.
Finally, it is worth noting that the configurations (involving $N$ and $\tilde{Z}$) that proved most profitable in terms of hardware implementation also proved most effective in terms of generalization performance.

Conclusions
The paper proposed a novel training procedure for RBNs in resource-constrained scenarios. The focus of the proposed method was the trade-off between generalization performance and the computational cost of the training phase. The major outcome of the described research consists in showing the feasibility and effectiveness of the proposed method for implementing the learning phase on IoT devices. Extensive experiments confirmed the satisfactory generalization performance of the proposed strategy. In particular, an extensive implementation analysis confirmed the feasibility of the proposed approach on low-power devices.