1 Introduction

Neural networks often demonstrate excellent proficiency in solving complex and challenging problems such as image classification, speech recognition and natural language processing [4, 35]. Designing an optimal neural network architecture is usually considered a manual exercise that relies heavily on human experience and expertise. Nevertheless, recent work in Neural Architecture Search (NAS), which automates the exploration of neural network architectures, has produced some highly effective methodologies. These methodologies search for an efficient neural network architecture for a specific task while demanding little human expertise and intervention. The NAS algorithms proposed over the last few years cover a wide range of domains and search techniques. They vary in their definition of the search space, search strategy, or performance estimation technique [17]. Moreover, the discovered neural network architectures often outperform manually designed ones [7, 37, 58].

Popular NAS methodologies use evolutionary algorithms [33, 38] and reinforcement learning [37, 57]. However, they need numerous GPUs running continuously to prepare, run and converge the search process, which can consume tens to thousands of days. Many of these approaches rely heavily on resource-intensive training to direct the search algorithm, and this training is usually treated as an isolated, separate task whose purpose is to estimate model performance. Our work examines NAS from a different viewpoint: we investigate the possibility of searching for efficient architectures during a modified and extended training process, in contrast to the conventional view of training as a distinct function required only for accuracy evaluation.

Moreover, recent research on NAS is chiefly focused on datasets in the image classification domain, specifically CIFAR-10 [26] and ImageNet [14], which has led to novel and intricate search spaces typically matched to vision-based tasks. These search spaces are derived from earlier hand-crafted architectures, for example residual connections [22] and cell-based designs such as Inception [47] and DenseNet [24], or are generated in a graph-like fashion [46]. Even though these search spaces exhibit ingenuity as well as efficiency, they are complex to understand, design and train. Many tasks and domains of medium complexity, such as human activity recognition [49], earth sciences [28, 36] and astronomical studies [11], utilize plain Convolutional Neural Networks (CNNs) in their research. Plain CNNs are considered sufficient and are well understood by scientists and researchers who do not come from a background in Artificial Intelligence (AI).

Therefore, a tool that is convenient for non-AI experts from various domains, and that finds an efficient CNN in a timely manner, is vital to simplify the design process of neural networks and subsequently democratize AI. For example, a neural network for breast cancer classification has been proposed in [55], which was manually designed and further optimized. For a non-AI expert, in the absence of relevant tools, limited knowledge of neural network architecture design can be a hindrance to effectively utilizing AI models in their respective domain. In this direction, the chief contribution of this work is a novel algorithm for NAS, called Evolutionary Piecemeal Training. This paper is an extension of our earlier work presented in [41]; here, the methodology is extended with the ability to define multiple objectives for the search process.

Our algorithm explores the constrained design space of CNNs for the selected task and attempts to discover an efficient architecture while converging in a limited number of GPU hours. The CNN models that are created and altered during the search are constrained by a minimum and maximum value for each of the architecture parameters. These bounds ensure that the size of all CNN architectures is regulated and, at the same time, that the models do not outgrow the available hardware resources. This is one of the crucial considerations for models intended to be implemented on embedded systems, for instance wearables in the Human Activity Recognition (HAR) domain [31, 32]. High computational and memory demands of large neural networks may result in inefficient utilization of the limited resources on the embedded device.

The NAS technique proposed in this work is based on a population-based computation method, which allows a group of neural networks to train simultaneously. During the training process, random CNNs from the population are chosen and evolutionary operators are applied to them. These evolutionary operators are designed in such a way that they lead to small architecture modifications and hence guide the exploration of the larger search space. Every new architecture derived through modification is invariably partially trained, since the parent architecture was already undergoing training. In every subsequent iteration, the CNNs continue to train while some of them are subject to architecture modifications. When the algorithm converges, the best individual models (those with high accuracy) are chosen from the population. These selected CNNs can then be processed and trained for more epochs.

In particular, we use a genetic algorithm in our methodology, which was chosen from amongst different meta-heuristic algorithms, such as Simulated Annealing and Particle Swarm Optimization, after considering various factors. The most important factor was the ability of the algorithm to allow the training to continue while simultaneously searching for an appropriate neural architecture. To this end, the genetic operators (mutation and crossover) can be defined in such a manner that there is minimum disruption to the training process while the large design space of architectures is explored. Additionally, genetic algorithms are well studied in the multi-objective search domain, and the potential to extend to multi-objective search was taken into consideration at the onset of this research. Moreover, other NAS methodologies that are based on evolutionary algorithms prominently use the genetic algorithm in their research [33, 39].

In the current work, we perform experiments with the PAMAP2 [40] dataset for the HAR domain, in which data measured by body-worn sensors is used to predict the activity being performed by the human wearer. The versatility of this approach is demonstrated through the use of the CIFAR-10 dataset in the domain of image classification, which is markedly different from the HAR domain in terms of both input data type and format.

The methodology is further extended to consider multiple objectives for the search. To demonstrate the efficacy of this extension to the original methodology, we consider the reduction of the number of parameters of the neural network as an additional search goal. Accuracy maximization and parameter minimization can be conflicting objectives for an efficient neural network: smaller CNNs tend to have lower accuracy, and high accuracy is generally obtained by larger neural networks. However, too many parameters tend to cause over-fitting, which may lead to poor generalization. This highlights the importance of a constrained search process, not only for hardware resource usage but also to avoid over-fitting. With a multiple-objective search, the best candidates are selected through Pareto optimization, where no objective can be improved without worsening at least one of the other objectives. The set of candidates selected in such a fashion is collectively called a Pareto Front. We apply this proposed extension to the PAMAP2 dataset and perform the multi-objective search for efficient architectures. The Pareto Front obtained upon convergence sets forth the various possible CNNs to deploy on the wearable embedded device. It makes the designer aware of the available architecture choices and of the trade-off each CNN provides between its size and its accuracy. One of these CNNs can then be deployed depending upon the desired functional goals and the available resources on the device.

The remainder of this paper is structured as follows. Firstly, in Section 2, recently published related works for NAS are discussed. Subsequently, Section 3 presents our methodology and concepts, and further outlines the algorithm in detail. Next, the experimental setup is explained and the results from evaluations and validations are described in Section 4. Lastly, Section 5 concludes the paper.

2 Related work

Various research works have been published recently demonstrating the proficiency of NAS techniques. They can be partitioned mainly into three categories, namely Reinforcement Learning (RL) based, evolutionary algorithm based and one-shot architecture search. Both reinforcement learning and evolutionary algorithms mandate the complete training of the neural network at each iteration for performance evaluation. In RL based methods [5, 37, 57], the validation performance, or accuracy, of the trained model determines the reward given to the RL agent. When the agent is continuously rewarded for finding better architectures, the search is slowly steered towards neural networks with higher performance. Any RL approach demands a suitable agent, which frequently happens to be a complicated model or perhaps another neural network. The construction of the agent and its optimization involve substantial effort in design and subsequent fine-tuning.

Evolutionary methodologies [9, 30, 38, 39, 52] utilize genetic algorithms to discover an efficient neural architecture in a large search space. Evolutionary algorithms have also been successfully deployed for CNN optimizations, such as compression [25], pruning [19] and hyper-parameter optimization [45]. Evolutionary NAS algorithms work with a population of possible CNN candidates, where each one of them is trained and evaluated at every iteration. With subsequent iterations, the models in the population get selected, rejected and modified depending on their accuracy and other control variables of the algorithm. The aim is to improve the population’s average performance over time, so that when the algorithm eventually converges, it has discovered an architecture for a high performing neural network. Our work also utilizes an evolutionary algorithm for architecture modification; the key difference is in the manner in which training and architecture modifications are interwoven in the algorithm to conduct a joint search for both weights and architecture.

Unfortunately, most of these approaches demand intensive computational resources to train hundreds or thousands of neural network architectures. For example, the RL method in [57] trained more than 10,000 models, involving over a thousand GPU days, while another evolutionary search [16] required 56 GPU days to finally converge. Other works have utilized proxy tasks, for example hyper-networks [6], predictors [5, 13] and controllers [37], to speed up the search process. However, these still demand considerable planning and implementation time before the actual search commences. In direct contrast, our algorithm does not require helpers or proxy tasks and still converges in a reasonable time.

One-Shot NAS methodologies are based on the concept of a trained super-network, which contains all possible sub-networks. The entire super-network may have to fit in the GPU memory during the NAS execution, which highly restricts the architecture size and typically results in a small cell with a limited number of operations. DropPath [58] is an example of a one-shot search approach, where a path is dropped out with a fixed probability, and by randomly removing different paths, a new sub-network is formed. The pre-trained super-network is then used to evaluate and eventually discover the best sub-network architecture. DARTS [29] additionally proposed an architecture parameter for every path and employed standard gradient descent to train the weight and architecture parameters together. Other approaches attempt to be more efficient by utilizing other proxy tasks; for instance, [7] proposed a memory-efficient algorithm that updates fewer paths while searching. The cell that is discovered through one-shot search needs to be sufficiently repeated and connected in an appropriate manner to eventually form a neural network that performs the task efficiently. Aside from posing a meta-architecture design challenge, models based on the replication of a single cell may not be suitable for various domains.

With the advent of connected and intelligent edge devices, the deployment of neural networks in a wide range of settings requires a multi-objective perspective [42, 56]. These devices do not have large computation or storage capabilities. Keeping this in mind, most of the recent NAS methodologies have started to focus on objectives other than the accuracy of the neural network, such as the number of parameters, the number of mathematical operations, hardware usage (power, memory) and latency. Evolutionary algorithms have been widely explored for multi-objective NAS, as their ability to easily incorporate an extra objective is well known [10]. LEMONADE [16] is an evolutionary multi-objective algorithm, which utilizes the Lamarckian inheritance mechanism. NSGA-Net [30] utilizes the Non-dominated Sorting Genetic Algorithm II (NSGA-II) to construct the Pareto Front, with the aim of learning the trade-off between the classification error of a model and its computational complexity.

Other approaches for multi-objective search have also been researched in recent times. For instance, MONAS [23] and MnasNet [48] are Reinforcement Learning based approaches for multi-objective NAS. While MONAS uses validation accuracy and energy expenditure on the target model to create a Pareto Front, MnasNet explicitly incorporates latency information in the main search objective to discover models with a good trade-off between accuracy and latency.

In order to deploy one-shot architecture search in a multi-objective scenario, the objectives are incorporated into the loss function of the training process. For example, FBNet [50] and DenseNAS [18] are multi-objective one-shot differentiable NAS frameworks. In the former, the loss function is the weighted product of the cross-entropy loss incurred during training and the latency of the target device, whereas in the latter, the loss function is the weighted aggregation of the cross-entropy loss and a latency-based regularization term. However, it is not possible to add all objectives to the search in this manner, which makes these methods inflexible when an additional objective has to be included on the go.

3 Methodology

In this section, we go into the details of our proposed methodology, Evolutionary Piecemeal Training, and describe its key concepts and the complete algorithm. Piecemeal training refers to the training of a CNN with a small ‘data-piece’ randomly taken from the whole dataset, the size of which, referred to as δ, can vary from 5% to 20% of the dataset. The conventional training of a neural network is regularly interrupted by an evolutionary operator, at intervals determined by δ. The operator modifies some parameters in the model architecture, and subsequently permits the continuation of the training process. Numerous CNNs begin this training in parallel, creating a population that is subject to architecture modifications after each iteration of piecemeal training. The models that do not perform as well as other models are removed from the population. Conceptually, in the context of neural network training, this can be envisioned as the early termination of candidates showing no promise in their ability to reach a high accuracy.
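As an illustration of how a data-piece might be drawn, the following minimal sketch samples the indices of a random subset covering a fraction δ of the dataset. The function name `sample_data_piece` and the choice of sampling without replacement within one piece are assumptions made for this example, not details taken from the algorithm.

```python
import random


def sample_data_piece(num_samples, delta, rng):
    """Return the indices of a random data-piece covering a fraction `delta` of the dataset.

    Hypothetical helper: sampling without replacement is an assumption of this sketch.
    """
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must lie in (0, 1]")
    piece_size = max(1, round(num_samples * delta))
    return rng.sample(range(num_samples), piece_size)


rng = random.Random(0)
# One piecemeal-training iteration on a dataset of 50,000 samples with delta = 10%
piece = sample_data_piece(num_samples=50_000, delta=0.10, rng=rng)
```

A fresh data-piece would be drawn for every piecemeal-training iteration, so that over many iterations each model still sees most of the training data.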

3.1 Search space

All possible neural network architectures with their configurations and constraints constitute the search space for our work. More specifically, we focus on linearly connected plain CNNs, where layers are only connected to their consecutive layer, and do not have complex connectivity through residual connections or branches. We anticipate that the search methodology can be extended to more complicated search spaces. However, this particular research is focused on plain CNNs, which are used by many non-AI experts in their respective domains and may be considered sufficient for the given task [11, 36, 49].

Formally, a neural network \(\widetilde {nn}\) consists of its architecture T and weights \(\omega \in \mathbb {R} \).

$$ \begin{array}{@{}rcl@{}} \widetilde{nn} = \{T, \omega\} \end{array} $$
(1)

Here, T is a sequence of layers, which can further be grouped into clusters of consecutive layers of the same type. Fig. 1 illustrates the concept of cluster formation for a simple CNN. All consecutive layers of the same type are grouped in the same cluster; for instance, the first two convolutional layers are in the cluster C1, and cluster C5 contains only fully-connected layers. A general cluster based architecture T with l clusters, and with I and O as input and output layers respectively, can now be defined as:

$$ \begin{array}{@{}rcl@{}} {T = \{I, C_{1}, C_{2} ... C_{l}, O\}}, \end{array} $$
(2)

Fig. 1 Plain CNN architecture where similar consecutive layers are grouped into clusters. Conv is a convolutional layer, MaxP is a max-pool layer and FC is a fully-connected layer

Fig. 2 A general cluster based architecture with l clusters, where each cluster defines its layer type along with the constraints it enforces on member layers

Figure 2 illustrates a general cluster based architecture. Additionally, every cluster in the architecture places constraints on its number of layers, on the number of units per layer, and on other layer specific parameters such as the kernel size in a convolutional layer. A cluster \(C_{k}\) of type \(C_{k}^{type}\) consisting of n layers can be represented by:

$$ \begin{array}{@{}rcl@{}} C_{k} &=& \{L_{k1}, L_{k2}, \ldots, L_{kn}\}\colon \beta_{k}^{min} \leq n \leq \beta_{k}^{max}\\ \text{where } L_{ki} &=& \left\{ L \mid L \in [C_{k}^{type}, \eta_{ki}, p_{ki}]\right\} \colon \eta_{k}^{low} \leq \eta_{ki} \leq \eta_{k}^{up}\\ &\ni& \left\{\beta_{k}^{min}, \beta_{k}^{max}, \eta_{k}^{low}, \eta_{k}^{up} \in \mathbb{N}^{+}\right\} \text{ and } p_{ki} \in \pi_{k} \end{array} $$
(3)

Here, (\(\beta _{k}^{min}, \beta _{k}^{max}\)) are the bounds on the number of layers possible in the cluster, and the number of neurons in a layer is bounded by (\(\eta _{k}^{low}, \eta _{k}^{up}\)). Moreover, the hyper-parameters that depend on the layer type, \(p_{ki}\), such as kernel size and stride, are defined by \(\pi_{k}\) in the cluster. These constraints are specific to each cluster and are independent of the bounds of the other clusters. Every layer in a cluster is of the same type (e.g., convolution, pooling, fully connected), and its hyper-parameters conform to the constraints placed by its parent cluster.

For our experiments, all clusters and their boundary definitions are fixed before the start of the search algorithm. All possible permutations of layers and their hyper-parameters together represent the whole search space. This architecture search space is usually not trivial to navigate. For example, the search spaces for the experiments in Section 4.2 have \(10^{8}\) (CIFAR-10) and \(10^{5}\) (PAMAP2) possible architectures. The search space definition is encoded in the form of a genotype, which ensures that not only the factory that generates new neural networks, but also the evolutionary operators, adhere to the cluster constraints.
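The cluster constraints of (3) and the role of the genotype can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class `ClusterSpec`, its field names, the helper functions and the toy bounds are all hypothetical, and the count assumes that every integer unit count within the bounds is an allowed choice.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSpec:
    """Constraints of one cluster (field names are illustrative, not from the paper)."""
    layer_type: str   # C_k^type
    min_layers: int   # beta_k^min
    max_layers: int   # beta_k^max
    min_units: int    # eta_k^low
    max_units: int    # eta_k^up
    params: tuple     # pi_k: allowed values of the layer-specific hyper-parameter


def sample_cluster(spec, rng):
    """Draw a random cluster, as a list of (type, units, param) layers, within the bounds."""
    n = rng.randint(spec.min_layers, spec.max_layers)
    return [(spec.layer_type,
             rng.randint(spec.min_units, spec.max_units),
             rng.choice(spec.params)) for _ in range(n)]


def is_valid(cluster, spec):
    """Check that a cluster satisfies every constraint of its specification."""
    return (spec.min_layers <= len(cluster) <= spec.max_layers
            and all(t == spec.layer_type
                    and spec.min_units <= units <= spec.max_units
                    and p in spec.params
                    for t, units, p in cluster))


def count_architectures(genotype):
    """Number of distinct architectures the genotype can generate."""
    total = 1
    for spec in genotype:
        per_layer = (spec.max_units - spec.min_units + 1) * len(spec.params)
        total *= sum(per_layer ** n
                     for n in range(spec.min_layers, spec.max_layers + 1))
    return total


# Toy genotype: one convolutional cluster followed by one fully-connected cluster
genotype = [ClusterSpec("conv", 1, 2, 16, 19, (3, 5)),
            ClusterSpec("fc", 1, 1, 32, 33, (None,))]
rng = random.Random(0)
architecture = [sample_cluster(spec, rng) for spec in genotype]
search_space_size = count_architectures(genotype)  # 144 for this toy genotype
```

Even with such narrow toy bounds the count grows multiplicatively across clusters, which is why realistic bounds quickly lead to search spaces that cannot be enumerated.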

3.2 Population based training

As mentioned before, the training process employed in this work is based on a population based approach. A population of CNN models is randomly picked from the defined search space and subsequently randomly initialized. With each iteration, every individual model in the population is piecemeal-trained with a data-piece of size δ. Afterwards, the performance of each model is evaluated on the validation set, and the algorithm uses this evaluation to define the next generation of the population through selection, replacement and modification functions.

While a neural network is being trained, its weights are constantly changing, and in that sense both the weights and the neural network may be considered as functions of time (i.e., iterations): ω(t) and \(\widetilde {nn} (t)\). \(\widetilde {nn} (0)\) is then the initial neural network at the beginning of its training with randomly initialized weights ω(0). The architecture of this neural network remains unchanged during the training. Hence,

$$ \begin{array}{@{}rcl@{}} \widetilde{nn} (t) = \{T, \omega (t)\} \end{array} $$
(4)

A neural network \(\widetilde {nn}\) belongs to \(\widetilde {NN}\), the set of all possible neural networks with an architecture \(T \in \widetilde {T}\), where \(\widetilde {T}\) is the set of all architectures defined in the search space along with its constraints. The population of neural networks may also be seen as a function of time. The population \(\widetilde {P} (t)\) of CNNs at any given time can then be defined as the set of neural networks at that point,

$$ \begin{array}{@{}rcl@{}} \widetilde{P} (t) = \left\{ \widetilde{nn_{1}}(t), \widetilde{nn_{2}}(t), \widetilde{nn_{3}}(t), \ldots, \widetilde{nn_{s}}(t)\right\} \colon s \in \mathbb{N}^{+} \end{array} $$
(5)

The population size, s, is constant throughout the duration of the algorithm. If a neural network is dropped from the population because it is not performing as well as the other models, then it has to be replaced by another neural network. Moreover, the population size is required to be large enough so that enough diversity is maintained among the CNN models in the population.

The algorithm runs for \(\tau_{max}\) iterations, the value of which is dependent on the nature and complexity of the task. \(\tau_{max}\) can be defined at the beginning of the search or can be updated during the iterations based on the rate of change of the evaluation metric, such as prediction accuracy. When the change in the evaluation metric slows down, the algorithm is highly likely to be near convergence and can therefore halt further iterations.
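One possible way to express such a halting rule is sketched below. The function name `has_converged`, the look-back `window` and the `tolerance` threshold are assumptions of this example; the text does not prescribe a specific criterion.

```python
def has_converged(history, window, tolerance):
    """Return True when the metric changed by less than `tolerance` over the last `window` iterations.

    `history` holds one evaluation metric value (e.g. best accuracy) per iteration.
    """
    if len(history) <= window:
        return False  # not enough iterations observed yet
    return abs(history[-1] - history[-1 - window]) < tolerance


# Best validation accuracy recorded after each iteration (made-up values)
accuracy_history = [0.50, 0.70, 0.80, 0.85, 0.851, 0.8515]
stop_now = has_converged(accuracy_history, window=2, tolerance=0.01)
```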

At any time t, the population is then defined as,

$$ \begin{array}{@{}rcl@{}} \forall 1 \leq t \leq \tau_{max}, \widetilde{P} (t) = f_{ept}(\widetilde{P} (t-1)) \end{array} $$
(6)

The function \(f_{ept}()\) is applied to the population during every iteration. In the absence of the evolutionary architecture modifications, \(f_{ept}()\) consists of only the training function \(f_{train}()\), which trains a CNN with a small subset of the training data. It is possible to train all neural networks in an iteration in any order of sequential and parallel executions, to best utilize the available computational resources.

3.3 Evolutionary operators

The evolutionary algorithm needs to sufficiently explore the huge search space to ensure that a good model with high performance can be discovered in a reasonable time. To this end, at each evolutionary step, the architectures of some of the models in the population are modified through one of the evolutionary operators, i.e. the recombination or mutation operator. While mutation makes small changes to one layer at a time, recombination exchanges some layers between two models to create significantly different models. In this respect, the mutation operator explores the search space close to the existing population, whereas the recombination operator explores a wider design space by generating distant architectures. The number of evolutionary operators executed in each iteration is controlled through a pre-defined mutation rate (\(P_{m}\)) and recombination rate (\(P_{r}\)).

Mutation operates on a CNN and randomly selects one layer to change one of its hyper-parameters, such as the number of kernel units in the layer or the kernel size. We employ the Net2Wider operator from [8] to broaden the layer by increasing the number of kernel units. On the other hand, to shrink the layer, we use a pruning process [27], which reduces the number of units by removing the least significant kernels in terms of their activation weights. Kernels are radially zero-padded or cropped from the outer edge, when their size changes because of mutation.

Furthermore, the mutation operator is devised to be function-preserving in nature, so that mutation does not disrupt the ongoing training of the neural networks. Any change to the architecture will invariably cause an additional loss in the training process. The functions in the mutation operator were chosen because they are either totally or at least partially function-preserving, implying that the loss incurred from these operators is as small as possible and is recovered quickly during later piecemeal-training iterations.
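The idea of a function-preserving widening can be illustrated on a tiny fully-connected network. The sketch below follows the general principle of the widening operator from [8] (replicate existing units and divide their outgoing weights by the replication count), but it is a simplified stand-in: it uses dense layers in plain Python instead of convolutional kernels, and the names `forward` and `widen` are hypothetical.

```python
import random


def forward(x, w1, w2):
    """Two-layer network: hidden = relu(x . W1), output = hidden . W2."""
    hidden = [max(0.0, sum(xi * w1[i][j] for i, xi in enumerate(x)))
              for j in range(len(w1[0]))]
    return [sum(h * w2[j][k] for j, h in enumerate(hidden))
            for k in range(len(w2[0]))]


def widen(w1, w2, new_width, rng):
    """Widen the hidden layer to `new_width` units without changing the network output."""
    old_width = len(w1[0])
    if new_width < old_width:
        raise ValueError("new_width must not be smaller than the current width")
    # Every extra unit is a copy of a randomly chosen existing unit
    mapping = list(range(old_width)) + [rng.randrange(old_width)
                                        for _ in range(new_width - old_width)]
    counts = [mapping.count(j) for j in range(old_width)]
    # Incoming weights are copied as they are
    new_w1 = [[row[j] for j in mapping] for row in w1]
    # Outgoing weights are divided by how often the source unit now appears
    new_w2 = [[w / counts[j] for w in w2[j]] for j in mapping]
    return new_w1, new_w2


rng = random.Random(1)
w1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]  # 4 inputs, 3 hidden units
w2 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(3)]  # 3 hidden units, 2 outputs
x = [0.5, -1.0, 2.0, 0.1]

wide_w1, wide_w2 = widen(w1, w2, new_width=5, rng=rng)
before = forward(x, w1, w2)
after = forward(x, wide_w1, wide_w2)  # equal to `before` up to floating-point rounding
```

Because the widened network computes the same function, training can resume from the same loss value; only later updates make the replicated units diverge from each other.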

In direct contrast to mutation, recombination operates on two neural networks and swaps all their layers in a cluster. The swap is carried out for only one randomly selected cluster position. Figure 3 shows an example of the recombination operator, which swaps the cluster \(C_{2}\), containing a different number of layers in each model, between two different neural networks. Since all models have exactly the same number of clusters, the layers that are exchanged are approximately in the same stage of neural computation, and hence the new models need minimal repair to the architecture to remain valid.

Fig. 3 Recombination operator applied to two neural networks, where the cluster \(C_{2}\) gets swapped. The cluster has a different number of layers in each model

It is important to note that recombination is not a function-preserving operator. However, it is required in the algorithm to introduce and maintain diversity [43] by bringing significantly varied models into the population. This is achievable because the number of layers in the swapped cluster need not be the same in the two models. To diminish the adverse effect of the loss incurred by the recombination operator, a cooling-down approach is applied to the recombination rate. During the early iterations, when the training loss has not yet started to converge and is at a high value, more swaps are allowed than in later iterations, when the training loss is low.
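A cooling-down schedule of this kind can be written as a small function. The sketch assumes a linear decay that reaches zero at the final iteration, which matches the linear cooling mentioned in the algorithm description; the function name and the exact end point are assumptions of this example.

```python
def cooled_recombination_rate(initial_rate, iteration, total_iterations):
    """Linearly decay the recombination rate from `initial_rate` to 0 at the last iteration."""
    if total_iterations < 2:
        return initial_rate
    progress = iteration / (total_iterations - 1)  # 0.0 at the start, 1.0 at the end
    return initial_rate * (1.0 - progress)


# Recombination rate over an 11-iteration run that starts at 0.2
schedule = [cooled_recombination_rate(0.2, i, 11) for i in range(11)]
```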

Together, these evolutionary operators are responsible for traversing the large design space of neural network architectures in an efficient manner. Additionally, these operators are responsible for making sure that the cluster constraints are always adhered to. The mutation operator never allows a layer to expand or narrow beyond the cluster defined boundaries, and it also ensures that other layer hyper-parameters conform to the cluster specifications as well. The recombination operator swaps clusters which are already within their bounds, thus maintaining the constraints.
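The cluster swap performed by recombination can be sketched as follows, with an architecture represented as a list of clusters and each cluster as a list of (type, units) layers. This representation and the function name `recombine` are assumptions made for illustration; weights are left out entirely.

```python
import copy
import random


def recombine(parent_a, parent_b, rng):
    """Swap all layers of one randomly chosen cluster position between two architectures."""
    if len(parent_a) != len(parent_b):
        raise ValueError("both models must have the same number of clusters")
    k = rng.randrange(len(parent_a))  # cluster position to swap
    child_a, child_b = copy.deepcopy(parent_a), copy.deepcopy(parent_b)
    child_a[k], child_b[k] = child_b[k], child_a[k]
    return child_a, child_b, k


model_a = [[("conv", 16), ("conv", 32)], [("conv", 64)], [("fc", 128)]]
model_b = [[("conv", 8)], [("conv", 32), ("conv", 32), ("conv", 48)], [("fc", 64), ("fc", 32)]]

rng = random.Random(3)
child_a, child_b, position = recombine(model_a, model_b, rng)
```

Since each swapped cluster already satisfied the bounds of its position in the donor model, and both models share the same cluster definitions, the children remain inside the search space without any extra check.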

3.4 Selection and replacement

One of the most important features of the population based evolutionary approach is that every subsequent population attempts to be better than the one in the previous iteration. This is achieved through selection and replacement policies geared towards retaining the better performing candidates at every step. A remove-worst strategy is employed to select the next generation of the population. However, the rejection rate is kept relatively low, around 2-5% of the total population. To keep the population size constant, individuals are selected and put back in the population using a non-elitist random selection policy. This means that every neural network in the population has an equal chance of being selected to replace the worst performing model.
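The remove-worst strategy with random replacement can be sketched as below. Models are represented by plain identifiers and `accuracy` is any callable returning their validation accuracy; both are placeholders, as is the function name `select_and_replace`. In a real implementation the refill entries would be copies of the selected networks, including their weights.

```python
import random


def select_and_replace(population, accuracy, reject_fraction, rng):
    """Drop the worst models and refill the population with randomly chosen survivors."""
    size = len(population)
    num_reject = max(1, round(size * reject_fraction))
    ranked = sorted(population, key=accuracy, reverse=True)
    survivors = ranked[:size - num_reject]
    # Non-elitist refill: every survivor has the same chance of being duplicated
    refill = [rng.choice(survivors) for _ in range(num_reject)]
    return survivors + refill


rng = random.Random(0)
population = list(range(40))                     # 40 model identifiers
accuracy = lambda model_id: model_id / 40.0      # made-up accuracy: higher id is better
next_population = select_and_replace(population, accuracy, reject_fraction=0.05, rng=rng)
```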

To summarize, if \(f_{evo}()\) represents the evolutionary operator function and \(f_{select}()\) represents the population selection function, then \(f_{ept}()\) from (6), for population \(\widetilde {P} (t)\) at any iteration t, can be defined as

$$ \begin{array}{@{}rcl@{}} f_{ept}\left( \widetilde{P} (t)\right) = f_{train} \circ f_{select} \circ f_{evo}\left( \widetilde{P} (t)\right) \end{array} $$
(7)

The function \(f_{ept}()\), as a composition of the other functions mentioned, defines the transition function of the population from one iteration to the next. \(f_{train}()\) trains every CNN in the population, \(f_{select}()\) evaluates the population and selects (or rejects) suitable candidates, some of which then go through the evolutionary operators in \(f_{evo}()\).

As the iterations continue, the population keeps gradually changing, due to architecture alterations, along with appropriate selection and replacement of models.

3.5 Optimization objective

The main objective of our algorithm is to find a neural network with the maximum possible accuracy. Let \(Acc(\widetilde {nn}(T,\omega (t)))\) represent the accuracy of a neural network \(\widetilde {nn}\). Given \(\widetilde {T}\) as the set of all possible architectures and \(\widetilde {NN}\) as the set of all possible neural networks, the objective is to find a neural network \(\widetilde {nn}'\) (\(\widetilde {nn}' \in \widetilde {NN}\)) with architecture (\(T' \in \widetilde {T}\)), such that

$$ \begin{array}{@{}rcl@{}} \max_{\widetilde{nn}' \in \widetilde{NN}} Acc( \widetilde{nn}'(T',\omega(t))) \end{array} $$
(8)

The optimization objective helps to formulate the \(f_{select}()\) function, to which the whole population is subject after every iteration. To maximise accuracy, \(f_{select}()\) chooses the best performing CNNs in the population, i.e. those able to reach a higher accuracy.

It is possible to extend our algorithm to multi-objective search, and we demonstrate the concept by adding minimization of the number of parameters as another objective. Let \(Params(\widetilde {nn}(T,\omega (t)))\) represent the number of parameters of a neural network \(\widetilde {nn}\). Then, the objective is to find a neural network \(\widetilde {nn}^{\prime \prime }\) with architecture (\(T^{\prime \prime } \in \widetilde {T}\)), such that

$$ \begin{array}{@{}rcl@{}} \min_{\widetilde{nn}^{\prime\prime} \in \widetilde{NN}} Params( \widetilde{nn}^{\prime\prime}(T^{\prime\prime},\omega(t))) \end{array} $$
(9)

When the optimization objectives conflict with each other, as is the case with accuracy maximization and parameter minimization, \(f_{select}()\) cannot be as simple as selecting the best candidates based on a linear sort on one criterion. The selection function now has to sort the candidates based on non-domination across the objectives, rather than on any single objective, and select the best candidates. We use the NSGA-II selection algorithm [12] in our work to make sure that both optimization objectives are catered for during the search.
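The notion of non-domination for the two objectives can be illustrated with a short filter that extracts the first non-dominated rank from a set of candidates. This is not the full NSGA-II procedure, which additionally ranks the remaining fronts and uses a crowding distance; the function names and the candidate values are made up for the example.

```python
def dominates(a, b):
    """True if candidate `a` dominates `b`.

    Candidates are (accuracy, num_params) pairs: accuracy is maximised,
    the number of parameters is minimised.
    """
    acc_a, params_a = a
    acc_b, params_b = b
    no_worse = acc_a >= acc_b and params_a <= params_b
    strictly_better = acc_a > acc_b or params_a < params_b
    return no_worse and strictly_better


def pareto_front(candidates):
    """Return the candidates that are not dominated by any other candidate."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


candidates = [(0.90, 500_000), (0.92, 900_000), (0.88, 300_000),
              (0.89, 600_000), (0.92, 1_200_000)]
front = pareto_front(candidates)
```

Here (0.89, 600000) is dominated by (0.90, 500000), and (0.92, 1200000) by (0.92, 900000), so three candidates remain on the front, each representing a different trade-off between size and accuracy.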

3.6 Workflow and algorithm

We assemble all the concepts described above to present an overview of the workflow (Fig. 4), together with the complete algorithm (Algorithm 1). The algorithm begins with the initial set-up of the configuration parameters for both training and evolutionary operators. The evolutionary inputs are \(N_{g}\): the number of iterations, \(N_{p}\): the population size, \(P_{r}\): the recombination rate, \(P_{m}\): the mutation rate and α: the selection policy for each iteration. Additionally, the training parameters \(\tau_{params}\), including the optimizer choice, learning rate, batch size etc., and δ, which determines the size of the subset of data used for piecemeal training, are provided. All the evolutionary and training parameters are empirically selected after conducting a small number of initial experiments, in order to fine-tune the algorithm. The selection of evolutionary parameters is mainly guided by the available computing resources, along with the complexity of the task, whereas the training parameters are determined through a small grid search by training a few architectures from the population prior to commencement of the whole algorithm. These parameters stay constant throughout the algorithm, unless specifically stated.

Fig. 4 Workflow for the evolutionary piecemeal training

The population is generated using the genotype, which represents the search space and contains all the information about clusters and their constraints. InitializePopulation() creates the initial population \(\mathcal{P}_{o}\) of Np neural networks, which can be initialized randomly or by training for a few epochs.

figure c

Once the initial set-up is complete, the iterative core of the algorithm starts and runs for Ng generations. The population at the ith iteration is referred to as \(\mathcal{P}_{i}\). First, the function PiecemealTrain() trains all individuals in the population \(\mathcal{P}_{i}\) with a random subset of the data. Next, EvaluateAccuracy() evaluates the accuracy of every model in the population, based on which BestSelection() selects the α best individuals found so far (\(\mathcal{P}_{best}\)) from the whole population. To keep the rejection rate low, α is chosen to be high (> 0.95 ∗ Np), so that poorly performing architectures are removed from the population only gradually. This approach discourages the promotion of a model that learns fast but cannot eventually reach a high accuracy. Afterwards, to keep the population size constant, 1 − α neural networks (\(\mathcal{P}_{r}\)) are randomly selected from the survivors. The population \(\mathcal{P}_{i}\) is updated by replacing it with \(\mathcal{P}_{best}\) and \(\mathcal{P}_{r}\). Random selection makes sure that subsequent generations do not get crowded with only one parent architecture that reached a higher accuracy by chance, owing to the stochastic nature of training. The evolutionary operators, Recombine() and Mutate(), select individuals (\(\mathcal{P}_{rc}\) and \(\mathcal{P}_{mu}\)) from the population with probabilities Pr and Pm respectively, to alter some of the neural network architectures. Pr is linearly cooled to ≈ 0 from its initial value. The population is then updated with the modified neural networks from \(\mathcal{P}_{rc}\) and \(\mathcal{P}_{mu}\), while the part of the population not undergoing any modification (\(\mathcal{P}_{remaining}\)) remains unchanged for the next iteration of the algorithm.
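The control flow of this loop can be sketched as follows. This is a toy illustration, not the authors' implementation: each "network" is reduced to a layer count and an accuracy value, training is simulated by a simple update rule, α is treated as a fraction of the population, and the refill step is read as copying random survivors. All function bodies and default values here are assumptions made for the sketch.

```python
import random


def piecemeal_train(ind, rng):
    # Stand-in for training on a random data subset of size delta: move the
    # toy accuracy toward a ceiling set by the toy architecture, plus noise.
    ceiling = 1.0 - 1.0 / (1 + ind["layers"])
    acc = ind["acc"] + 0.3 * (ceiling - ind["acc"]) + rng.uniform(-0.01, 0.01)
    ind["acc"] = min(1.0, max(0.0, acc))


def recombine(a, b):
    # Toy recombination: the child keeps a's trained state and averages depth.
    return {"layers": max(1, (a["layers"] + b["layers"]) // 2), "acc": a["acc"]}


def mutate(ind, rng):
    # Toy mutation: add or remove one layer while keeping the trained state.
    return {"layers": max(1, ind["layers"] + rng.choice([-1, 1])), "acc": ind["acc"]}


def evolve(n_g=20, n_p=20, p_r=0.3, p_m=0.3, alpha=0.95, seed=0):
    rng = random.Random(seed)
    pop = [{"layers": rng.randint(1, 10), "acc": 0.1} for _ in range(n_p)]
    n_best = round(alpha * n_p)
    for g in range(n_g):
        for ind in pop:                      # PiecemealTrain()
            piecemeal_train(ind, rng)
        # EvaluateAccuracy() + BestSelection(): keep the n_best most accurate.
        ranked = sorted(pop, key=lambda ind: ind["acc"], reverse=True)
        best = ranked[:n_best]
        # Refill with copies of random survivors to keep the size constant.
        refill = [dict(rng.choice(best)) for _ in range(n_p - n_best)]
        pop = best + refill
        # Recombination rate is cooled linearly toward 0; mutation rate is fixed.
        p_r_now = p_r * (1 - g / max(1, n_g - 1))
        for k in range(n_p):
            if rng.random() < p_r_now:
                pop[k] = recombine(pop[k], rng.choice(pop))
            elif rng.random() < p_m:
                pop[k] = mutate(pop[k], rng)
    return pop


final_pop = evolve()
best_model = max(final_pop, key=lambda ind: ind["acc"])
```

The point of the sketch is the ordering of the steps: training comes first in every generation, selection removes only a small share of the population, and modified individuals carry their trained state into the next generation instead of restarting from scratch.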

After the iterations conclude, the algorithm evaluates all the remaining models in the final population and returns the best neural networks determined. These best models are processed and modified, if needed, and further trained to achieve final CNN configurations. Other hyper-parameter optimization techniques [53] can be utilized to find the optimal training parameters, in order to train the CNNs at this stage.

Furthermore, this methodology is extended to handle multiple objectives besides accuracy. Figure 5 highlights the changes made to the primary workflow for the multi-objective search. The initialization and piecemeal-training steps remain the same, but additional evaluations are performed to determine the number of parameters of each model. Parameter minimization now becomes the second search objective, to be considered along with accuracy maximization. The selection policy is updated to be based on non-dominated sorting [12], which takes all the objectives into account to sort the models in order of their relative quality. The replacement policy, the application of evolutionary operators and the population update steps are unchanged in the extended algorithm.

After the iterations are complete, a Pareto optimal set is selected from the population based on all evaluated objectives. The CNNs in this Pareto optimal set, also called the Pareto Front, are selected to be trained completely. All the models in the Pareto set are considered equally valid candidates for the best model. In this scenario, the final selection lies in the hands of the system designer, and may be based on a higher priority placed on one of the objectives.

We outline the modified and extended algorithm for multi-objective search in Algorithm 2. The changes to the original Algorithm 1 reflect the changes highlighted in the multi-objective workflow in Fig. 5. EvaluateParameters() evaluates the number of parameters of the CNNs in the population. NSGA2Selection() replaces the previous selection algorithm, with α still kept at a very high value. This function is based on the popular multi-objective selection algorithm NSGA-II [12], which takes all objectives into account when selecting the best individuals. The extended algorithm returns the Pareto Front from the population.

Fig. 5 Workflow for the Evolutionary Piecemeal Training, extended version to include multiple objectives

figure d

4 Experiments

In this section, we present the experimental setup and evaluate the algorithm on two datasets. We describe the features and format of both datasets, their respective search spaces with the constraints, and the results achieved. We used the Java-based Jenetics library [1] for the evolutionary operators and computation, and the Python-based Caffe2 library [3] for training and accuracy evaluation. The neural networks were represented in the ONNX format [2], which combines architecture and weights in one file and facilitates the storage and transfer of the CNNs across different modules. All our experiments were performed on a single GeForce RTX 2080Ti GPU.

4.1 Datasets

In the experiments, we used the CIFAR-10 dataset for image classification and the PAMAP2 dataset for human activity recognition. CIFAR-10 is a labelled set of 60,000 images, split into train and test sets in the ratio of 5 : 1. The images are of size 32 × 32 × 3 and are divided into 10 classes. We reserved 5,000 images from the training set for validation during the search process. The test set was used only to evaluate the final accuracy of the neural network, which is what we report in this paper. Standard data augmentation techniques were deployed [20, 44], which include small amounts of translation, crop, rotation and horizontal flip. During the piecemeal-training process, an image set of size δ was randomly chosen from the training set, out of which approximately 50% were subjected to the augmentation pipeline.
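The subset selection for piecemeal training can be sketched as below. This is an illustration under stated assumptions: the subset is drawn from the training images that remain after the validation hold-out, each chosen image is flagged for augmentation independently with probability 0.5, and the function name `sample_piecemeal_subset` is hypothetical. The value δ = 4,000 is the one used in the CIFAR-10 training setup.

```python
import random


def sample_piecemeal_subset(n_train, delta, augment_fraction=0.5, seed=None):
    """Pick delta distinct training indices and flag roughly half for augmentation."""
    rng = random.Random(seed)
    indices = rng.sample(range(n_train), delta)
    # Each selected image goes through the augmentation pipeline with
    # probability augment_fraction (an assumption about how "approximately
    # 50%" is realized).
    return [(idx, rng.random() < augment_fraction) for idx in indices]


# 60,000 images split 5:1 give 50,000 training images; 5,000 are held out.
n_train = 60_000 * 5 // 6 - 5_000
subset = sample_piecemeal_subset(n_train, delta=4_000, seed=0)
n_augmented = sum(flag for _, flag in subset)
```

Because a fresh subset is drawn at every generation, each model sees a different slice of the data in each round, which is what makes the training piecemeal.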

The markedly different PAMAP2 dataset compiles recordings from body-worn sensors. The input is organised as time-series data from a total of 40 channels, coming from three Inertial Measurement Units (IMUs) and a heart rate monitor. The person wearing these sensors performed one of twelve different everyday activities. We ignored some of the optional activities provided in the dataset. For a fair comparison with other papers, the validation and test sets were the same as in [21, 51], i.e. the recordings from participants 5 and 6 were used as validation and test sets respectively. To prepare the data, the recordings from all IMUs were first downsampled to 30 Hz, and the data was then segmented through a sliding window approach, with a window size of 3s (100 samples) and a step size of 660ms (22 samples). For data augmentation, we moved the sliding window using different step sizes while the window size was kept the same at 3s.
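The segmentation step can be sketched as a small helper. It works in sample counts (100 samples per window, 22 samples per step) rather than in seconds, so it does not depend on the exact sampling rate. The function name and the recording length of 1,000 samples are hypothetical.

```python
def sliding_windows(n_samples, window=100, step=22):
    """Return (start, end) index pairs of all full windows over a recording."""
    return [(s, s + window) for s in range(0, n_samples - window + 1, step)]


# A hypothetical recording of 1,000 samples.
windows = sliding_windows(1000)

# Augmentation as described above: same window size, different step size.
augmented = sliding_windows(1000, step=11)
```

Halving the step size roughly doubles the number of windows drawn from the same recording, which is how varying the step acts as augmentation for time-series data.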

The experiment for the extended, multi-objective version of the algorithm was performed on the PAMAP2 dataset only. The definition of the architecture search space and the initialization of hyper-parameters for both evolutionary operators and training are exactly the same as for the original algorithm.

4.2 Search space

We outline the search space specifications for CIFAR-10 and PAMAP2 in Tables 1 and 2 respectively. In the CIFAR-10 experiments, the number of kernel units per layer is a multiple of 16, which brings the total number of models in the search space to the order of \(10^{8}\). For PAMAP2, the number of kernels per layer is a multiple of 8 and the total number of design points is of the order of \(10^{5}\).

Table 1 CIFAR-10 architecture search space
Table 2 PAMAP2 architecture search space

4.3 Training setup

The search on the CIFAR-10 dataset ran for 80 generations with a population size of 80. The data size, δ, used by the piecemeal-training was set to 4,000 images. Every convolution and fully connected layer was followed by a ReLU activation. Training was performed with the Adam optimizer and a batch size of 80. The initial learning rate was set to 5e-4, with a step decay policy in which the learning rate was reduced by 1e-4 at intervals of 20 iterations. The evolutionary selection probabilities Pm and Pr were both set to 0.3 at the beginning. Pm stayed constant, whereas Pr was linearly reduced to reach 0.01 at the last iteration.
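The two schedules described above can be written out as a short sketch. It assumes that the learning-rate schedule counts the same iterations as the search (80 in total) and that Pr is interpolated linearly between the first and the last generation; the function names are hypothetical.

```python
def learning_rate(iteration, initial=5e-4, drop=1e-4, interval=20):
    """Step decay: subtract `drop` once every `interval` iterations."""
    return initial - drop * (iteration // interval)


def recombination_rate(generation, n_generations=80, start=0.3, end=0.01):
    """Linear interpolation from `start` (first generation) to `end` (last)."""
    frac = generation / (n_generations - 1)
    return start + (end - start) * frac


# Learning rate at the start of each 20-iteration block.
lr_steps = [learning_rate(i) for i in (0, 20, 40, 60)]

# Recombination rate at the first and the last generation.
pr_first, pr_last = recombination_rate(0), recombination_rate(79)
```

Under these assumptions the learning rate passes through 5e-4, 4e-4, 3e-4 and 2e-4 over the 80 iterations, while Pr falls from 0.3 to 0.01, so architecture changes become rarer as the models approach convergence.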

For the second dataset, PAMAP2, the search ran for 30 generations with a population size of 50. The data size, δ, for piecemeal-training was set to 20,000 samples. ReLU activations follow every convolution and fully connected layer. The training was performed with the Adam optimizer, a batch size of 100, and a constant learning rate of 1e-4. The evolutionary selection probabilities Pm and Pr were both initialized to 0.3. As in the CIFAR-10 experiment, Pm was kept constant, whereas Pr was linearly reduced.

The CIFAR-10 models consume substantially more memory than the PAMAP2 models, which limits the amount of training parallelism on a single GPU with 11 GB of memory. For CIFAR-10, 4 parallel training threads could execute, while 7 threads could run simultaneously for PAMAP2. The level of parallelism was governed by the available GPU memory. Additionally, no Batch Normalization was used during the search in order to speed it up, since Batch Normalization consumes more memory and therefore reduces the parallelism. Once the search finished, the best model found was altered to have a Batch Normalization layer following every convolutional layer, and was trained for 100 more epochs.

4.4 Results

This section presents the results obtained through the algorithm execution, which we then compare with the state-of-the-art. We show the results of the extended multi-objective search algorithm for PAMAP2 dataset afterwards.

The training curves for experiments on PAMAP2 and CIFAR-10 are depicted in Fig. 6, where accuracy maximization is the only objective in consideration. As the iterations continue, it can be observed that the average accuracy of the whole population generally increases, despite architecture modifications interrupting the training. The best accuracy of any individual in the population is similarly increasing gradually with each iteration. The best model in one iteration may be different from the best one in the next iteration. The best model(s) discovered at completion was trained further for more epochs to achieve the accuracy that is reported in Tables 3 and 4.

Fig. 6 Training curves. Average accuracy refers to the average performance of the whole population at the given search iteration. Best accuracy refers to the best performance of any individual model in the population

Table 3 CIFAR-10 accuracy comparisons with evolutionary approaches
Table 4 PAMAP2 accuracy comparisons

The first experiment, using the CIFAR-10 dataset, consumed 2 GPU-days and reached a best prediction accuracy of 92.5% on the test set. Table 3 compares our results with other evolutionary NAS approaches. The accuracy of 92.5% ranks slightly lower than the other published works; the key difference, however, is that our architecture space is defined for plain CNNs. Architectural enhancements such as residual connections and cells are deliberately omitted from our search space. In addition, advanced data augmentation techniques such as mixup [54] or cutout [15] were not deployed. Other approaches commonly use a hybrid search space, which may include different cell modules or architecture blocks along with arbitrary residual connections.

Despite the lower accuracy, we emphasize the shorter convergence time of only 2 GPU-days when compared with other evolutionary NAS methodologies. The best CNN found by the search algorithm had 13 convolutional layers and 2 fully connected layers, as summarized in Fig. 7a.

Fig. 7 Best neural networks found for (a) the CIFAR-10 dataset and (b) the PAMAP2 dataset. Every convolutional layer is followed by a Batch Normalization and a ReLU activation layer

The PAMAP2 dataset is unbalanced, which means that some of the classes are over-represented in the dataset. For this reason, we report not only the classification accuracy, but also the weighted F1-score (F1w) and the mean F1-score (F1m). Both scores are computed from the precision and recall of each class. The weighted score weighs each class by its share of the samples in the dataset, while the mean score gives every class equal weight. They are computed as:

$$ \begin{array}{@{}rcl@{}} F1_{w} & = &\sum\limits_{i} 2 \times \frac{n_{i}}{N} \times \frac{precision_{i} \times recall_{i}}{precision_{i} + recall_{i}}\\ F1_{m} &= &\frac{2}{|C|} \times \sum\limits_{i} \frac{precision_{i} \times recall_{i}}{precision_{i} + recall_{i}} \end{array} $$

Here, \(n_{i}\) is the number of samples in class i, N is the number of data points in the whole dataset, and |C| is the number of classes. Compared with the classification accuracy, the F1-scores give a better picture of the performance of a neural network, especially for unbalanced datasets. We compute both F1-scores in order to compare our results with other published works.
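As a worked illustration, both scores can be computed from label lists in a few lines. The weighted score scales each per-class F1 by the class share of the samples, and the mean score averages the per-class F1 over the number of classes. The function name and the toy labels are hypothetical.

```python
def f1_scores(y_true, y_pred):
    """Return (weighted F1, mean F1) computed from per-class precision and recall."""
    classes = sorted(set(y_true))
    n_total = len(y_true)
    f1_weighted = 0.0
    f1_sum = 0.0
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        n_c = tp + fn                      # samples that truly belong to class c
        f1_weighted += (n_c / n_total) * f1
        f1_sum += f1
    return f1_weighted, f1_sum / len(classes)


# Toy unbalanced example: four samples of class 0, two of class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1]
f1_w, f1_m = f1_scores(y_true, y_pred)
```

In this example the per-class F1 values are 6/7 for class 0 and 0.8 for class 1, so the weighted score (about 0.838) is pulled toward the larger class while the mean score (about 0.829) treats both classes equally.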

Our algorithm achieved strong results on the PAMAP2 dataset. The search took 10 GPU-hours, and the best neural network discovered reached a prediction accuracy of 94.36% after complete training. Table 4 compares our algorithm's results against other published works. In direct comparison, the grid search [21] on neural networks for PAMAP2 reached 93.7% at best, and another hand-crafted model [34] achieved 93.21%. These results indicate that our methodology is more effective than naive approaches such as random search or grid search. The best neural network found had 7 convolutional layers and 3 fully connected layers, as shown in Fig. 7b.

Additional experiments were performed with the extended multi-objective search algorithm on the PAMAP2 dataset. The conflicting search objectives were to maximize accuracy while simultaneously minimizing the number of parameters of the CNN model. Once the algorithm had finished all the iterations, we plotted accuracy against the number of parameters for all models in the population, as shown in Fig. 8a. The Pareto Front selected by the algorithm is marked in red. The candidates that lie on the Pareto Front exhibit the trade-off between the two objectives: none of them is better than another with respect to both objectives.

Fig. 8 Pareto Fronts for accuracy vs parameters of CNNs. Figure 8a shows the Pareto Front created during the search with EPT. The scattered points represent CNNs from the final population upon convergence. The candidates from the Pareto Front are further trained and Fig. 8b shows the “trained” Pareto Front. The Pareto Front is finally updated (Fig. 8c) to remove points that do not fall on it anymore

Subsequently, all the candidates on the Pareto Front were processed and trained further, after the addition of the Batch Normalization layers. Figure 8b now presents two Pareto Fronts, first one being the Pareto Front determined by the algorithm (in red), and the second curve (in green) depicts the associated “trained” Pareto Front. The latter curve plots the accuracy for the same neural networks on the former curve, but as completely trained models.

However, once the training is complete, it is highly probable that some of the models do not follow the rules of a Pareto optimal set anymore. Figure 8c shows a closer look at the “trained” Pareto Front, where the point marked in red clearly does not belong to the Pareto Front any longer. Those points are then removed from consideration and eventually a final Pareto Front is deduced.
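The pruning step can be sketched by re-applying the dominance test to the fully trained models. The model names, accuracies and parameter counts below are hypothetical and serve only to show a point dropping off the front after training.

```python
def pareto_front(models):
    """Keep models not dominated by any other (higher accuracy, fewer parameters)."""
    def dominated(m):
        return any(
            o["acc"] >= m["acc"] and o["params"] <= m["params"]
            and (o["acc"] > m["acc"] or o["params"] < m["params"])
            for o in models
        )
    return [m for m in models if not dominated(m)]


# Hypothetical front found during the search (partially trained models).
searched = [
    {"name": "A", "acc": 0.80, "params": 200_000},
    {"name": "B", "acc": 0.84, "params": 350_000},
    {"name": "C", "acc": 0.86, "params": 600_000},
]

# The same architectures after complete training: B now trails the smaller A.
trained = [
    {"name": "A", "acc": 0.900, "params": 200_000},
    {"name": "B", "acc": 0.895, "params": 350_000},
    {"name": "C", "acc": 0.930, "params": 600_000},
]

front_before = [m["name"] for m in pareto_front(searched)]
front_after = [m["name"] for m in pareto_front(trained)]
```

Here all three models lie on the front during the search, but after training model B is both larger and less accurate than model A, so only A and C remain.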

The final Pareto Front gives a good indication of the trade-off between the size of a neural network and its performance. The Pareto Front for PAMAP2 contained 5 points, ranging from 89.99% to 93.34% accuracy, with models consisting of ≈ 200k to ≈ 600k parameters. This gives a designer an important tool to pick the most suitable neural network to deploy on a wearable device. When memory is limited on the hardware platform, a designer may find it acceptable to use a model with slightly lower accuracy, especially where a lower memory footprint may also result in a better response time of the device.

5 Conclusion

A novel methodology for NAS, Evolutionary Piecemeal Training, was presented in this paper. For a given task, our algorithm traverses the search space of plain CNNs with the aim of discovering an efficient neural network architecture within reasonable constraints. The methodology was validated on two markedly different datasets, to demonstrate that our approach is versatile and can be adapted to suit distinct domains. Moreover, we illustrated that for moderate complexity tasks, for example the PAMAP2 dataset, our algorithm outperforms random or grid search methodologies. The flexibility of our algorithm was further demonstrated by extending it to perform multi-objective optimization, which allows the neural network to be optimized simultaneously for multiple task-specific objectives, such as parameter minimization and accuracy maximization.

In the future, we aim to modify this search algorithm to incorporate hardware metrics and optimize the resource usage of the neural network on the designated hardware. We envision that the Pareto Front obtained for multiple hardware-specific objectives will allow the designer to have better design choices and more flexibility, in switching from one CNN to another systematically, for both the given task and the hardware. Additionally, we plan to modify the search space to include complex and cell-based neural architectures for further research.