1 Introduction

It is challenging for the hearing impaired to identify important sounds such as running water, dogs barking, and crying babies. Sound event classification systems typically feed spectrograms into image classification networks with great results [9]. Much of the sound event classification work focuses on a paradigm where audio is sent over an internet connection to a large neural network (such as ResNet50 [8] with 20+ million parameters) that classifies the sound in the cloud. This approach relies on a good internet connection. We instead search for a neural network that runs locally, continuously, for a full day on a battery-powered device (e.g., smartwatch, earphones, phone). This requires a network that is energy-efficient enough not to drain the device's battery prematurely and small enough to fit into device memory.

Much of the platform-aware neural architecture search (NAS) literature has focused on inference time (latency) as a user experience requirement for image classification [1, 21]. We argue instead that energy usage is the more important limiting factor for always-on models.

An always-on audio model calculating an inference once every second makes 86,400 inferences per day. As a result, the energy required per model inference is a critical factor when searching for the best architecture. A smartphone might have a battery capacity of around 51 kJ (e.g., Google Pixel 4 XL), a smartwatch around 3.6 kJ (e.g., Fitbit Versa 3), and earphones around 0.7 kJ (e.g., Pixel Buds 2). For comparison, the baseline solution of deploying a high-performance network like MobileNetV2 [18] on a Pixel 4 XL big core CPU uses 14 mJ per inference (1.21 kJ per day) when running sound event classification on spectrograms. A network of that size will clearly consume a small device's battery capacity quickly.
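As a rough illustration of these budgets (a minimal sketch using only the figures quoted above; the battery capacities are nominal values), the daily cost can be computed as follows:

```python
# Daily energy budget of an always-on model, using the figures quoted above.
INFERENCES_PER_DAY = 24 * 60 * 60            # one inference per second -> 86,400 per day
ENERGY_PER_INFERENCE_MJ = 14.0               # MobileNetV2 on a Pixel 4 XL big core CPU

daily_energy_kj = INFERENCES_PER_DAY * ENERGY_PER_INFERENCE_MJ / 1e6   # mJ -> kJ, ~1.21 kJ

battery_kj = {"Pixel 4 XL": 51.0, "Fitbit Versa 3": 3.6, "Pixel Buds 2": 0.7}
for device, capacity in battery_kj.items():
    print(f"{device}: {100 * daily_energy_kj / capacity:.1f}% of battery per day")
```

On the smartwatch and earphone budgets above, this baseline alone would consume roughly a third of, or more than, the entire battery each day.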

We introduce the first neural architecture search that incorporates the energy usage of the implementation. Our search also minimizes the memory footprint of the neural network which, like energy usage, can be a limiting factor for model deployment on mobile and edge computing devices where total SRAM is limited. Our NAS builds upon related hardware-constrained searches (i.e., searches constrained by hardware limitations such as memory). To find networks that also optimize for low energy and memory usage, we incorporate these constraints into our reward function, which we discuss in Sect. 3.

We would like to guide our NAS with real hardware energy measurements, but at the scale we are operating (thousands of evaluations per task) this is prohibitive. In this work, we train a random forest model on 10,000 candidate architectures from our search to accurately predict the energy usage of a candidate architecture. We choose a random forest model since random forests are known to work well across a variety of problem domains [2], and we found that it outperforms a linear model. After running our search, we run the top three performing neural architectures on Pixel 4 XL CPUs five times to get average energy usage statistics.

We benchmark our work by comparing it with the efficiency of a state-of-the-art network, in this case MobileNetV2, which performed well on the related task of 2D image classification (note, in this task we are classifying 2D spectrograms) [18]. Our NAS focuses on an audio classification task whose task and dataset are defined in Sect. 4. We constrain each candidate neural network to a maximum of twelve sequential operations. For each sequential position in our candidate network, our search suggests an operation (e.g., either a 3\(\times\)3 convolution or a 5\(\times\)5 convolution). The possible operations in a NAS (referred to here as the search space) are often defined based on what operations are found in state-of-the-art models for the task/dataset (in this case MobileNetV2). We discuss the search space further in Sect. 5.

Our search algorithms, the evolutionary and Bayesian algorithms defined in Sect. 6, suggest collections of block operations that define candidate neural networks. Each search algorithm seeks architectures that optimize a reward function which scores every candidate. We use early stopping, halting the training of unpromising architecture candidates to reduce the computational burden of the search.

In sum, we present a simple-to-implement neural architecture search that targets on-device energy efficiency and low memory usage for always-on audio models in order to satisfy the constraints outlined above. Our main contributions are:

  • We introduce a multi-objective neural architecture search that optimizes not only accuracy but also memory and energy usage. We employ both Bayesian and evolutionary search algorithms, with the evolutionary algorithm returning slightly better results.

  • We train a random forest model to predict the energy usage of candidate neural network architectures in our search space. The model achieves an RMSE of 0.07 mJ per inference, a small fraction of the typical energy usage per inference in our search space. This allows us to perform an architecture search that includes energy usage estimates without the added complication of including hardware in the search loop. After the search, we verify the performance of the winning implementations on hardware to confirm the result.

  • We evaluate our method on a MobileNet-based search space and find a model with accuracy slightly better than MobileNetV2 with 10\(\times\) less energy usage, and 50\(\times\) smaller memory footprint (Table 2).

  • We show FLOPs are not a good proxy for energy usage, even on a mobile CPU. Inference time (latency) is a better but still imperfect proxy, because power draw is not consistent across neural networks: memory accesses and arithmetic operations consume different amounts of power (Fig. 2).

  • Our search identifies a computational bottleneck created by combining spectrograms with 2D convolutional blocks (Table 1), which is the typical architecture for audio classification. We show that an alternative approach of swapping the frequency axis with the depth axis of the spectrogram and using 1D convolutional blocks reduces energy usage but significantly underperforms on accuracy.

2 Related work

Several papers have explored neural architecture searches for neural networks intended for mobile devices. In particular, the MNAS paper of Tan [21] included the inference time (latency) of each architecture in the search reward function. Their search kept a mobile phone in the search loop: candidate architectures were run on the phone to return latency measurements.

In the TuNAS paper of Bender [1], the authors avoided using hardware in the search loop to reduce software/hardware engineering requirements, since it is significant work to connect mobile phones and measurement devices to the cloud where the NAS takes place. They instead trained a linear model to predict the inference time of each architecture suggested by their search and used it to rank candidate architectures.

We, on the other hand, target energy per inference instead of inference time and also include a third term, memory usage, in our reward function. Similar to TuNAS, we avoid using hardware in the loop and instead train a model (a random forest rather than a linear model, since it performs better) to predict the energy usage of each network architecture in our search space.

TuNAS' search algorithm creates a super (meta) network that includes all possible architectures in one network and drops out entire paths during training. This search algorithm is very efficient since only one (large) architecture needs to be trained instead of many candidate architectures trained separately. After training, a single path in the supernetwork is unmasked so that a single network's performance can be scored using the supernetwork's trained weights. However, for the trained weights of the supernetwork to resemble the weights of the standalone architecture, a significant fraction of paths must be dropped out (turned off) during training, which can make training unstable, an observation the paper's authors also made. This instability is the reason we instead opted to use Vizier's algorithms, at the likely cost of additional computation time during the search.

MNAS and TuNAS use reinforcement learning (RL) to suggest new architecture candidates, whereas we use both Bayesian and genetic algorithms. We made this decision because the NAS literature has shown that evolutionary algorithms yield results similar to RL for image classification tasks [17], and we expect our task of classifying 2D spectrograms to behave similarly. The Bayesian and genetic algorithms are also easier to set up out of the box.

Our search space is most similar to that of Wu's FBNet paper [23]. The FBNet authors used a search space of different block operations, searching over the convolution kernel size, number of filters, and expansion parameter of each block in the architecture. We similarly search over the kernel size and number of filters. However, FBNet only searches over MobileNetV2-like operation blocks, whereas we also include MobileNetV1 blocks and other smaller block types, which we hypothesize might be more energy-efficient. Like FBNet, TuNAS also builds a MobileNetV2-based search space. We use a smaller maximum network size than TuNAS and FBNet since we target more energy- and memory-efficient networks. We summarize our NAS search features and compare them to related hardware-constrained searches in Table 1. One difference not shown in the table (to keep it readable) is that we focus on audio classification rather than image classification.

Table 1 Comparison of related hardware-constrained neural architecture searches

3 Optimization criteria

We seek a neural network that balances energy efficiency and memory usage while still achieving state-of-the-art accuracy. One option for our search would be to optimize accuracy while treating memory usage and energy usage as hard constraints. This yields Eq. 1, where x is the evaluation dataset, ACC is the accuracy of a candidate network h in our NAS search space H, \(\text {MEM}\) is the memory footprint, and \(\text {ENERGY}\) is the energy usage per inference of the network.

$$\begin{aligned} \max _{h \in H} \quad&\mathrm{{ACC}}(h(x))\\ \text {s.t.} \quad&\mathrm{{MEM}}(h) \le M_{0}\\&\mathrm{{ENERGY}}(h) \le E_{0} \end{aligned}$$
(1)

As noted by the MNASNet authors, this approach maximizes a single metric and does not yield multiple Pareto optimal solutions [21]. We are looking for Pareto optimal models (i.e., models with the maximum accuracy for a given memory and energy usage). To approximate the Pareto optimal solutions, we combine these optimization constraints into a single objective via a weighted sum (note, MNASNet used a weighted product). In addition, since we do not need the absolute lowest energy or memory usage, we only apply the penalties above chosen thresholds. The reward in Eq. 2 proportionally penalizes larger memory sizes and energy usages.

$$\begin{aligned} R = \mathrm{{ACC}}(h(x)) - b\,\max (0, \text {ENERGY}(h)-E_0) - c\,\max (0, \text {MEM}(h)-M_0) \end{aligned}$$
(2)

Memory and energy usage is penalized with a ReLU function that activates after the thresholds, \(M_0\) and \(E_0\), respectively, are crossed. In this study, we use an energy threshold, \(E_0\), of 1.25 mJ per inference, which amounts to slightly more than 0.2% of the Pixel 4 XL battery when the network is running one inference per second all day. Above the energy threshold, \(E_0\), we explored two different slopes: b and b’. The harsher penalty b is set to \(\frac{0.02}{0.75~\mathrm{{mJ}}}\). Thus, above this energy threshold a 0.75 mJ increase in energy per inference must give at least a 2% increase in accuracy for the same reward. The less harsh penalty sets \(b'=\frac{0.02}{1.75~\mathrm{{mJ}}}\).

We use a memory size threshold, \(M_0\), of 60 kB, above which larger memory sizes are penalized with slope \(c=\frac{0.02}{30~\text {kB}}\). The 60 kB threshold is chosen to allow the network to be deployed on a wide variety of SRAM limited devices (e.g., smartphones, smartwatches and earphones) which are expected to have several machine learning applications running simultaneously. The chosen slope means that a 90 kB model must have at least 0.02 accuracy points more than a 60 kB model for it to have a better reward. In the next five subsections, we discuss the quantized accuracy, measuring physical energy usage, approximate energy usage metrics, how we approximate energy usage during NAS using a random forest, and finally the memory usage in the reward function in Eq. 2.
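The following is a minimal sketch of the reward in Eq. 2 using the thresholds and slopes above (the function and argument names are ours, for illustration only):

```python
def nas_reward(quantized_accuracy: float, energy_mj: float, memory_kb: float,
               e0_mj: float = 1.25, m0_kb: float = 60.0,
               b: float = 0.02 / 0.75, c: float = 0.02 / 30.0) -> float:
    """Eq. 2: accuracy minus ReLU-shaped energy and memory penalties.

    b = 0.02/0.75 per mJ is the harsher energy slope (use 0.02/1.75 for b');
    c = 0.02/30 per kB penalizes memory above the 60 kB threshold.
    """
    energy_penalty = b * max(0.0, energy_mj - e0_mj)
    memory_penalty = c * max(0.0, memory_kb - m0_kb)
    return quantized_accuracy - energy_penalty - memory_penalty
```

For example, a candidate with 0.90 quantized accuracy, 2.0 mJ per inference, and 90 kB of memory receives a reward of 0.90 - 0.02 - 0.02 = 0.86.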


3.1 Quantized accuracy

Fig. 1 Quantized network performance. Left: categorical accuracy versus int-8 quantized TFLite accuracy for two thousand candidate architectures sampled by the Vizier Bayesian (hybrid) algorithm. Right: parameter count plotted against the quantized TFLite memory size

To measure the performance of the candidate architectures, we use the accuracy of the 8-bit integer quantized TFLite model. The network is quantized using integer-8 quantization-aware training with the Tensorflow framework [4] to minimize the memory and energy usage. There is generally good agreement between the non-quantized accuracy and the quantized accuracy (correlation of 0.955), but there are some outliers (up to 6.5% disagreement in accuracy) as seen in Fig. 1. Since we are targeting on-device inference, we use the quantized accuracy in our reward function.

3.2 Physical energy measurements

We use a Monsoon power monitor [14] to measure the average power draw of a phone (without battery) running a candidate architecture. The energy per network inference is platform dependent, and thus for this paper we focus on the big core CPU of the Pixel 4 XL. During the measurement, we lock the CPU core frequency and use a single thread. The average inference time is measured using the TFLite benchmarking tool. We use these energy measurements in three ways: to check the approximations others have used (Sect. 3.3), to train an approximate model to help guide the NAS (Sect. 3.4), and finally to verify the energy measurements shown in this paper (by repeating the measurement 5 times and reporting the mean and standard deviation).

3.3 Inference time and FLOPs as energy proxies

Other papers have used FLOPs (the total number of floating point operations of the unquantized model) or inference time (latency) to approximate energy usage [12, 13]. We discuss the drawbacks of these approaches. Figure 2 shows that a network with a FLOP count of 10 million might use anywhere between 0.6 mJ and 1.5 mJ per inference. This agrees with reports from several authors that FLOP count is a poor proxy for on-device energy usage, likely because memory accesses are not accounted for in the FLOP count [25].

Fig. 2 Two approximations to the total energy consumption using FLOPs (left) and inference time (right). These scatter plots are based on 15,000 randomly selected architectures in our search space measured on a Pixel 4 XL

We find the correlation between the average inference time and the average energy per inference (which is simply the average inference time multiplied by the average power) is 0.989 over our search space (Fig. 2). Despite the good correlation, an average inference time of 0.85 ms can correspond to anywhere between 1.03 and 1.30 mJ per inference. This variation in energy usage is caused by variation in power draw between small architectural changes. We think these changes have an outsized influence on energy usage because of differences in parallelism and CPU cache optimization: within the same unit of time, the CPU may operate at a different fraction of its full capacity, for example due to differing degrees of vectorized instructions. Note, the inference time of each network is also sometimes referred to as the latency of the model in the computer vision NAS literature.

3.4 Approximating energy via a random forest for NAS

In this paper, we have access to physical power measurements on Pixel devices. Measuring the energy usage of each NAS candidate directly is a difficult software engineering task, since the NAS runs in the cloud and would need to be connected to physical hardware that automates loading the network, running it, and measuring its average energy. We instead opt to train a model that can accurately predict the energy usage of a given network and use it to estimate the energy usage of each NAS candidate rather than taking physical measurements. This energy estimate is then fed into our reward function (Eq. 2) to help us rank NAS candidates. At the end of our search, we gather the energy measurements of the best candidates on real hardware and report them in this paper.

Note, the alternative approach of using inference time as an energy proxy would require us either to connect hardware in an automated way to our NAS [21], which as discussed is a difficult engineering problem, or to build a model of inference time [1] to use during our search. We instead select a more direct route that avoids a complicated software/hardware engineering connection and train a model directly on energy usage data to predict energy usage.

We measure the average energy per inference of 15,000 architectures in our search space on the big core CPU of the Pixel 4 XL phone to train a model that predicts energy usage for a given network. We employ a random forest (RF) model to predict the energy usage of models in our search space. We also tried a linear model, but it performed worse than the random forest at predicting the energy usage of candidate architectures in our search space. The choice of a random forest model was motivated by the fact that decision trees are universal approximators (they can approximate any function) and random forests have been applied out of the box to a wide variety of problems successfully [2, 16]. We used tenfold cross-validation to tune the random forest hyperparameters.

Fig. 3 Random forest model fit to the energy per inference of networks in the search space

The random forest model takes as input the architecture parameters (e.g., kernel sizes/filter types of each block) as well as network-level parameters, the total FLOP count and the TFLite memory size. The RF model has an \(R^{2}=0.92\) and an RMSE of 0.07 mJ per inference (Fig. 3). For reference, running a model with 1.3 mJ per inference from our search space five times on a Pixel phone gives an energy standard deviation (i.e., measurement noise) of 0.068 mJ. The RMSE is thus close to the measurement noise from our phones, suggesting that the RF model is a good approximation for the energy usage of a network architecture. This allows us to perform the NAS exploration fully on servers, without remeasuring the energy usage of each network architecture permutation on the phone. We also tried a linear regression model since the authors of TuNAS had success with a linear model [1]. The linear model achieved an \(R^2\) of 0.89 and an RMSE of 0.089 mJ per inference, both significantly worse than the RF model.
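A minimal sketch of how such a predictor can be trained with scikit-learn, assuming the measured architectures are stored as a feature matrix (per-block kernel/filter/block-type choices plus total FLOPs and TFLite size) with the measured energy in mJ as the target; the file names and hyperparameter grid are illustrative, not the values used in this work:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# X: one row per measured architecture (per-block kernel sizes, filter counts,
#    block-type ids, plus total FLOPs and TFLite model size).
# y: measured mean energy per inference on the Pixel 4 XL big core CPU, in mJ.
X = np.load("arch_features.npy")   # hypothetical feature file
y = np.load("energy_mj.npy")       # hypothetical target file

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10, 20]},
    cv=10,                                       # tenfold cross-validation, as in the text
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)
energy_model = search.best_estimator_            # used to score NAS candidates
print("Cross-validated RMSE (mJ):", -search.best_score_)
```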

3.5 Memory footprint

The SRAM available on small devices is somewhere between 10 kB and 1 MB, and this memory is shared among multiple applications. In this search, we use the TFLite executable size, i.e., the static memory of the application, in our objective. In Fig. 1, we compare the TFLite executable size to the parameter count of the network. Despite the parameter count being well correlated (\(R^{2}\) = 0.94) with the quantized memory size of the network, it is still far from a perfect proxy: a parameter count of thirty thousand can correspond to anywhere between 50 and 65 kB of memory. The discrepancy is due to the integer-8 quantization-aware training with the Tensorflow framework [4] that we employ to minimize the memory usage.
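A minimal sketch of how this memory metric can be obtained, assuming a trained Keras model; the helper name is ours and the representative-dataset generator is a placeholder:

```python
import tensorflow as tf

def tflite_int8_size_kb(model: tf.keras.Model, rep_spectrograms) -> float:
    """Convert a Keras model to an int-8 TFLite flatbuffer and return its size in kB,
    i.e., the MEM(h) term used in the reward of Eq. 2."""
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = lambda: (
        [spec[None, ...].astype("float32")] for spec in rep_spectrograms)
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8
    tflite_model = converter.convert()
    return len(tflite_model) / 1024.0
```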

4 Sound event classification dataset

We use the AudioSet dataset, which contains over 2 million human-annotated 10-second sound clips derived from YouTube videos [6]. The AudioSet ontology contains more than 500 classes, but we use a subset of them to limit the complexity of our task. Specifically, we chose labels that mimic Sound Notifications on Android. The eight positive classes are (parentheses indicate the original AudioSet labels when multiple labels were mapped to one):

  • Alarms (fire alarm, smoke alarm, CO alarm)

  • Baby crying

  • Dog barking (dog, bark, yip, howl, bow-wow, growling)

  • Door knocking

  • Doorbell (doorbell, ding-dong)

  • Phone ringing

  • Sirens (emergency vehicle, police car, ambulance, fire truck)

  • Water running

We map all other classes in AudioSet to a single negative class. This makes the dataset somewhat challenging, since the negative examples are all real sound events (e.g., guitar playing) and not simply low-volume noise. In total, we have 9 classes, one of which is the negative class. We use the original train/evaluation/test split from AudioSet and ensure that our training/evaluation/test data comprises 50% negative-class examples. The log-mel spectrograms of the data are computed and augmented with SpecAugment [15]. We believe this mapping of AudioSet is a representative task for always-on sound event classification, while the dataset is also large enough for a NAS study.
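A minimal sketch of a log-mel front end of this kind, assuming mono 16 kHz audio; the frame and mel parameters are illustrative placeholders rather than the exact settings behind our 196\(\times\)40 spectrograms (SpecAugment time/frequency masking is then applied to the result during training):

```python
import tensorflow as tf

def log_mel_spectrogram(waveform: tf.Tensor, sample_rate: int = 16000,
                        frame_length: int = 400, frame_step: int = 160,
                        num_mel_bins: int = 40) -> tf.Tensor:
    """[num_samples] float32 waveform -> [num_frames, num_mel_bins] log-mel spectrogram."""
    stft = tf.signal.stft(waveform, frame_length=frame_length, frame_step=frame_step)
    magnitude = tf.abs(stft)
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=num_mel_bins,
        num_spectrogram_bins=magnitude.shape[-1],
        sample_rate=sample_rate)
    mel = tf.matmul(magnitude, mel_matrix)
    return tf.math.log(mel + 1e-6)   # small offset avoids log(0)
```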

5 Search space

Fig. 4 A block diagram of our network architecture search process

The art of a neural architecture search lies in efficiently exploring a good search space. The search space defines the possible neural network architectures. A standard approach to defining a search space is to first find a model that achieves good performance on the dataset and task of interest and decompose that model into its component blocks (e.g., a good-performing network with a 5\(\times\)5 convolution, 3\(\times\)3 max-pool and 3\(\times\)3 convolution with skip connection would decompose into a search space that includes these three operations) [23]. We make use of MobileNetV1 [11] and MobileNetV2, which popularized depthwise separable convolutions, as benchmark models and use their two namesake block operations in our search space. We fix our network size to be twelve sequential blocks. We chose the number twelve after some (though not exhaustive) experimentation on how many blocks would be required to achieve MobileNetV2-like accuracy.

For every sequential block position in our network, we search for the block type, the number of output filters, and the convolution kernel size. The search process is illustrated in Fig. 4. Since the optimal parameters for each block depend on its position (e.g., a larger kernel is not so useful when the image size becomes very small toward the end of the network), we make the possible choices position dependent. Each of the twelve blocks has between nine and thirty options to choose from. To ensure the image size at the end of the network is the same for all possible candidates, we fix the striding for all candidates to give a 7\(\times\)5 image at the end of the network, which is fed into a global average pooling layer and then into a fixed 32-node dense layer with 9 outputs (one for each class). We use a softmax activation on the logit outputs. The block macro-architecture is defined by the striding, which is kept constant for each network and can be seen in Table 1.
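The position-dependent choices can be encoded for a black-box optimizer as, for example, a dictionary of per-block options; the lists below are illustrative placeholders rather than the exact choices used in our search:

```python
# Illustrative per-position search space. Blocks 0 and 1 are fixed (Sects. 5.1-5.2);
# the remaining blocks each choose a block type, a kernel size, and a filter count.
SEARCH_SPACE = {
    f"block_{i}": {
        "block_type": ["MobileNetV1", "MobileNetV2", "Kx1-1xK-DW"],  # plus position-specific extras
        "kernel_size": [3, 4, 5] if i < 5 else [3],                  # Sect. 5.4
        "filters": [8 * m for m in (2, 3, 4)],                       # multiples of 8, Sect. 5.5
    }
    for i in range(2, 12)
}
```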

5.1 K\(\times\)K first block

The input to the first block is a spectrogram which has no depth dimension. This means using a more expressive block like a 2D convolution with a kernel size of K\(\times\)K is computationally affordable. As such, like the MobileNet papers, we fix the first block type to be a K\(\times\)K 2D Convolution, where K is the kernel size (i.e., an integer parameter over which the NAS algorithm should search).

5.2 Second block

The second block of our network is very important in terms of computational load: the input image to this block is still quite large, since only one block has come before it to apply striding and reduce the dimensionality. The first block acts on a 2D spectrogram, but the second block acts on a three-dimensional image (i.e., its input now has depth) since the first block always applies more than one filter for all networks we consider in this work. As a result, we fix (hold constant) the second block's type to be a K\(\times\)1 depthwise convolution followed by a 1\(\times\)K depthwise convolution followed by a 1\(\times\)1 pointwise convolution, which is the least computationally intense block in our search space. We call this block the K\(\times\)1-1\(\times\)K-DW block (where DW stands for depthwise). Fixing the second block not only makes our search space smaller and thus more tractable to search, but also yields a search space where almost half of the networks meet our desired target of 1.25 mJ per inference.
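A minimal Keras sketch of the K\(\times\)1-1\(\times\)K-DW block; the use of batch normalization and ReLU after each convolution, and the padding/stride handling, are our assumptions and are not specified at this level of detail in the text:

```python
import tensorflow as tf
from tensorflow.keras import layers

def kx1_1xk_dw_block(x: tf.Tensor, k: int, filters: int, strides=(1, 1)) -> tf.Tensor:
    """Kx1 depthwise conv -> 1xK depthwise conv -> 1x1 pointwise conv."""
    x = layers.DepthwiseConv2D((k, 1), strides=(strides[0], 1), padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.DepthwiseConv2D((1, k), strides=(1, strides[1]), padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(filters, (1, 1), padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)
```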

5.3 Other blocks

The other ten blocks in our network use the block type choices of:

  • MobileNetV1

  • MobileNetV2

  • K\(\times\)1-1\(\times\)K-DW

  • MobileNetV2-Avg-Pool (only for stride (2, 2))

  • Identity (only for stride (1, 1)).

  • K\(\times\)1-1\(\times\)K (only for last block).

The last block of the network has a small input size, and as a result we also introduce the block choice of a K\(\times\)1 convolution followed by a 1\(\times\)K convolution. When the striding of a block is (1, 1), we also add the choice of the identity block. This is done to ensure the output image is always the same size for every architecture.

When using a striding of (2, 2), the original MobileNetV2 block does not contain a skip connection. The MobileNetV2 architecture is much larger than twelve blocks, and most blocks in the original paper have a skip connection. Our network macro-architecture uses striding in five of the twelve blocks, and we were motivated to add a parallel path to the MobileNetV2 block because we were worried information might otherwise be lost. The use of parallel paths (residual/skip connections) on blocks was popularized and explained in the ResNet paper [8]. We therefore use a variation of the MobileNetV2 block, inspired by ShuffleNet [28], that we call MobileNetV2-Avg-Pool: when the striding is (2, 2), the input to the block takes a parallel path through a 3\(\times\)3 average pooling layer with stride (2, 2), as shown in Fig. 5.

Fig. 5 MobileNetV2 block with an average pooling parallel path when strides are (2, 2). This block variation is called MobileNetV2-Avg-Pool and is inspired by ShuffleNet
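A Keras sketch of the MobileNetV2-Avg-Pool block of Fig. 5; the expansion factor, the 1\(\times\)1 projection on the pooled path to match channel counts, and the way the two paths are merged are our assumptions following the standard MobileNetV2 convention:

```python
from tensorflow.keras import layers

def mobilenetv2_avg_pool_block(x, k: int, filters: int, expansion: int = 6):
    """Inverted-residual block whose parallel path is a 3x3 average pool at stride (2, 2)."""
    skip = layers.AveragePooling2D(pool_size=(3, 3), strides=(2, 2), padding="same")(x)
    if skip.shape[-1] != filters:                          # match channels for the merge
        skip = layers.Conv2D(filters, (1, 1), padding="same")(skip)

    y = layers.Conv2D(expansion * x.shape[-1], (1, 1), padding="same", activation="relu")(x)
    y = layers.DepthwiseConv2D((k, k), strides=(2, 2), padding="same", activation="relu")(y)
    y = layers.Conv2D(filters, (1, 1), padding="same")(y)  # linear bottleneck projection
    return layers.Add()([skip, y])
```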

We experimented with squeeze-and-excite blocks such as those in MobileNetV3 [10]. We did not see any improvement in accuracy when adding MobileNetV3 blocks, only increases in memory footprint and energy usage. As a result, we left these blocks out of our search space.

5.4 Kernel sizes

We search for the kernel size among {3, 4, 5} for the first five blocks. After the fifth block, the input image is 13\(\times\)10 and as a result we fix the kernel size for later blocks to be 3.

5.5 Filter sizes

We choose among three options for the number of filters for every block in the network; these choices are position dependent. We use filter counts that are multiples of 8, since we saw energy usage increase on the Pixel 4 XL CPU when using filter counts that were not multiples of 8. With these filter choices, approximately one quarter of the architectures in our search space have memory footprints smaller than 60 kB and energy usages of less than 1.25 mJ per inference.

5.6 1D variant of the search space

We also ran a modification of this search space that reduces all block types to their one-dimensional counterparts (e.g., a K\(\times\)K convolutional kernel becomes a K\(\times\)1 kernel). The input spectrogram to the network is transformed by swapping the frequency axis with the depth axis, as was done in the TCResNet paper [5]. This modification reduces the overall computational requirements, so we expect low energy usage, but we were uncertain whether the one-dimensional variant of our search space would return accuracy similar to MobileNetV1/V2.
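A minimal sketch of this axis swap (the tensor shapes are illustrative): the [time, frequency, depth] spectrogram is reinterpreted so that the frequency bins become channels, after which 1D convolutions slide over time only.

```python
import tensorflow as tf

# 2D variant: the spectrogram enters the network as [batch, time, frequency, depth=1].
spec_2d = tf.random.normal([8, 196, 40, 1])

# 1D variant: swap the frequency axis into the depth (channel) axis so that
# Kx1 (i.e., 1D) convolutions slide over time with 40 input channels.
spec_1d = tf.squeeze(tf.transpose(spec_2d, perm=[0, 1, 3, 2]), axis=2)   # [8, 196, 40]
y = tf.keras.layers.Conv1D(filters=24, kernel_size=3, padding="same")(spec_1d)
```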

6 Vizier search algorithm

Our NAS is run on Vizier [7], a black-box optimization service that removes much of the software engineering work needed to efficiently run and analyze NAS experiments. Our NAS trains each network separately and employs early stopping to eliminate architectures unlikely to contend for a top final objective value [12]. The alternative NAS approach of training a single supernetwork containing all architecture possibilities (weight-sharing) saves computational resources, but there are no guarantees that the ranking of individual networks using shared weights is valid [27]. Our search space consists of fairly small networks that take one tenth of the time to train compared to a larger network like MobileNetV2; the typical network in our search has between 15 k and 40 k parameters (see Fig. 1), whereas MobileNetV2 uses more than 2000 k parameters. As a result, we did not explore NAS algorithms that train a single super (meta) network to avoid the expense of training candidate networks individually [1]. We instead accept a more computationally intense search, training each sampled network individually to three quarters of the full training time, with networks that appear unpromising stopped early.

Our search uses Vizier's algorithms to suggest candidate networks. Vizier suggests block types and their associated numbers of filters (filter counts) for the twelve blocks in our network (note, the first and second block types are fixed). For each suggested network, we calculate the memory footprint after converting the network to TFLite and use the random forest model described earlier to predict its energy usage. We then train the network for a fixed number of training steps, periodically evaluating validation-set accuracy, or until Vizier determines the network is not likely to be a top candidate (early stopping). After training each sampled architecture on the training dataset, we evaluate the reward on the evaluation dataset. The architectures with the best rewards are retrained five times for 33% longer on the same training set, with no early stopping, and retested on the evaluation dataset. The best three candidates from each NAS are then tested on the unseen test dataset; their accuracy is reported in this paper, and their energy usage is measured ten times on a mobile phone with the mean and standard deviation reported.
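Put together, one search trial looks roughly like the following pseudocode sketch; `optimizer`, `build_candidate`, and `train_and_eval` are placeholders for our infrastructure (not an actual Vizier API), and `nas_reward` and `tflite_int8_size_kb` refer to the sketches given earlier in Sects. 3 and 3.5:

```python
def run_trial(optimizer, energy_model, build_candidate, train_and_eval):
    """One NAS trial: suggest -> predict cost -> train with early stopping -> report reward.

    `optimizer` wraps the Vizier study, `energy_model` is the random forest of Sect. 3.4,
    `build_candidate` turns suggested parameters into a Keras model, and `train_and_eval`
    yields (step, validation_accuracy) pairs during training. All are placeholders.
    """
    params = optimizer.suggest()                                  # block types, kernels, filters
    model = build_candidate(params)

    memory_kb = tflite_int8_size_kb(model, rep_spectrograms=[])   # Sect. 3.5 sketch
    energy_mj = float(energy_model.predict([params.feature_vector])[0])  # Sect. 3.4

    reward = 0.0
    for step, val_accuracy in train_and_eval(model):
        reward = nas_reward(val_accuracy, energy_mj, memory_kb)   # Eq. 2 sketch
        optimizer.report_intermediate(params, step, reward)
        if optimizer.should_stop_early(params):                   # Sect. 6.1
            break
    optimizer.complete(params, reward)
```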

We employ two different search algorithms from Vizier for the NAS, one Bayesian and the other evolutionary, and run two thousand trials in each NAS experiment. Section 3 of Golovin’s [7] paper describes the Bayesian algorithm. The evolutionary HyperFirefly algorithm is an extension of the Firefly algorithm which uses regularization and particle swarms [26]. The Firefly hyperparameters are tuned by another Firefly algorithm every 50 iterations, using an objective metric equal to the best objective value over a sliding window of 50 iterations.

Vizier’s Bayesian algorithm slows down considerably (i.e., requires more time to produce a new suggestion) for our search space after a thousand trials. This is because a Gaussian process algorithm has \(O(N^3)\) complexity, where N is the number of parameters multiplied by the number of trials. As a result, we switch from using the Gaussian process algorithm to the HyperFirefly algorithm after one thousand trials. Combining evolutionary algorithms with Bayesian approaches has been done before [27]. The results we obtain seem to generally favor the HyperFirefly (evolutionary) algorithm over the Bayesian (hybrid) algorithm. This could be due to the evolutionary algorithm being more explorative for the first one thousand trials.

6.1 Early stopping algorithm

Vizier can decide to stop training a network early if it finds it unpromising. After Vizier suggests an architecture to train, the memory size and the predicted energy of the architecture are sent back to Vizier. In addition, the model in training is periodically evaluated and the intermediate evaluation accuracy is sent back to Vizier. If Vizier's early stopping model predicts with high confidence that the current trial (architecture) will result in an objective worse than the best seen so far, the trial is stopped early. Early stopping, or performance curve stopping, in Vizier is described in section 3.2 of Golovin's paper [7]. This rule uses a Gaussian Process (GP) with a custom kernel to regress the evaluation curves of all available trials, where each input feature to the GP is a time bucket in the time series.

Temporal spatial stopping (TSGP) learns a Gaussian Process model for each time series, using the exponential curve kernel ([20] Eq. 6). The model also learns a mean function, at the asymptote, for each time series, and a mapping from the trial parameters to kernel parameters, allowing cross-trial information sharing. This allows Vizier to make automated stopping predictions about each time series, which are informed by both a strong exponential prior, and the trial parameters.

We compared no early stopping, exponential decaying early stopping with default parameters, and exponential decaying early stopping with TSGP-learned parameters. Experimentation with the three methods returned very similar rewards. The TSGP early stopping used the least computation (50% less than forgoing early stopping), and as a result we employed it for our search.

7 Results

Table 2 NAS results and benchmarks

Table 2 reports the energy usage, memory size and accuracy of the best NAS results and two types of baseline models: MobileNet and TCResNet. We include MobileNetV1/V2 as baseline models since they are large relative to our search space (i.e., they have more capacity) and have been applied successfully to audio classification tasks [24]. The MobileNetV2 benchmark uses an expansion parameter of six. At the other end of the model size spectrum, we include TCResNet models, which are known to perform well on speech command recognition while requiring few FLOPs and little static memory. We use two different TCResNet model sizes, the TCResNet8 with a width multiplier of one (labeled TCResNet8-1) and the TCResNet14 with a width multiplier of 1.5 (TCResNet14-1.5). Note, the benchmark models had to be slightly modified from their original paper versions to work with a 196\(\times\)40 spectrogram input, since MobileNets are designed to run on square input images and the TCResNet is designed to run on a 96\(\times\)40 spectrogram. The baseline models are all quantized with the same int-8 quantization-aware training used for the NAS architectures. Table 2 reports each model's mean task accuracy over five training runs to remove any bias from the initial starting condition. NAS runs done with the less harsh energy penalty \(b'=\frac{0.02}{1.75}\) are marked with an accent suffix (b') in the table. We visualize the NAS results from Table 2 in Fig. 6.

Fig. 6 Mean energy per inference plotted against the mean quantized accuracy

NAS-HyperFirefly, the best network when using the harsh energy penalty (the HyperFirefly suffix means the HyperFirefly regularized evolution with particle swarm algorithm was used), achieves slightly worse accuracy than MobileNetV1 but better accuracy than MobileNetV2. Compared to MobileNetV2, it uses 50\(\times\) less memory and 10\(\times\) less energy. NAS-HyperFirefly-b', which was run with the less harsh energy penalty, uses more energy than NAS-HyperFirefly but also achieves better accuracy; it achieves slightly better mean accuracy than MobileNetV1, the baseline model with the best mean task accuracy.

The networks using one-dimensional convolutions, as in TCResNet, tend to use very little energy (Table 2) but all return poor accuracy. 1D-NAS-Bayesian is the best NAS result when using only one-dimensional convolutions. It achieves significantly better accuracy than both TCResNet baselines with only slightly more energy usage than TCResNet8-1.0.

The mean energy per inference measured on the Pixel 4 XL CPU is not far from the predictions of the random forest model used during the NAS. For example, NAS-HyperFirefly uses 1.27 mJ per inference on average where the RF prediction used by Vizier during the NAS was 1.35 mJ. Similarly, NAS-HyperFirefly-b' uses 1.72 mJ per inference on average where the RF prediction was 1.69 mJ.

8 Discussion

Table 1 shows the best performing network structure found by NAS-Bayesian. Of interest is that the network uses kernel size 5 twice: the larger receptive field must allow the network to improve task accuracy despite being more computationally expensive. The network uses both MobileNetV1 and V2 blocks as well as the new MobileNetV2-Avg-Pool and K\(\times\)1-1\(\times\)K-DW blocks introduced in this paper. This block-type heterogeneity agrees with what many NAS authors have found, namely that it can be beneficial to have different types of block structures [3]. It also shows that MobileNetV2 blocks are not always superior to MobileNetV1 blocks when accuracy, energy and memory usage are all taken into account.

The number of output filters in the first block of the network creates a computational bottleneck when using 2D convolutions on a spectrogram. Table 1 shows the FLOPs of each block in the optimal NAS-Bayesian network: the second block uses 2 M FLOPs even though it has only 8 input filters, a kernel size of 3, and 24 output filters. If the number of input filters to the second block were instead 24, the FLOP count would triple to 6 M.

The computational burden of the number of output filters in the first block is also seen in the energy usage of the network. The two most important features of our RF energy model are the number of filters in the first block (59%) and the total number of FLOPs in the network (30%); each remaining feature contributes 2% or less. This underlines the importance of the number of output filters in the first block, which creates a computational bottleneck when the network takes a spectrogram input.

MobileNetV1 performs better than MobileNetV2 (expansion of 6) on this dataset. One of the main differences between the two is a 1000-node dense embedding layer at the end of MobileNetV1. The lack of this embedding layer may partly explain the poorer performance of MobileNetV2, which uses 700 kB less memory than MobileNetV1.

We note that the poor accuracy of the 1D NAS variants was to be expected, since the input image to the second block is now a 2D image (no longer 3D). However, 1D-NAS-Bayesian uses 0.26 mJ per inference compared to 1.3 mJ for the NAS-Bayesian model. For some applications, such as those deployed on batteries smaller than a mobile phone's (earphones, smartwatches), we can envision the 1D model being favored because it uses five times less energy per inference. For such low-power use cases, we suggest further research into expanding the search space of 1D convolutional blocks and/or combining this approach with model compression techniques (e.g., weight pruning).

Our search took roughly 15,000 GPU hours for the 2D NAS search sampling 2000 candidate networks. We used about one fifth that time for the 1D NAS variants. We did not experiment with other NAS methods to reduce the search’s computational burden, which is something we would like to explore in the future.

The NAS approach presented in this paper succeeds in finding a model (NAS-HyperFirefly) that is 10\(\times\) more energy efficient than MobileNetV2 while improving absolute mean accuracy by 0.94%. Compared to MobileNetV1, we find a 4\(\times\) more energy-efficient network (NAS-HyperFirefly-b') that uses more than 75\(\times\) less memory and achieves a 0.03% improvement in absolute mean accuracy. For always-on audio classification, we have shown that incorporating on-device energy usage into the NAS reward function through a weighted sum successfully optimizes the combination of energy efficiency, accuracy and memory usage of always-on models.

We believe this approach is general enough to be transferable to other domains (e.g., portable biomedical devices, video processing, sensor fusion). Finding models whose energy and memory usage are as small as possible for a given accuracy (i.e., Pareto optimal) is important for deploying larger and better networks on lightweight batteries and on small, cheap SRAM chips. The use of Vizier in this study removes significant software barriers to NAS adoption, and we advocate its use in future studies due to the ease of use of its API.