1 Introduction

It is challenging for the hearing impaired to identify important sounds such as running water, dogs barking, and crying babies. Sound event classification systems typically feed spectrograms into image classification networks with great results [9]. Much of the sound event classification work focuses on a paradigm where audio is sent over an internet connection to a large neural network (such as ResNet50 [8] with 20+ million parameters) that classifies the sound in the cloud. This approach relies on a good internet connection. We instead search for a neural network that runs locally, continuously, for a full day on a battery-powered device (e.g., smartwatch, earphones, phone). This requires a network that is energy-efficient enough not to drain the device's battery prematurely and small enough to fit into device memory.

Much of the platform-aware neural architecture search (NAS) literature has focused on inference time (latency) as a user experience requirement for image classification [1, 21]. We argue instead that energy usage is the more important limiting factor for always-on models.

An always-on audio model calculating an inference once every second makes 86,400 inferences per day. As a result, the energy required per model inference is a critical factor when searching for the best architecture. A smartphone might have a battery capacity of around 51 kJ (e.g., Google Pixel 4 XL), a smartwatch around 3.6 kJ (e.g., Fitbit Versa 3), and earphones around 0.7 kJ (e.g., Pixel Buds 2). For comparison, the baseline solution of deploying a high-performance network like MobileNetV2 [18] on a Pixel 4 XL big core CPU uses 14 mJ per inference (1.21 kJ per day) when running sound event classification on spectrograms. A network of that size will clearly consume a small device's battery capacity quickly.
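As a rough illustration of these budgets (a minimal sketch using only the figures quoted above; the battery capacities are nominal values), the daily cost can be computed as follows:

```python
# Daily energy budget of an always-on model, using the figures quoted above.
INFERENCES_PER_DAY = 24 * 60 * 60            # one inference per second -> 86,400 per day
ENERGY_PER_INFERENCE_MJ = 14.0               # MobileNetV2 on a Pixel 4 XL big core CPU

daily_energy_kj = INFERENCES_PER_DAY * ENERGY_PER_INFERENCE_MJ / 1e6   # mJ -> kJ, ~1.21 kJ

battery_kj = {"Pixel 4 XL": 51.0, "Fitbit Versa 3": 3.6, "Pixel Buds 2": 0.7}
for device, capacity in battery_kj.items():
    print(f"{device}: {100 * daily_energy_kj / capacity:.1f}% of battery per day")
```

On the smartwatch and earphone budgets above, this baseline alone would consume roughly a third of, or more than, the entire battery each day.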

We introduce the first neural architecture search that incorporates the energy usage of the implementation. Our search also minimizes the memory footprint of the neural network which, like energy usage, can be a limiting factor for model deployment on mobile and edge computing devices where total SRAM is limited. Our NAS builds upon related hardware-constrained searches (i.e., searches constrained by hardware limitations such as memory). To find networks that also optimize for low energy and memory usage, we incorporate these constraints into our reward function, which we discuss in Sect. 3.

We would like to guide our NAS with real hardware energy measurements, but at the scale we are operating (thousands of evaluations per task) this is prohibitive. In this work, we train a random forest model on 10,000 candidate architectures from our search to accurately predict the energy usage of a candidate architecture. We choose a random forest model since random forests are known to work well across a variety of problem domains [2], and we found that it outperforms a linear model. After running our search, we run the top three performing neural architectures on Pixel 4 XL CPUs five times to get average energy usage statistics.

We benchmark our work by comparing it with the efficiency of a state-of-the-art network, in this case MobileNetV2, which performed well on the related task of 2D image classification (note, in this task we are classifying 2D spectrograms) [18]. Our NAS focuses on an audio classification task whose task and dataset are defined in Sect. 4. We constrain each candidate neural network to a maximum of twelve sequential operations. For each sequential position in our candidate network, our search suggests an operation (e.g., either a 3\(\times\)3 convolution or a 5\(\times\)5 convolution). The possible operations in a NAS (referred to here as the search space) are often defined based on what operations are found in state-of-the-art models for the task/dataset (in this case MobileNetV2). We discuss the search space further in Sect. 5.

Our search algorithms, the evolutionary and Bayesian algorithms defined in Sect. 6, suggest collections of block operations that define candidate neural networks. Each search algorithm seeks architectures that optimize a reward function which scores every candidate. We use early stopping, halting the training of unpromising architecture candidates to reduce the computational burden of the search.

In sum, we present a simple-to-implement neural architecture search that targets on-device energy efficiency and low memory usage for always-on audio models in order to satisfy the constraints outlined above. Our main contributions are:

  • We introduce a multi-objective neural architecture search that optimizes not only accuracy but also memory and energy usage. We employ both Bayesian and evolutionary search algorithms, with the evolutionary algorithm returning slightly better results.

  • We train a random forest model to predict the energy usage of candidate neural network architectures in our search space. The model achieves an RMSE of 0.07 mJ per inference, a small fraction of the typical energy usage per inference in our search space. This allows us to perform an architecture search that includes energy usage estimates without the added complication of including hardware in the search loop. After the search, we verify the performance of the winning implementations on hardware to confirm the result.

  • We evaluate our method on a MobileNet-based search space and find a model with accuracy slightly better than MobileNetV2 with 10\(\times\) less energy usage, and 50\(\times\) smaller memory footprint (Table 2).

  • We show FLOPs are not a good proxy for energy usage, even on a mobile CPU. Inference time (latency) is a better but still imperfect proxy, because power draw is not consistent across neural networks: memory accesses and arithmetic operations consume different amounts of power (Fig. 2).

  • Our search identifies a computational bottleneck created by combining spectrograms with 2D convolutional blocks (Table 1), which is the typical architecture for audio classification. We show that an alternative approach of swapping the frequency axis with the depth axis of the spectrogram and using 1D convolutional blocks reduces energy usage but significantly underperforms on accuracy.

2 Related work

Several papers have explored neural architecture searches for neural networks intended for mobile devices. In particular, the MNAS paper of Tan [21] included the inference time (latency) of each architecture in the search reward function. Their search kept a mobile phone in the search loop: candidate architectures were run on the phone to return latency measurements.

In the TuNAS paper of Bender [1], the authors avoided using hardware in the search loop to reduce software/hardware engineering requirements, since it is significant work to connect mobile phones and measurement devices to the cloud where the NAS takes place. They instead trained a linear model to predict the inference time of each architecture suggested by their search and used it to rank candidate architectures.

We, on the other hand, target energy per inference instead of inference time and also include a third term, memory usage, in our reward function. Similar to TuNAS, we avoid using hardware in the loop and instead train a model (a random forest rather than a linear model, since it performs better) to predict the energy usage of each network architecture in our search space.

TuNAS' search algorithm creates a super (meta) network that includes all possible architectures in one network and drops out entire paths during training. This search algorithm is very efficient since only one (large) architecture needs to be trained instead of many candidate architectures trained separately. After training, a single path in the supernetwork is unmasked so that a single network's performance can be scored using the supernetwork's trained weights. However, for the trained weights of the supernetwork to resemble the weights of the standalone architecture, a significant fraction of paths must be dropped out (turned off) during training, which can make training unstable, an observation the paper's authors also made. This instability is the reason we instead opted to use Vizier's algorithms, at the likely cost of additional computation time during the search.

MNAS and TuNAS use reinforcement learning (RL) to suggest new architecture candidates, whereas we use both Bayesian and genetic algorithms. We made this decision because the NAS literature has shown that evolutionary algorithms yield results similar to RL for image classification tasks [17], and we expect our task of classifying 2D spectrograms to behave similarly. The Bayesian and genetic algorithms are also easier to set up out of the box.

Our search space is most similar to that of Wu's FBNet paper [23]. The FBNet authors used a search space of different block operations, searching over the convolution kernel size, number of filters, and expansion parameter of each block in the architecture. We similarly search over the kernel size and number of filters. However, FBNet only searches over MobileNetV2-like operation blocks, whereas we also include MobileNetV1 blocks and other smaller block types, which we hypothesize might be more energy-efficient. Like FBNet, TuNAS also builds a MobileNetV2-based search space. We use a smaller maximum network size than TuNAS and FBNet since we target more energy- and memory-efficient networks. We summarize our NAS search features and compare them to related hardware-constrained searches in Table 1. One difference not shown in the table (to keep it readable) is that we focus on audio classification rather than image classification.

Table 1 Comparison of related hardware-constrained neural architecture searches

3 Optimization criteria

We seek a neural network that balances energy efficiency and memory usage while still achieving state-of-the-art accuracy. One option for our search would be to optimize accuracy while treating memory usage and energy usage as hard constraints. This yields Eq. 1, where x is the evaluation dataset, ACC is the accuracy of a candidate network h in our NAS search space H, \(\text {MEM}\) is the memory footprint, and \(\text {ENERGY}\) is the energy usage per inference of the network.

$$\begin{aligned} \max _{h \in H} \quad&\mathrm{{ACC}}(h(x))\\ \text {s.t.} \quad&\mathrm{{MEM}}(h) \le M_{0}\\&\mathrm{{ENERGY}}(h) \le E_{0} \end{aligned}$$
(1)

As noted by the MNASNet authors, this approach maximizes a single metric and does not yield multiple Pareto optimal solutions [21]. We are looking for Pareto optimal models (i.e., models with the maximum accuracy for a given memory and energy usage). To approximate the Pareto optimal solutions, we combine these optimization constraints into a single objective via a weighted sum (note, MNASNet used a weighted product). In addition, since we do not need the absolute lowest energy or memory usage, we only apply the penalties above chosen thresholds. The reward in Eq. 2 proportionally penalizes larger memory sizes and energy usages.

$$\begin{aligned} R = \mathrm{{ACC}}(h(x)) - b\,\max (0, \text {ENERGY}(h)-E_0) - c\,\max (0, \text {MEM}(h)-M_0) \end{aligned}$$
(2)

Memory and energy usage is penalized with a ReLU function that activates after the thresholds, \(M_0\) and \(E_0\), respectively, are crossed. In this study, we use an energy threshold, \(E_0\), of 1.25 mJ per inference, which amounts to slightly more than 0.2% of the Pixel 4 XL battery when the network is running one inference per second all day. Above the energy threshold, \(E_0\), we explored two different slopes: b and b’. The harsher penalty b is set to \(\frac{0.02}{0.75~\mathrm{{mJ}}}\). Thus, above this energy threshold a 0.75 mJ increase in energy per inference must give at least a 2% increase in accuracy for the same reward. The less harsh penalty sets \(b'=\frac{0.02}{1.75~\mathrm{{mJ}}}\).

We use a memory size threshold, \(M_0\), of 60 kB, above which larger memory sizes are penalized with slope \(c=\frac{0.02}{30~\text {kB}}\). The 60 kB threshold is chosen to allow the network to be deployed on a wide variety of SRAM limited devices (e.g., smartphones, smartwatches and earphones) which are expected to have several machine learning applications running simultaneously. The chosen slope means that a 90 kB model must have at least 0.02 accuracy points more than a 60 kB model for it to have a better reward. In the next five subsections, we discuss the quantized accuracy, measuring physical energy usage, approximate energy usage metrics, how we approximate energy usage during NAS using a random forest, and finally the memory usage in the reward function in Eq. 2.
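The following is a minimal sketch of the reward in Eq. 2 using the thresholds and slopes above (the function and argument names are ours, for illustration only):

```python
def nas_reward(quantized_accuracy: float, energy_mj: float, memory_kb: float,
               e0_mj: float = 1.25, m0_kb: float = 60.0,
               b: float = 0.02 / 0.75, c: float = 0.02 / 30.0) -> float:
    """Eq. 2: accuracy minus ReLU-shaped energy and memory penalties.

    b = 0.02/0.75 per mJ is the harsher energy slope (use 0.02/1.75 for b');
    c = 0.02/30 per kB penalizes memory above the 60 kB threshold.
    """
    energy_penalty = b * max(0.0, energy_mj - e0_mj)
    memory_penalty = c * max(0.0, memory_kb - m0_kb)
    return quantized_accuracy - energy_penalty - memory_penalty
```

For example, a candidate with 0.90 quantized accuracy, 2.0 mJ per inference, and 90 kB of memory receives a reward of 0.90 - 0.02 - 0.02 = 0.86.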


3.1 Quantized accuracy

Fig. 1 Quantized network performance. Left: categorical accuracy versus int-8 quantized TFLite accuracy for two thousand candidate architectures sampled by the Vizier Bayesian (hybrid) algorithm. Right: parameter count plotted against the quantized TFLite memory size

To measure the performance of the candidate architectures, we use the accuracy of the 8-bit integer quantized TFLite model. The network is quantized using integer-8 quantization-aware training with the Tensorflow framework [4] to minimize the memory and energy usage. There is generally good agreement between the non-quantized accuracy and the quantized accuracy (correlation of 0.955), but there are some outliers (up to 6.5% disagreement in accuracy) as seen in Fig. 1. Since we are targeting on-device inference, we use the quantized accuracy in our reward function.

3.2 Physical energy measurements

We use a Monsoon power monitor [14] to measure the average power draw of a phone (without battery) running a candidate architecture. The energy per network inference is platform dependent, and thus for this paper we focus on the big core CPU of the Pixel 4 XL. During the measurement, we lock the CPU core frequency and use a single thread. The average inference time is measured using the TFLite benchmarking tool. We use these energy measurements in three ways: to check the approximations others have used (Sect. 3.3), to train an approximate model to help guide the NAS (Sect. 3.4), and finally to verify the energy measurements shown in this paper (by repeating the measurement 5 times and reporting the mean and standard deviation).

3.3 Inference time and FLOPs as energy proxies

Other papers have used FLOPs (the total number of floating point operations of the unquantized model) or inference time (latency) to approximate energy usage [12, 13]. We discuss the drawbacks of these approaches. Figure 2 shows that a network with a FLOP count of 10 million might use anywhere between 0.6 mJ and 1.5 mJ per inference. This agrees with reports from several authors that FLOP count is a poor proxy for on-device energy usage, likely because memory accesses are not accounted for in the FLOP count [25].

Fig. 2 Two approximations to the total energy consumption using FLOPs (left) and inference time (right). These scatter plots are based on 15,000 randomly selected architectures in our search space measured on a Pixel 4 XL

We find the correlation between the average inference time and the average energy per inference (which is simply the average inference time multiplied by the average power) is 0.989 over our search space (Fig. 2). Despite the good correlation, an average inference time of 0.85 ms can correspond to anywhere between 1.03 and 1.30 mJ per inference. This variation in energy usage is caused by variation in power draw between small architectural changes. We think these changes have an outsized influence on energy usage because of differences in parallelism and CPU cache optimization: within the same unit of time, the CPU may operate at a different fraction of its full capacity, for example due to differing degrees of vectorized instructions. Note, the inference time of each network is also sometimes referred to as the latency of the model in the computer vision NAS literature.

3.4 Approximating energy via a random forest for NAS

In this paper, we have access to physical power measurements on Pixel devices. Measuring the energy usage of each NAS candidate directly is a difficult software engineering task, since the NAS runs in the cloud and would need to be connected to physical hardware that automates loading the network, running it, and measuring its average energy. We instead opt to train a model that can accurately predict the energy usage of a given network and use it to estimate the energy usage of each NAS candidate rather than taking physical measurements. This energy estimate is then fed into our reward function (Eq. 2) to help us rank NAS candidates. At the end of our search, we gather the energy measurements of the best candidates on real hardware and report them in this paper.

Note, the alternative approach of using inference time as an energy proxy would require us either to connect hardware in an automated way to our NAS [21], which as discussed is a difficult engineering problem, or to build a model of inference time [1] to use during our search. We instead select a more direct route that avoids a complicated software/hardware engineering connection and train a model directly on energy usage data to predict energy usage.

We measure the average energy per inference of 15,000 architectures in our search space on the big core CPU of the Pixel 4 XL phone to train a model that predicts energy usage for a given network. We employ a random forest (RF) model to predict the energy usage of models in our search space. We also tried a linear model, but it performed worse than the random forest at predicting the energy usage of candidate architectures in our search space. The choice of a random forest model was motivated by the fact that decision trees are universal approximators (they can approximate any function) and random forests have been applied out of the box to a wide variety of problems successfully [2, 16]. We used tenfold cross-validation to tune the random forest hyperparameters.

Fig. 3 Random forest model fit to the energy per inference of networks in the search space

The random forest model takes as input the architecture parameters (e.g., kernel sizes/filter types of each block) as well as network-level parameters, the total FLOP count and the TFLite memory size. The RF model has an \(R^{2}=0.92\) and an RMSE of 0.07 mJ per inference (Fig. 3). For reference, running a model with 1.3 mJ per inference from our search space five times on a Pixel phone gives an energy standard deviation (i.e., measurement noise) of 0.068 mJ. The RMSE is thus close to the measurement noise from our phones, suggesting that the RF model is a good approximation for the energy usage of a network architecture. This allows us to perform the NAS exploration fully on servers, without remeasuring the energy usage of each network architecture permutation on the phone. We also tried a linear regression model since the authors of TuNAS had success with a linear model [1]. The linear model achieved an \(R^2\) of 0.89 and an RMSE of 0.089 mJ per inference, both significantly worse than the RF model.
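A minimal sketch of how such a predictor can be trained with scikit-learn, assuming the measured architectures are stored as a feature matrix (per-block kernel/filter/block-type choices plus total FLOPs and TFLite size) with the measured energy in mJ as the target; the file names and hyperparameter grid are illustrative, not the values used in this work:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# X: one row per measured architecture (per-block kernel sizes, filter counts,
#    block-type ids, plus total FLOPs and TFLite model size).
# y: measured mean energy per inference on the Pixel 4 XL big core CPU, in mJ.
X = np.load("arch_features.npy")   # hypothetical feature file
y = np.load("energy_mj.npy")       # hypothetical target file

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10, 20]},
    cv=10,                                       # tenfold cross-validation, as in the text
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)
energy_model = search.best_estimator_            # used to score NAS candidates
print("Cross-validated RMSE (mJ):", -search.best_score_)
```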

3.5 Memory footprint

The SRAM available on small devices is somewhere between 10 kB and 1 MB, and this memory is shared among multiple applications. In this search, we use the TFLite executable size, i.e., the static memory of the application, in our objective. In Fig. 1, we compare the TFLite executable size to the parameter count of the network. Despite the parameter count being well correlated (\(R^{2}\) = 0.94) with the quantized memory size of the network, it is still far from a perfect proxy: a parameter count of thirty thousand can correspond to anywhere between 50 and 65 kB of memory. The discrepancy is due to the integer-8 quantization-aware training with the Tensorflow framework [4] that we employ to minimize the memory usage.
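A minimal sketch of how this memory metric can be obtained, assuming a trained Keras model; the helper name is ours and the representative-dataset generator is a placeholder:

```python
import tensorflow as tf

def tflite_int8_size_kb(model: tf.keras.Model, rep_spectrograms) -> float:
    """Convert a Keras model to an int-8 TFLite flatbuffer and return its size in kB,
    i.e., the MEM(h) term used in the reward of Eq. 2."""
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = lambda: (
        [spec[None, ...].astype("float32")] for spec in rep_spectrograms)
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8
    tflite_model = converter.convert()
    return len(tflite_model) / 1024.0
```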

4 Sound event classification dataset

We use the AudioSet dataset, which contains over 2 million human-annotated 10-second sound clips derived from YouTube videos [6]. The AudioSet ontology contains more than 500 classes, but we use a subset of them to limit the complexity of our task. Specifically, we chose labels that mimic Sound Notifications on Android. The eight positive classes are (parentheses indicate the original AudioSet labels when multiple labels were mapped to one):

  • Alarms (fire alarm, smoke alarm, CO alarm)

  • Baby crying

  • Dog barking (dog, bark, yip, howl, bow-wow, growling)

  • Door knocking

  • Doorbell (doorbell, ding-dong)

  • Phone ringing

  • Sirens (emergency vehicle, police car, ambulance, fire truck)

  • Water running

We map all other classes in AudioSet to a single negative class. This makes the dataset somewhat challenging, since the negative examples are all real sound events (e.g., guitar playing) and not simply low-volume noise. In total, we have 9 classes, one of which is the negative class. We use the original train/evaluation/test split from AudioSet and ensure that our training/evaluation/test data comprises 50% negative-class examples. The log-mel spectrograms of the data are computed and augmented with SpecAugment [15]. We believe this mapping of AudioSet is a representative task for always-on sound event classification, while the dataset is also large enough for a NAS study.
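A minimal sketch of a log-mel front end of this kind, assuming mono 16 kHz audio; the frame and mel parameters are illustrative placeholders rather than the exact settings behind our 196\(\times\)40 spectrograms (SpecAugment time/frequency masking is then applied to the result during training):

```python
import tensorflow as tf

def log_mel_spectrogram(waveform: tf.Tensor, sample_rate: int = 16000,
                        frame_length: int = 400, frame_step: int = 160,
                        num_mel_bins: int = 40) -> tf.Tensor:
    """[num_samples] float32 waveform -> [num_frames, num_mel_bins] log-mel spectrogram."""
    stft = tf.signal.stft(waveform, frame_length=frame_length, frame_step=frame_step)
    magnitude = tf.abs(stft)
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=num_mel_bins,
        num_spectrogram_bins=magnitude.shape[-1],
        sample_rate=sample_rate)
    mel = tf.matmul(magnitude, mel_matrix)
    return tf.math.log(mel + 1e-6)   # small offset avoids log(0)
```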

5 Search space

Fig. 4 A block diagram of our network architecture search process

The art of a neural architecture search lies in efficiently exploring a good search space. The search space defines the possible neural network architectures. A standard approach to defining a search space is to first find a model that achieves good performance on the dataset and task of interest and decompose that model into its component blocks (e.g., a good-performing network with a 5\(\times\)5 convolution, 3\(\times\)3 max-pool and 3\(\times\)3 convolution with skip connection would decompose into a search space that includes these three operations) [23]. We make use of MobileNetV1 [11] and MobileNetV2, which popularized depthwise separable convolutions, as benchmark models and use their two namesake block operations in our search space. We fix our network size to be twelve sequential blocks. We chose the number twelve after some (though not exhaustive) experimentation on how many blocks would be required to achieve MobileNetV2-like accuracy.

For every sequential block position in our network, we search for the block type, the number of output filters, and the convolution kernel size. The search process is illustrated in Fig. 4. Since the optimal parameters for each block depend on its position (e.g., a larger kernel is not so useful when the image size becomes very small toward the end of the network), we make the possible choices position dependent. Each of the twelve blocks has between nine and thirty options to choose from. To ensure the image size at the end of the network is the same for all possible candidates, we fix the striding for all candidates to give a 7\(\times\)5 image at the end of the network, which is fed into a global average pooling layer and then into a fixed 32-node dense layer with 9 outputs (one for each class). We use a softmax activation on the logit outputs. The block macro-architecture is defined by the striding, which is kept constant for each network and can be seen in Table 1.
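The position-dependent choices can be encoded for a black-box optimizer as, for example, a dictionary of per-block options; the lists below are illustrative placeholders rather than the exact choices used in our search:

```python
# Illustrative per-position search space. Blocks 0 and 1 are fixed (Sects. 5.1-5.2);
# the remaining blocks each choose a block type, a kernel size, and a filter count.
SEARCH_SPACE = {
    f"block_{i}": {
        "block_type": ["MobileNetV1", "MobileNetV2", "Kx1-1xK-DW"],  # plus position-specific extras
        "kernel_size": [3, 4, 5] if i < 5 else [3],                  # Sect. 5.4
        "filters": [8 * m for m in (2, 3, 4)],                       # multiples of 8, Sect. 5.5
    }
    for i in range(2, 12)
}
```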

5.1 K\(\times\)K first block

The input to the first block is a spectrogram which has no depth dimension. This means using a more expressive block like a 2D convolution with a kernel size of K\(\times\)K is computationally affordable. As such, like the MobileNet papers, we fix the first block type to be a K\(\times\)K 2D Convolution, where K is the kernel size (i.e., an integer parameter over which the NAS algorithm should search).

5.2 Second block

The second block of our network is very important in terms of computational load: the input image to this block is still quite large, since only one block has come before it to apply striding and reduce the dimensionality. The first block acts on a 2D spectrogram, but the second block acts on a three-dimensional image (i.e., its input now has depth) since the first block always applies more than one filter for all networks we consider in this work. As a result, we fix (hold constant) the second block's type to be a K\(\times\)1 depthwise convolution followed by a 1\(\times\)K depthwise convolution followed by a 1\(\times\)1 pointwise convolution, which is the least computationally intense block in our search space. We call this block the K\(\times\)1-1\(\times\)K-DW block (where DW stands for depthwise). Fixing the second block not only makes our search space smaller and thus more tractable to search, but also yields a search space where almost half of the networks meet our desired target of 1.25 mJ per inference.
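A minimal Keras sketch of the K\(\times\)1-1\(\times\)K-DW block; the use of batch normalization and ReLU after each convolution, and the padding/stride handling, are our assumptions and are not specified at this level of detail in the text:

```python
import tensorflow as tf
from tensorflow.keras import layers

def kx1_1xk_dw_block(x: tf.Tensor, k: int, filters: int, strides=(1, 1)) -> tf.Tensor:
    """Kx1 depthwise conv -> 1xK depthwise conv -> 1x1 pointwise conv."""
    x = layers.DepthwiseConv2D((k, 1), strides=(strides[0], 1), padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.DepthwiseConv2D((1, k), strides=(1, strides[1]), padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(filters, (1, 1), padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)
```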

5.3 Other blocks

The other ten blocks in our network use the block type choices of:

  • MobileNetV1

  • MobileNetV2

  • K\(\times\)1-1\(\times\)K-DW

  • MobileNetV2-Avg-Pool (only for stride (2, 2))

  • Identity (only for stride (1, 1)).

  • K\(\times\)1-1\(\times\)K (only for last block).

The last block of the network has a small input size, and as a result we also introduce the block choice of a K\(\times\)1 convolution followed by a 1\(\times\)K convolution. When the striding of a block is (1, 1), we also add the choice of the identity block. This is done to ensure the output image is always the same size for every architecture.

When using a striding of (2, 2), the original MobileNetV2 block does not contain a skip connection. The MobileNetV2 architecture is much larger than twelve blocks, and most blocks in the original paper have a skip connection. Our network macro-architecture uses striding in five of the twelve blocks, and we were motivated to add a parallel path to the MobileNetV2 block because we were worried information might otherwise be lost. The use of parallel paths (residual/skip connections) on blocks was popularized and explained in the ResNet paper [8]. We therefore use a variation of the MobileNetV2 block, inspired by ShuffleNet [28], that we call MobileNetV2-Avg-Pool: when the striding is (2, 2), the input to the block takes a parallel path through a 3\(\times\)3 average pooling layer with stride (2, 2), as shown in Fig. 5.

Fig. 5 MobileNetV2 block with an average pooling parallel path when strides are (2, 2). This block variation is called MobileNetV2-Avg-Pool and is inspired by ShuffleNet
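A Keras sketch of the MobileNetV2-Avg-Pool block of Fig. 5; the expansion factor, the 1\(\times\)1 projection on the pooled path to match channel counts, and the way the two paths are merged are our assumptions following the standard MobileNetV2 convention:

```python
from tensorflow.keras import layers

def mobilenetv2_avg_pool_block(x, k: int, filters: int, expansion: int = 6):
    """Inverted-residual block whose parallel path is a 3x3 average pool at stride (2, 2)."""
    skip = layers.AveragePooling2D(pool_size=(3, 3), strides=(2, 2), padding="same")(x)
    if skip.shape[-1] != filters:                          # match channels for the merge
        skip = layers.Conv2D(filters, (1, 1), padding="same")(skip)

    y = layers.Conv2D(expansion * x.shape[-1], (1, 1), padding="same", activation="relu")(x)
    y = layers.DepthwiseConv2D((k, k), strides=(2, 2), padding="same", activation="relu")(y)
    y = layers.Conv2D(filters, (1, 1), padding="same")(y)  # linear bottleneck projection
    return layers.Add()([skip, y])
```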

We experimented with squeeze-and-excite blocks such as those in MobileNetV3 [10]. We did not see any improvement in accuracy when adding MobileNetV3 blocks, only increases in memory footprint and energy usage. As a result, we left these blocks out of our search space.

5.4 Kernel sizes

We search for the kernel size among {3, 4, 5} for the first five blocks. After the fifth block, the input image is 13\(\times\)10 and as a result we fix the kernel size for later blocks to be 3.

5.5 Filter sizes

We choose among three options for the number of filters for every block in the network; these choices are position dependent. We use filter counts that are multiples of 8, since we saw energy usage increase on the Pixel 4 XL CPU when using filter counts that were not multiples of 8. With these filter choices, approximately one quarter of the architectures in our search space have memory footprints smaller than 60 kB and energy usages of less than 1.25 mJ per inference.

5.6 1D variant of the search space

We also ran a modification of this search space that reduces all block types to their one-dimensional counterparts (e.g., a K\(\times\)K convolutional kernel becomes a K\(\times\)1 kernel). The input spectrogram to the network is transformed by swapping the frequency axis with the depth axis, as was done in the TCResNet paper [5]. This modification reduces the overall computational requirements, so we expect low energy usage, but we were uncertain whether the one-dimensional variant of our search space would return accuracy similar to MobileNetV1/V2.
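A minimal sketch of this axis swap (the tensor shapes are illustrative): the [time, frequency, depth] spectrogram is reinterpreted so that the frequency bins become channels, after which 1D convolutions slide over time only.

```python
import tensorflow as tf

# 2D variant: the spectrogram enters the network as [batch, time, frequency, depth=1].
spec_2d = tf.random.normal([8, 196, 40, 1])

# 1D variant: swap the frequency axis into the depth (channel) axis so that
# Kx1 (i.e., 1D) convolutions slide over time with 40 input channels.
spec_1d = tf.squeeze(tf.transpose(spec_2d, perm=[0, 1, 3, 2]), axis=2)   # [8, 196, 40]
y = tf.keras.layers.Conv1D(filters=24, kernel_size=3, padding="same")(spec_1d)
```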

6 Vizier search algorithm

Our NAS is run on Vizier [7], a black-box optimization service that removes much of the software engineering work needed to efficiently run and analyze NAS experiments. Our NAS trains each network separately and employs early stopping to eliminate architectures unlikely to contend for a top final objective value [12]. The alternative NAS approach of training a single supernetwork containing all architecture possibilities (weight-sharing) saves computational resources, but there are no guarantees that the ranking of individual networks using shared weights is valid [27]. Our search space consists of fairly small networks that take one tenth of the time to train compared to a larger network like MobileNetV2; the typical network in our search has between 15 k and 40 k parameters (see Fig. 1), whereas MobileNetV2 uses more than 2000 k parameters. As a result, we did not explore NAS algorithms that train a single super (meta) network to avoid the expense of training candidate networks individually [1]. We instead accept a more computationally intense search, training each sampled network individually to three quarters of the full training time, with networks that appear unpromising stopped early.

Our search uses Vizier's algorithms to suggest candidate networks. Vizier suggests block types and their associated numbers of filters (filter counts) for the twelve blocks in our network (note, the first and second block types are fixed). For each suggested network, we calculate the memory footprint after converting the network to TFLite and use the random forest model described earlier to predict its energy usage. We then train the network for a fixed number of training steps, periodically evaluating validation-set accuracy, or until Vizier determines the network is not likely to be a top candidate (early stopping). After training each sampled architecture on the training dataset, we evaluate the reward on the evaluation dataset. The architectures with the best rewards are retrained five times for 33% longer on the same training set, with no early stopping, and retested on the evaluation dataset. The best three candidates from each NAS are then tested on the unseen test dataset; their accuracy is reported in this paper, and their energy usage is measured ten times on a mobile phone with the mean and standard deviation reported.
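Put together, one search trial looks roughly like the following pseudocode sketch; `optimizer`, `build_candidate`, and `train_and_eval` are placeholders for our infrastructure (not an actual Vizier API), and `nas_reward` and `tflite_int8_size_kb` refer to the sketches given earlier in Sects. 3 and 3.5:

```python
def run_trial(optimizer, energy_model, build_candidate, train_and_eval):
    """One NAS trial: suggest -> predict cost -> train with early stopping -> report reward.

    `optimizer` wraps the Vizier study, `energy_model` is the random forest of Sect. 3.4,
    `build_candidate` turns suggested parameters into a Keras model, and `train_and_eval`
    yields (step, validation_accuracy) pairs during training. All are placeholders.
    """
    params = optimizer.suggest()                                  # block types, kernels, filters
    model = build_candidate(params)

    memory_kb = tflite_int8_size_kb(model, rep_spectrograms=[])   # Sect. 3.5 sketch
    energy_mj = float(energy_model.predict([params.feature_vector])[0])  # Sect. 3.4

    reward = 0.0
    for step, val_accuracy in train_and_eval(model):
        reward = nas_reward(val_accuracy, energy_mj, memory_kb)   # Eq. 2 sketch
        optimizer.report_intermediate(params, step, reward)
        if optimizer.should_stop_early(params):                   # Sect. 6.1
            break
    optimizer.complete(params, reward)
```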

We employ two different search algorithms from Vizier for the NAS, one Bayesian and the other evolutionary, and run two thousand trials in each NAS experiment. Section 3 of Golovin’s [7] paper describes the Bayesian algorithm. The evolutionary HyperFirefly algorithm is an extension of the Firefly algorithm which uses regularization and particle swarms [26]. The Firefly hyperparameters are tuned by another Firefly algorithm every 50 iterations, using an objective metric equal to the best objective value over a sliding window of 50 iterations.

Vizier’s Bayesian algorithm slows down considerably (i.e., requires more time to produce a new suggestion) for our search space after a thousand trials. This is because a Gaussian process algorithm has \(O(N^3)\) complexity, where N is the number of parameters multiplied by the number of trials. As a result, we switch from using the Gaussian process algorithm to the HyperFirefly algorithm after one thousand trials. Combining evolutionary algorithms with Bayesian approaches has been done before [27]. The results we obtain seem to generally favor the HyperFirefly (evolutionary) algorithm over the Bayesian (hybrid) algorithm. This could be due to the evolutionary algorithm being more explorative for the first one thousand trials.

6.1 Early stopping algorithm

Vizier can decide to stop training a network early if it finds it unpromising. After Vizier suggests an architecture to train, the memory size and the predicted energy of the architecture are sent back to Vizier. In addition, the model in training is periodically evaluated and the intermediate evaluation accuracy is sent back to Vizier. If Vizier's early stopping model predicts with high confidence that the current trial (architecture) will result in an objective worse than the best seen so far, the trial is stopped early. Early stopping, or performance curve stopping, in Vizier is described in section 3.2 of Golovin's paper [7]. This rule uses a Gaussian Process (GP) with a custom kernel to regress the evaluation curves of all available trials, where each input feature to the GP is a time bucket in the time series.

Temporal spatial stopping (TSGP) learns a Gaussian Process model for each time series, using the exponential curve kernel ([20] Eq. 6). The model also learns a mean function, at the asymptote, for each time series, and a mapping from the trial parameters to kernel parameters, allowing cross-trial information sharing. This allows Vizier to make automated stopping predictions about each time series, which are informed by both a strong exponential prior, and the trial parameters.

We compared no early stopping, exponential decaying early stopping with default parameters, and exponential decaying early stopping with TSGP-learned parameters. Experimentation with the three methods returned very similar rewards. The TSGP early stopping used the least computation (50% less than forgoing early stopping), and as a result we employed it for our search.

7 Results

Table 2 NAS results and benchmarks

Table 2 reports the energy usage, memory size and accuracy of the best NAS results and two types of baseline models: MobileNet and TCResNet. We include MobileNetV1/V2 as baseline models since they are large relative to our search space (i.e., they have more capacity) and have been applied successfully to audio classification tasks [24]. The MobileNetV2 benchmark uses an expansion parameter of six. At the other end of the model size spectrum, we include TCResNet models, which are known to perform well on speech command recognition while requiring few FLOPs and little static memory. We use two different TCResNet model sizes, the TCResNet8 with a width multiplier of one (labeled TCResNet8-1) and the TCResNet14 with a width multiplier of 1.5 (TCResNet14-1.5). Note, the benchmark models had to be slightly modified from their original paper versions to work with a 196\(\times\)40 spectrogram input, since MobileNets are designed to run on square input images and the TCResNet is designed to run on a 96\(\times\)40 spectrogram. The baseline models are all quantized with the same int-8 quantization-aware training used for the NAS architectures. Table 2 reports each model's mean task accuracy over five training runs to remove any bias from the initial starting condition. NAS runs done with the less harsh energy penalty \(b'=\frac{0.02}{1.75}\) are marked with an accent suffix (b') in the table. We visualize the NAS results from Table 2 in Fig. 6.

Fig. 6 Mean energy per inference plotted against the mean quantized accuracy

NAS-HyperFirefly, the best network when using the harsh energy penalty (the HyperFirefly suffix means the HyperFirefly regularized evolution with particle swarm algorithm was used), achieves slightly worse accuracy than MobileNetV1 but better accuracy than MobileNetV2. Compared to MobileNetV2, it uses 50\(\times\) less memory and 10\(\times\) less energy. NAS-HyperFirefly-b', which was run with the less harsh energy penalty, uses more energy than NAS-HyperFirefly but also achieves better accuracy; it achieves slightly better mean accuracy than MobileNetV1, the baseline model with the best mean task accuracy.

The networks using one-dimensional convolutions, as in TCResNet, tend to use very little energy (Table 2) but all return poor accuracy. 1D-NAS-Bayesian is the best NAS result when using only one-dimensional convolutions. It achieves significantly better accuracy than both TCResNet baselines with only slightly more energy usage than TCResNet8-1.0.

The mean energy per inference measured on the Pixel 4 XL CPU is not far from the predictions of the random forest model used during the NAS. For example, NAS-HyperFirefly uses 1.27 mJ per inference on average where the RF prediction used by Vizier during the NAS was 1.35 mJ. Similarly, NAS-HyperFirefly-b' uses 1.72 mJ per inference on average where the RF prediction was 1.69 mJ.

8 Discussion

Table 1 shows the best performing network structure found by NAS-Bayesian. Of interest is that the network uses kernel size 5 twice: the larger receptive field must allow the network to improve task accuracy despite being more computationally expensive. The network uses both MobileNetV1 and V2 blocks as well as the new MobileNetV2-Avg-Pool and K\(\times\)1-1\(\times\)K-DW blocks introduced in this paper. This block-type heterogeneity agrees with what many NAS authors have found, namely that it can be beneficial to have different types of block structures [3]. It also shows that MobileNetV2 blocks are not always superior to MobileNetV1 blocks when accuracy, energy and memory usage are all taken into account.

The number of output filters in the first block of the network creates a computational bottleneck when using 2D convolutions on a spectrogram. Table 1 shows the FLOPs of each block in the optimal NAS-Bayesian network: the second block uses 2 M FLOPs even though it has only 8 input filters, a kernel size of 3, and 24 output filters. If the number of input filters to the second block were instead 24, the FLOP count would triple to 6 M.

The computational burden of the number of output filters in the first block is also seen in the energy usage of the network. The two most important features of our RF energy model are the number of filters in the first block (59%) and the total number of FLOPs in the network (30%); each remaining feature contributes 2% or less. This underlines the importance of the number of output filters in the first block, which creates a computational bottleneck when the network takes a spectrogram input.

MobileNetV1 performs better than MobileNetV2 (expansion of 6) on this dataset. One of the main differences between the two is a 1000-node dense embedding layer at the end of MobileNetV1. The lack of this embedding layer may partly explain the poorer performance of MobileNetV2, which uses 700 kB less memory than MobileNetV1.

We note that the poor accuracy of the 1D NAS variants was to be expected, since the input image to the second block is now a 2D image (no longer 3D). However, 1D-NAS-Bayesian uses 0.26 mJ per inference compared to 1.3 mJ for the NAS-Bayesian model. For some applications, such as those deployed on batteries smaller than a mobile phone's (earphones, smartwatches), we can envision the 1D model being favored because it uses five times less energy per inference. For such low-power use cases, we suggest further research into expanding the search space of 1D convolutional blocks and/or combining this approach with model compression techniques (e.g., weight pruning).

Our search took roughly 15,000 GPU hours for the 2D NAS search sampling 2000 candidate networks. We used about one fifth that time for the 1D NAS variants. We did not experiment with other NAS methods to reduce the search’s computational burden, which is something we would like to explore in the future.

The NAS approach presented in this paper succeeds in finding a model (NAS-HyperFirefly) that is 10\(\times\) more energy efficient than MobileNetV2 while improving absolute mean accuracy by 0.94%. Compared to MobileNetV1, we find a 4\(\times\) more energy-efficient network (NAS-HyperFirefly-b') that uses more than 75\(\times\) less memory and achieves a 0.03% improvement in absolute mean accuracy. For always-on audio classification, we have shown that incorporating on-device energy usage into the NAS reward function through a weighted sum successfully optimizes the combination of energy efficiency, accuracy and memory usage of always-on models.

We believe this approach is general enough to be transferable to other domains (e.g., portable biomedical devices, video processing, sensor fusion). Finding models whose energy and memory usage are as small as possible for a given accuracy (i.e., Pareto optimal) is important for deploying larger and better networks on lightweight batteries and on small, cheap SRAM chips. The use of Vizier in this study removes significant software barriers to NAS adoption, and we advocate its use in future studies due to the ease of use of its API.