Multicore implementation of a multichannel parallel graphic equalizer

Numerous signal processing applications are emerging on mobile computing systems. These applications are subject to responsiveness constraints for user interactivity and, at the same time, must be optimized for energy efficiency. Many current embedded devices are composed of low-power multicore processors that offer a good trade-off between computational capacity and low power consumption. In this context, equalizers are widely used in multiple mobile-based applications such as “Music streaming” to adjust the levels of bass and treble in sound reproduction. In this study, we evaluate a graphic equalizer from audio, computational capacity, and energy efficiency perspectives, as well as the execution of multiple real-time equalizers running on an embedded quad-core processor of a mobile device. To this end, we experiment with the working frequencies as well as the parallelism that can be extracted from a quad-core ARM Cortex-A57. Results show that using high CPU frequencies and three or four cores, our parallel algorithm is able to equalize more than five channels per watt in real time with an audio buffer of 4096 samples, which implies a latency of 92.8 ms at the standard sample rate of 44.1 kHz.


Introduction
Low-power (embedded) processors play an important role in a myriad of signal processing applications, such as communications [1], image processing [2], visual detection [3], speech recognition [4], and audio processing [5], among others. In the era of smartphones and tablets, these energy-efficient architectures have significantly increased their computational capacity and are nowadays used in a large volume of multimedia applications, including video and audio processing. In this context, equalizers are widely used in mobile applications such as music streaming to adjust the levels of bass and treble in sound reproduction [6]. In fact, equalizing filters are used for improving the frequency response of loudspeakers or headphones [7][8][9] and for reducing the effects of room acoustics on the sound quality [10,11]. A graphic equalizer consists of many filters with fixed center frequencies, and the gain of each filter, which is often called the command gain, is the only adjustable parameter [12], as shown in Fig. 1. A graphic equalizer can be implemented as a cascade of equalizing filters [12,13] or as a parallel bank of bandpass filters [12,14]. The parallel structure is more advantageous than the series one in terms of quantization noise performance [15], and it also supports the parallel computation of the filter sections, leading to a performance benefit on GPUs, for example [16][17][18]. In this study, we evaluate the parallel graphic equalizer (PGE). Note, however, that here we do not use the parallel structure of the filter for parallel computation. Instead, the parallelization comes from computing multiple channels at the same time. This also means that our findings are applicable to series equalizers as well, and to any other real-time audio algorithm that must be computed for multiple channels.
However, the PGE has the disadvantage that updating its parameters requires about a hundred times more operations than the update in a standard equalizer [14]. A parameter update is necessary during interactive operation every time a command gain is modified. When the parameters are redesigned, a complex target response must be produced from the command gain values. This has been made more efficient by computing the target response based on minimum-phase basis functions and a more efficient weighted least squares (WLS) design in [19].
Nowadays, there exist systems-on-chip (SoCs) that deliver notable computational capacity while partially retaining the appealing low power consumption. One example of this type of system is the NVIDIA Jetson Nano board [20], which is based on a quad-core ARM Cortex-A57 CPU at 1.43 GHz with 4 GiB of LPDDR4 memory and a 128-core NVIDIA Maxwell GPU. One of its main features is its low power consumption combined with several levels of parallelism, yielding a very high performance per watt. Moreover, it also offers the possibility of adjusting its energy consumption by reducing the frequency of the CPU cores or the GPU.
Up to now, there have been studies analyzing the computational performance of the filtering process of the PGE without taking into account the mandatory updating process that occurs when the sliders change (see Fig. 1) [18], as well as works implementing this updating process sequentially [21]. In this work, we analyze the performance of the whole system in terms of audio processing and energy consumption as a function of the CPU operating frequency and the number of cores used. The aim of this work is to find an efficient implementation that exhibits a proper trade-off between performance and energy consumption while respecting the real-time constraint.
This paper is structured as follows. Section 2 offers a brief overview of the parallel graphic equalizer. Section 3 describes the multicore implementation. Section 4 explores the performance of the PGE in terms of the maximum number of audio channels that can be rendered in real time, provides a detailed analysis of the power dissipation, and analyzes the energy efficiency of different hardware configurations. Finally, Sect. 5 closes the paper with concluding remarks.

Parallel graphic equalizer
Equalizers correct or enhance the spectrum of a signal in order to meet a desired requirement. Equalizers are widely used in music production and in sound reproduction to control the timbral balance of music [22,23], as well as to reduce the effects of room acoustics on the sound quality [10]. In graphic equalizers, the user controls the gain of each frequency band using a set of sliders that modify the desired magnitude response [12,14,24,25,26]. Common graphic equalizers control the gain at 31 frequency bands spaced one third of an octave apart.
The basic idea of the PGE [14,19] is that, based on the slider positions set by the user (top plot in Fig. 1), a smooth target frequency response $H_t(\omega_n)$ is computed, where $\omega_n$ ($n = 1, 2, \ldots, N$) represents a finite set of angular frequencies. Then, the next step is the parameter estimation for the parallel IIR filters. The filter structure is composed of a set of parallel second-order sections having the transfer function

$$H(z) = d_0 + \sum_{k=1}^{K} \frac{b_{k,0} + b_{k,1} z^{-1}}{1 + a_{k,1} z^{-1} + a_{k,2} z^{-2}}, \qquad (1)$$

where $d_0$ is called the direct path gain and $K$ is the number of filter sections.
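As an illustration of the parallel structure in (1), the following minimal sketch evaluates the frequency response of a bank of second-order sections on the unit circle. The function name `parallel_response` and the two-section coefficient set are hypothetical and chosen only for the example; they are not taken from the paper.

```python
import cmath

def parallel_response(omega, sections, d0):
    """Frequency response of a parallel bank of second-order sections.

    sections: iterable of (b0, b1, a1, a2) tuples, one per section.
    d0: direct path gain.
    """
    z1 = cmath.exp(-1j * omega)   # z^{-1} evaluated at z = e^{j*omega}
    z2 = z1 * z1                  # z^{-2}
    h = complex(d0)
    for b0, b1, a1, a2 in sections:
        h += (b0 + b1 * z1) / (1.0 + a1 * z1 + a2 * z2)
    return h

# Hypothetical two-section filter (illustrative coefficients only).
sections = [(0.5, -0.2, -1.6, 0.81), (0.3, 0.1, -0.9, 0.64)]
h_dc = parallel_response(0.0, sections, d0=1.0)   # response at omega = 0
```

Because the sections are summed rather than cascaded, each term can be evaluated independently of the others.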
The first task in fixed-pole parallel filter design is setting the pole positions, which control the frequency resolution of the design [27]. It was shown in [14] that having twice as many pole pairs as command points and placing them logarithmically in frequency provides sufficient resolution. The pole radii $|p_k|$ are set such that the magnitude responses of the parallel sections meet approximately at their −3-dB points [14,27]. The transfer function (1) becomes linear in its free numerator parameters $b_{k,0}$, $b_{k,1}$, and $d_0$ when the denominator coefficients are determined by the fixed poles. Writing (1) in matrix form yields

$$\mathbf{h} = \mathbf{M}\mathbf{p}, \qquad (2)$$

where $\mathbf{h} = [H(\omega_1) \ldots H(\omega_N)]^T$ is a column vector containing the resulting frequency response of the parallel filter, $\mathbf{p} = [b_{1,0}, b_{1,1}, \ldots, b_{K,0}, b_{K,1}, d_0]^T$ is a column vector containing the free parameters, and $\mathbf{M}$ is a modeling matrix, which contains the sampled frequency responses of the second-order all-pole filters $1/(1 + a_{k,1} e^{-j\omega_n} + a_{k,2} e^{-j2\omega_n})$ in its odd columns, their delayed versions (multiplied by $e^{-j\omega_n}$) in the even columns, and last, a column of 1's, which corresponds to the constant frequency response of the direct path. As in [14], we use a frequency-dependent weighting for LS error minimization [28,29], in the form of a diagonal matrix $\mathbf{W}$ whose values are computed by the weighting function $W(\omega_n) = 1/|H_t(\omega_n)|$, where $W(\omega_n)$ is a real-valued non-negative weight at frequency $\omega_n$.
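The column layout of the modeling matrix can be sketched as follows. This is a minimal illustration, assuming hypothetical denominator coefficients and frequency points; the helper names `modeling_row` and `weight` are not from the paper. It builds the matrix row by row and forms the product of the matrix with a parameter vector ordered as described above.

```python
import cmath

def modeling_row(omega, denominators):
    """One row of the modeling matrix at angular frequency omega.

    denominators: list of (a1, a2) pairs fixed by the pole positions.
    Column order: all-pole response, its delayed version, ..., then 1.
    """
    z1 = cmath.exp(-1j * omega)
    row = []
    for a1, a2 in denominators:
        allpole = 1.0 / (1.0 + a1 * z1 + a2 * z1 * z1)
        row.append(allpole)        # odd column, multiplies b_{k,0}
        row.append(allpole * z1)   # even column, multiplies b_{k,1}
    row.append(1.0 + 0j)           # last column, multiplies d_0
    return row

def weight(h_target):
    """Weight 1/|H_t| as stated in the text."""
    return 1.0 / abs(h_target)

denominators = [(-1.6, 0.81), (-0.9, 0.64)]   # hypothetical fixed poles
p = [0.5, -0.2, 0.3, 0.1, 1.0]                # [b10, b11, b20, b21, d0]
omegas = [0.1, 0.5, 1.0, 2.0]                 # hypothetical frequency grid
M = [modeling_row(w, denominators) for w in omegas]
h = [sum(m * x for m, x in zip(row, p)) for row in M]   # h = M p
```

Since the poles are fixed, the rows depend only on the frequency grid, so the matrix itself does not change when the sliders move; only the target response and the weights do.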
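Finally, we compute the optimal filter parameters $\mathbf{p}_{\mathrm{opt}}$. Note that this computation must be carried out each time the sliders change, since $\mathbf{W}$ and the target response vector $\mathbf{h}_t$ are updated based on the slider positions:

$$\mathbf{p}_{\mathrm{opt}} = (\mathbf{M}^H \mathbf{W} \mathbf{M})^{-1} \mathbf{M}^H \mathbf{W} \mathbf{h}_t.$$

To this end, we use a QR decomposition, where $\mathbf{Q}$ is an orthogonal matrix and $\mathbf{R}$ is an upper triangular matrix. In the computation of the pseudo-inverse, $\mathbf{R}$ consists of a square upper triangular block $\mathbf{R}_1$ above a block of zeros, and $\mathbf{Q} = [\mathbf{Q}_1\ \mathbf{Q}_2]$ is partitioned accordingly. Thus, we can discard those elements of $\mathbf{Q}$ that are multiplied by zeros. A second decomposition is then required, after which $\mathbf{p}_{\mathrm{opt}}$ is obtained by solving three triangular linear systems. The explicit expressions of the pseudo-inverse, the second decomposition, and the three triangular systems are not reproduced here.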
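To make the role of the QR factors more concrete, the following sketch solves a small weighted least-squares problem with a thin QR factorization followed by back-substitution. It is a generic textbook procedure under simplifying assumptions (real-valued data, modified Gram-Schmidt, toy numbers), not the exact sequence of decompositions and triangular solves used in the paper, and a production code would call LAPACK routines instead. All function names are hypothetical.

```python
import math

def thin_qr(A):
    """Thin QR by modified Gram-Schmidt: A (m x n, m >= n) = Q1 * R1."""
    m, n = len(A), len(A[0])
    cols = [[A[i][j] for i in range(m)] for j in range(n)]
    Q = []                                   # orthonormal columns of Q1
    R = [[0.0] * n for _ in range(n)]        # upper triangular R1
    for j in range(n):
        v = cols[j]
        for k in range(j):
            R[k][j] = sum(q * x for q, x in zip(Q[k], v))
            v = [x - R[k][j] * q for x, q in zip(v, Q[k])]
        R[j][j] = math.sqrt(sum(x * x for x in v))
        Q.append([x / R[j][j] for x in v])
    return Q, R

def back_substitute(R, y):
    """Solve the upper triangular system R x = y."""
    n = len(y)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = y[i] - sum(R[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / R[i][i]
    return x

def weighted_least_squares(A, b, w):
    """Minimize sum_i w[i] * (A[i] . p - b[i])**2 using a thin QR."""
    sw = [math.sqrt(wi) for wi in w]
    Aw = [[s * a for a in row] for s, row in zip(sw, A)]   # W^(1/2) A
    bw = [s * bi for s, bi in zip(sw, b)]                  # W^(1/2) b
    Q, R = thin_qr(Aw)
    y = [sum(q * x for q, x in zip(col, bw)) for col in Q]  # Q1^T b_w
    return back_substitute(R, y)

# Toy data lying exactly on y = 1 + 2x, with arbitrary positive weights.
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
b = [1.0, 3.0, 5.0, 7.0]
w = [1.0, 0.5, 2.0, 1.0]
p_opt = weighted_least_squares(A, b, w)
```

Only the first block of the orthogonal factor is ever formed here, which is the same observation the text makes about discarding the part that is multiplied by zeros.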

Multicore implementation
The aim of this work is to analyze how an embedded quad-core processor is able to manage a multichannel parallel graphic equalizer. To this end, we leverage OpenMP [30] to parallelize both the computation of the filter coefficients by the weighted least squares method and the filtering process, which is carried out through each of the 62 second-order sections that compose each channel to be processed, as shown in Fig. 2. Note that all channels can have different equalizer settings, so the coefficients of each channel must be computed separately. This means that different OpenMP threads can perform all the steps of different equalizations in parallel: for each equalization, the thread first computes the filter coefficients and then filters the audio buffer.
In order to execute the operations that update the coefficients of the filter, we use various software libraries for numerical linear algebra, such as LAPACK (Linear Algebra Package) [31] and BLAS (Basic Linear Algebra Subprograms) [32]. Since we are dealing with matrices and vectors of small sizes, instead of using the multi-threaded versions of these libraries, we use their sequential versions in combination with the parallelism provided by OpenMP to equalize different channels in parallel. Table 1 shows the routines that have been used to carry out the updating computations of the filter coefficients.
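The channel-level decomposition described above can be sketched structurally as follows. This is only an illustration of the work distribution, written with Python's `concurrent.futures` rather than OpenMP; the functions `design_coefficients` and `filter_buffer` are hypothetical placeholders that stand in for the WLS design and the 62-section filtering, and CPython threads do not provide true CPU parallelism for this kind of workload.

```python
from concurrent.futures import ThreadPoolExecutor

def design_coefficients(gains_db):
    # Placeholder for the per-channel coefficient update:
    # here each "coefficient" is just a linear gain.
    return [10.0 ** (g / 20.0) for g in gains_db]

def filter_buffer(coeffs, samples):
    # Placeholder for the filtering stage: scale by the mean gain.
    scale = sum(coeffs) / len(coeffs)
    return [scale * s for s in samples]

def equalize_channel(job):
    gains_db, samples = job
    coeffs = design_coefficients(gains_db)   # step 1: update coefficients
    return filter_buffer(coeffs, samples)    # step 2: filter the buffer

def equalize_all(jobs, num_workers):
    # One independent job per channel; each worker runs both steps.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(equalize_channel, jobs))

# Two hypothetical channels with different equalizer settings.
jobs = [([0.0, 6.0], [1.0, -1.0, 0.5]), ([-6.0, -6.0], [0.2, 0.4, 0.6])]
outputs = equalize_all(jobs, num_workers=4)
```

Because no data are shared between channels, no synchronization is needed beyond waiting for all channels of a buffer to finish.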

Evaluation platform
We have performed our experiments on a Jetson Nano development kit, which is a platform launched by NVIDIA in 2019. It combines low cost, high computational performance and low power consumption. Our platform contains a quad-core ARM Cortex-A57 processor, an NVIDIA GPU with 128 cores and a Maxwell architecture, and four GiB of LPDDR4 memory, among other components for developing all kinds of applications.
All our experiments were executed on the A57 multicore processor designed by ARM, which implements the 64-bit ARMv8-A architecture. It contains two MiB of L2 cache, 48 KiB of L1 instruction cache and 32 KiB of L1 data cache. Each core includes a vector floating point unit and supports the NEON SIMD extension.

Experimental evaluation

Experimental parameters
In this section, we show the experimental evaluation of the parallel graphic equalizer on the four cores of the Cortex-A57 processor. In the experiments, we have analyzed the effects of using from one to four cores in terms of both the execution time and the energy consumption of the algorithm. We have evaluated the effect of using the 14 different frequencies allowed by the A57 processor, which vary from 102 to 1479 MHz. We have performed experiments using different buffer sizes varying from 64 to 8192 samples. One of the goals of our evaluation is to assess the effect of the buffer size on the maximum number of channels that can be processed in real time. In addition, the experiments have been carried out with two main objectives. The first objective was to find the combination of number of cores and frequency that maximizes the number of channels that can be processed in real time without taking into account the power consumed by the device. The second objective was to take the power consumption into account and find the combination of number of cores and frequency that maximizes the number of channels processed per watt in real time.

Execution time
First, we analyze the evolution of the execution time of the algorithm with the number of channels and the effect of parameters such as the buffer size and the number of CPU cores. We run our first experiments at the maximum CPU frequency (1479 MHz) and we use an audio buffer size of 4096 samples, which implies a real-time threshold of 92.8 ms at the standard audio sampling frequency of 44.1 kHz. Figure 3 shows that the execution time increases linearly with the number of channels and that it can be substantially reduced by using more cores. As a consequence, by increasing the number of cores we can also increase the number of channels that can be processed in real time. For example, if we execute the sequential version of the algorithm on one core, we can process up to seven channels in real time. However, if we execute the parallel version using the four cores of the CPU, we can process up to 27 channels in real time.
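The real-time threshold follows directly from the buffer size and the sampling frequency: 4096 / 44100 s is approximately 92.88 ms, which the text quotes as 92.8 ms. The sketch below computes this budget and an idealized upper bound on the number of channels, assuming a hypothetical constant processing time per channel and perfectly independent cores. The value of 12.5 ms per channel is an assumption chosen for illustration, not a measurement from the paper.

```python
import math

def realtime_threshold_ms(buffer_size, sample_rate_hz=44100.0):
    """Time available to process one buffer before the next one arrives."""
    return 1000.0 * buffer_size / sample_rate_hz

def max_realtime_channels(time_per_channel_ms, num_cores, buffer_size,
                          sample_rate_hz=44100.0):
    """Idealized bound: each core processes whole channels one after another."""
    budget_ms = realtime_threshold_ms(buffer_size, sample_rate_hz)
    per_core = math.floor(budget_ms / time_per_channel_ms)
    return num_cores * per_core

budget = realtime_threshold_ms(4096)
one_core = max_realtime_channels(12.5, 1, 4096)    # hypothetical 12.5 ms/channel
four_cores = max_realtime_channels(12.5, 4, 4096)
```

With this assumed per-channel time, the idealized model gives 7 channels on one core and 28 on four cores; the measured value of 27 channels reported above is slightly lower, as expected once parallel overheads are present.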
To analyze the advantage of using more CPU cores, we show in Fig. 4 the speedup obtained in the same experiments shown in Fig. 3. If we take as optimum a speedup equal to the number of cores, then we obtain the optimal speedup when using two and three cores and a near-optimal one when using four cores. The sawtooth pattern shown by the lines is due to the load balancing effect on the different cores. That is, we obtain the best speedup when every core processes the same number of channels. Logically, when we increase the total number of channels to process, the load imbalance is reduced and so is the height of the teeth.
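The sawtooth shape can be reproduced with a simple idealized model, assuming every channel costs the same time and each core processes whole channels: the parallel time is then proportional to the largest number of channels assigned to any core. The function name `ideal_speedup` is hypothetical.

```python
import math

def ideal_speedup(num_channels, num_cores):
    """Speedup when the busiest core handles ceil(C / P) channels."""
    return num_channels / math.ceil(num_channels / num_cores)

# Speedup on four cores for 1 to 12 channels.
speedups = [ideal_speedup(c, 4) for c in range(1, 13)]
```

In this model the speedup reaches 4 whenever the number of channels is a multiple of four and drops right after it (2.5 at five channels, 3.0 at nine, 3.25 at thirteen), so the teeth become shallower as the number of channels grows.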
Next we analyze the effect of increasing the buffer size on the maximum number of channels that can be processed in real time. Figure 5 shows that the number of channels increases at a steep rate for small buffer sizes, but tends to converge to a maximum for larger buffer sizes. This can be explained by the fact that the computational load is made up of two factors: first, the computation of the filter coefficients, which is done only once for each buffer, and second, the filtering process, which is applied sample by sample and thus depends linearly on the buffer size. Therefore, at small buffer sizes, where the coefficient calculation dominates the computational load, increasing the buffer size decreases the average load per sample significantly and thus increases the number of channels we can process.
In contrast, at larger buffer sizes the computational load is dominated by the filtering process, and thus there is no further advantage in increasing the buffer size. The disadvantages of too large a buffer are a larger processing latency and a lower rate at which the user can update the filter parameters. Since we approach the maximum throughput with a buffer size of 4096 samples, we use this size in the rest of the experiments. Note that another way of increasing the computational efficiency would be to calculate the filter coefficients not for each buffer, but only for every tenth buffer, for example, while using smaller buffer sizes. However, this has not been investigated in the present study.
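The saturation with the buffer size can be illustrated with a two-term cost model: a fixed coefficient-design cost per buffer plus a filtering cost proportional to the number of samples. The constants below (1.0 ms per design and 0.0029 ms per sample on a single core) are hypothetical values chosen for illustration, not measurements from the paper.

```python
def max_channels(buffer_size, t_design_ms, t_sample_ms, sample_rate_hz=44100.0):
    """Channels that fit in the real-time budget on a single core."""
    budget_ms = 1000.0 * buffer_size / sample_rate_hz
    per_channel_ms = t_design_ms + buffer_size * t_sample_ms
    return int(budget_ms // per_channel_ms)

# Hypothetical cost constants; buffer sizes as in the experiments.
sizes = [64, 256, 1024, 4096, 8192]
channels = [max_channels(n, 1.0, 0.0029) for n in sizes]
```

With these assumed constants the model yields 1, 3, 5, 7 and 7 channels: the count grows quickly while the design cost dominates and then flattens towards the limit set by the per-sample filtering cost alone.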

Energy consumption
One of the main goals of using an SoC, such as the one included in the Jetson Nano board, is to combine high computational performance with low energy consumption. In this section, we evaluate the effect of modifying the frequency of the CPU cores on the number of channels that can be processed in real time, and also on the energy consumption of the platform. We obtain the energy consumption as the product of the time taken by our algorithm to complete and the measured power dissipation of the whole platform. As we want to measure only the energy consumed by our algorithm, we first subtract the power dissipated by the operating system processes while running at the default frequency of the CPU.
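The energy computation described above amounts to a single product, sketched here with hypothetical numbers (80 ms of execution time, 4.5 W measured on the platform, 1.5 W of baseline power); none of these values come from the paper.

```python
def algorithm_energy_mj(exec_time_ms, measured_power_w, baseline_power_w):
    """Energy attributed to the algorithm, in millijoules (W * ms = mJ)."""
    return (measured_power_w - baseline_power_w) * exec_time_ms

# Hypothetical measurement: 80 ms at 4.5 W, with a 1.5 W baseline.
energy = algorithm_energy_mj(80.0, 4.5, 1.5)
```

Subtracting the baseline before multiplying means that a faster configuration is only more energy-efficient if its extra power draw is outweighed by the time it saves.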

Fig. 5 Effect of the buffer size on the maximum number of channels that can be processed in real time

Firstly, we evaluate the effect of reducing the frequency of the cores on the processing time of eight channels using a buffer of 4096 samples. Figure 6 shows that using only one core we cannot process eight channels in real time even at the maximum frequency. However, if we use two cores, we can reach this number over a wide range of CPU frequencies. As we increase the number of cores, we can process eight channels in real time at very low frequencies which, as we will see, can greatly reduce the energy consumption of the algorithm. Figure 7 shows how the maximum number of channels that we can process in real time evolves with both the number of cores and the CPU frequency. The combination of these two parameters increases that number from only one channel using one core at 307.2 MHz to 27 channels using four cores at the maximum frequency. The figure also shows that with the two lowest CPU frequencies we cannot process any channel in real time, even using the four available cores.
Next we show how the CPU frequency and the number of cores affect the energy consumed by the algorithm. Figure 8 shows that as we increase the number of cores, we reduce not only the processing time, but also the energy consumed by the algorithm to process eight channels. Regarding the CPU frequency, increasing it does not always reduce the energy consumption. As a matter of fact, the lowest consumption is obtained using four cores at 1132.8 MHz. Moreover, using frequencies as low as 307.2 MHz with four cores, we approach the most efficient case in terms of energy consumption.
On the other hand, if we want to increase the number of channels processed in real time, we have to increase the frequency or the number of cores. However, we pay a price in terms of energy consumption, as can be seen in Fig. 9.
Finally, in order to assess the energy efficiency of the algorithm, we use a metric that combines the computational performance and the energy consumption. On many occasions, we may be interested in reducing the time required to process each channel, but without an excessive increase in power consumption. For this purpose, we are interested in finding the configuration where the energy consumption of processing each channel is minimal. Moreover, in our case we keep the constraint that the channels must be processed in real time. To calculate this value, we divide the maximum number of channels processed in real time, shown in Fig. 7, by the power dissipated by the platform during their processing. In this way, we obtain an efficiency metric that gives the number of channels per watt in real time. Figure 10 shows again that the maximum energy efficiency is obtained …

Fig. 10 Effect of the CPU frequency and number of cores on the number of channels per watt that can be processed in real time
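The efficiency metric defined above can be sketched as a ratio and a search over configurations. The table below is hypothetical: the channel counts echo values mentioned in the text, but the power figures are invented for the example and do not reproduce the measurements in Fig. 10.

```python
def channels_per_watt(max_channels, power_w):
    """Real-time channels divided by the power dissipated while processing."""
    return max_channels / power_w

def best_configuration(table):
    """Return the (cores, freq_mhz) key with the highest channels per watt.

    table: {(cores, freq_mhz): (max_realtime_channels, power_w)}
    """
    return max(table, key=lambda cfg: channels_per_watt(*table[cfg]))

# Hypothetical power values; not measured data.
table = {
    (1, 1479.0): (7, 2.0),
    (4, 307.2): (4, 1.6),
    (4, 1479.0): (27, 5.0),
}
best = best_configuration(table)
```

A metric of this form rewards configurations that add channels faster than they add power, which is why neither the lowest nor necessarily the highest frequency has to be the most efficient one.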

Conclusions
In this work, we have analyzed the performance of a multichannel parallel graphic equalizer on an embedded multicore system from audio, computational performance and energy efficiency perspectives. Specifically, we have used the quad-core ARM Cortex-A57 that is embedded in the NVIDIA Jetson Nano board for our experiments. We have aimed for a configuration that provides an efficient trade-off between computational performance and energy consumption. To this end, we have analyzed different configurations, varying the number of cores, the size of the audio buffer, and the frequency of the CPU cores. From the results, we conclude that it is not necessary to exploit all the cores or to work at the maximum frequency in order to achieve an efficient implementation. Our experiments indicate that using a high CPU frequency and three or four cores, our parallel algorithm is able to equalize more than five channels per watt in real time using an audio buffer of 4096 samples, which implies a latency of 92.8 ms at the standard sample rate of 44.1 kHz.