A Low gate count reconfigurable architecture for biomedical signal processing applications

A new reconfigurable architecture for biomedical applications is presented in this paper. The architecture targets frequently encountered functions in biomedical signal processing algorithms thereby replacing multiple dedicated accelerators and reports low gate count. An optimized implementation is achieved by mapping methodologies to functions and limiting the required memory leading directly to an overall minimization of gate count. The proposed architecture has a simple configuration scheme with special provision for handling feedback. The effectiveness of the architecture is demonstrated on an FPGA to show implementation schemes for multiple DSP functions. The architecture has gate count of ≈\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\approx$$\end{document}25k and an operating frequency of 46.9 MHz.


Introduction
Advances in Integrated Circuit (IC) technology have resulted in miniaturized devices useful for wearable biomedical applications. Custom and configurable solutions for Electrocardiogram (ECG) processing have been discussed extensively in the literature [1][2][3][4]. As an example, [5] discusses a low-energy microprocessor ( P ) for R-wave detection, heart rate calculation and arrhythmia based on the R-R interval with a 78% detection accuracy. The use of microprocessors provides flexibility in configuration that translates to cost-effective and time-efficient solutions but adversely affects the power budget for compute-intensive operations. Researchers in [6] discuss low power system that combines off-the-shelf P with dedicated ECG processing for data compression and arrhythmia detection. In [7,8], custom accelerators for DSP functions used in biomedical signal processing are discussed. These accelerators in conjunction with a customized low power P are used for ExG (EEG (Electroencephalogram), ECG) processing, targeting a large subset of biomedical applications. A miniaturized system in [9] demonstrates a limit on radio transmissions indicating alerts and alarms instead of continuous streaming of raw data while monitoring critical physiological signals in patients. It provided a significant increase in the power efficiency of the overall system due to a decrease in radio utilization, emphasizing the need to maximize local processing. Generic hardware accelerators discussed in [8,10,11], target common signal processing techniques predominantly used in biomedical applications. This ensures the use of accelerators in multiple varied algorithms and provides an energy-efficient solution with the flexibility to map a large set of biomedical applications. However, multiple accelerators can increase the overall gate count of the system. The potential performance benefits of a low gate count architecture are discussed in [12] and an analysis of the effect of gate count and critical path of architecture in the sub-threshold region on power consumption is highlighted in [13].
It is possible to combine the performance and efficiency benefits of hardware with the flexibility of software by using reconfigurable systems where the hardware (re) configures itself to perform multiple functions of an application in a time-multiplexed manner. The regularity of certain functions used in biomedical signal processing algorithms is exploited in this work which results in reduced overhead and consequently, a lightweight architecture is obtained. A novel reconfigurable architecture based on shift-accumulate operation is configurable for frequently processed algorithms. The architecture has a configurable datapath and supports a range of biomedical signal processing algorithms and is verified on a field programmable gate array (FPGA) platform.
The developed architecture is implemented on an FPGA as a proof-of-concept and can be realized as an ASIC acting as an accelerator for compute-intensive operations alongside an embedded processor. The compute-intensive operations, such as CORDIC, DCT, etc., are often seen in biomedical signal processing algorithms and take prolonged time to process such operations on embedded processors. A hardware design with dataflow-based execution offers processing in reduced number of clock cycles as compared to an embedded processor. Additionally, the developed architecture is configurable and thus can span a large set of biomedical algorithms imitating (to some extent) the complete programmability advantage of embedded processors.
The types of biomedical functions addressed are presented in Sect. 2. The overview of the proposed architecture is discussed in Sect. 3 followed by different architectural topologies in Sect. 4. The mapping methodologies for target functions are discussed in Sect. 5. The results for different mapping cases on the hardware are presented in Sect. 6. Conclusion and future work are discussed in Sect. 7.  [14]. Additionally, the DWT and DCT algorithms are reported to be used for the signal compression purpose to reduce the transmission share of the radio [6]. Finally, complex trigonometric function computations are reported alongside the fundamental operations of multiplication, addition, scaling, division, etc. [15,16]. Clearly, it can be observed from Table 1 that FIR or IIR, CORDIC, DWT, DCT, differentiation and moving average form the dominant operations encountered in biomedical signal processing applications. As a consequence, these operations constitute the target domain functions for any signal processing hardware platform. In the next section of this paper we show that by manipulating a simple  [16] IIR, Finite Impulse Response (FIR), d dx , ( ⋅ ) 2 , moving average QRS Detection [18] FIR, d dx , ( ⋅ ) 2 , weighted moving average QRS Detection [19] Running slope, multiplication, scaling, average, maxima ECG fiducial points [20] Discrete Wavelet Transform (DWT), local maxima modulus EEG Epileptic Seizure onset [21] FIR Seizure Detection [22] DWT Automatic recognition of alertness [23] DWT PPG Pulse Oximetry [24] FIR, log Blood Saturation [14] Fast Fourier Transform (FFT), Discrete Cosine Transform (DCT) Heart Sound Auscultation Aid [15] FFT, tan , sin mathematical function that forms the basis of these signal processing operations, it is possible to exploit the serial nature of the computation to derive an efficient and reconfigurable circuit.

Functionality profile of biomedical applications
3 Shift-accumulate (SAC) architecture

Introduction
As discussed previously in this paper, the shift-accumulate (SAC) architecture is central to potential improvements in the hardware. This section describes the details of the SAC architecture, starting with the mathematical underpinning of the approach.

Mathematical foundation
In order to establish the mathematical foundation of the SAC approach, an example of a FIR filter can be examined.
In an N-tap FIR filter, the output y[n] with inputs x[n] and coefficients b i is given by: When the inputs and coefficients are represented as 8-bit binary numbers Eq. 1 can be rewritten as: (where coefficients ( b i ) are expressed in terms of bit value ( b ik ) and bit weights ( 2 k )) Rearranging Eq. 2 results in Eq. 3.
The foundational mathematical equations of the SAC architecture is presented in the paper. The pseudocode of the proposed scheme is shown in Algorithm 1.
The resulting mathematical equation is obtained wherein the partial products can be generated, For i = 0 to 7 Multiply coefficient and data serially using bit-wise and operation Add partial products If Data Valid: Go to Process End Log Output Result accumulated sequentially and the common weight partial products then accumulated, showing clearly how the SAC approach can be implemented in this case. The resulting reduction in hardware elements can be clearly seen when the implementation is examined more closely. This function can be realized using a 9-b adder tree containing N adders. The final accumulation uses one 16-b SA as opposed to N shift-accumulators while implementing hardware using Eq. 2. Eq. 3 implementation uses 143N+240 gates with 16 flip-flops for its realization and provides 13.43× and 3.47× saving, respectively, over Eqs. 1 and 2, as indicated in Table 2.

Architecture and operation
The proposed architecture, shown in Fig. 1, is a Shift-Accumulate (SAC) architecture, consisting of Register Units (RU), a Computation Unit (CU) and a Control Unit. The architecture is arranged as a 6 × 6 array of functional units called Register Units (RUs) based on Eq. 3. The partial products generated in RUs are fed to the Computation Unit (CU) and the output is obtained after partial products accumulation and shift adjustments. In addition to controlling the function execution, the control unit generates necessary synchronizing signals for RUs and CU.

Register unit (RU)
The Register Unit consists of registers configured as an 8 × 1 AND matrix and 8 × 1 XOR matrix (Fig. 2) to generate partial products. The x[n] input and b i coefficients are loaded into data register (D 7∶0 ) and coefficient (C 7∶0 ) register, respectively. These registers hold the multiplicand and multiplier for the multiplication. The coefficient is further loaded to a shifter block where the multiplier is rotated left by a bit at every clock cycle. The MSB of the shifter is fed to the AND matrix and D 7∶0 follows as the second input. The unsigned partial product thus generated is further converted to a signed partial product by XORing it with the sign bit. A 9-bit 2:1 multiplexer makes the selection between registered and unregistered incoming data. The unregistered data is used in case of functions with feedback where the additional cycles incurred during processing have to be considered for synchronization. Following this, the input data is forwarded to the next RU after eight clock cycles.

Computation unit
As shown in Fig. 1, the Computation Unit (CU) is divided into four identical subsections where each subsection acts on nine partial products. It consists of an adder tree, a shift-accumulator and an output register. The adder tree has ripple carry adders that adds nine partial products each of 9 bits. The 23-bit output register is loaded with the final sum once the serially computed multiplication concludes.

Operation
The coefficient register is loaded with the multiplier in one cycle followed by multiplicand loading in the data register and multiplier forwarded to shifter block in the subsequent cycle (Cycle #1). The multiplicand is ANDed with 7 th bit or MSB of multiplier generating the unsigned partial product. In the next clock (Cycle #2), the shifter block would have rotated the multiplier left by one place and 6 th bit becomes MSB now as illustrated in Fig. 3. The signed partial products for 6 th bit are generated and are added in adder tree. In such a case, the shift-accumulator block has a 7 th bit partial product sum (PP 7 ) that is shifted left by one place (multiplied by 2 1 ). The 6 th partial product sum (PP 6 ) is added with the shifted PP 7 in the 21-b adder. The shift-accumulator register is now loaded with PP 7 ⋅ 2 1 +PP 6 . In the following clock (Cycle #3) *(due to another left shift) PP 7 is multiplied by 2 2 and PP 6 is multiplied by 2 1 . This is added with 5 th bit partial product sum that results in PP 7 ⋅ 2 2 +PP 6 ⋅ 2 1 +PP 5 . This continues till the shift-accumulator register holds the final output of PP 7 ⋅ 2 7 +PP 6 ⋅ 2 6

Configurable datapath and control word
The proposed Shift-Accumulate (SAC) architecture supports functions that can be expressed as multiply-accumulate operation using a control word. It should be noted that certain functions may not require all the Register Units (RU) and unused RUs can be isolated by means of configurable datapath.
The control word is 54 bit wide and has Code, Config, Feedback and Resolution fields as shown in Fig. 4. The 5-bit Code field is used to identify the target function. The 31-bit Config field denotes the select lines (marked CMx and BPx in Fig. 4) of the multiplexers placed in the datapath. Two sets of multiplexers, identified as the Configuration and Bypass, respectively, are used in the datapath and allow different connection topologies between RUs. Additional configuration multiplexers are placed between the tiles to The least significant thirteen bits in the control word, labelled FBx and fbx, represent the select lines of feedback multiplexers (marked FBx_mux in Fig. 4) and constitute the feedback field of control word. Bits 0-4 of the control word are the select lines (marked fb in Fig. 1b) of the multiplexer inside RU that forwards registered or unregistered multiplicand. These bits are dedicated for RUs that receives feedback data, i.e. RU #1,4,19,22 and 34.
The leading four bits of the control word (Resolution field) provide the resolution of output. This is important in functions having feedback loop where the 23-b output data is limited to 9-b (width of RU data register). The resolution field indicates the relevant output bits required to be saved for further processing. The remaining bits of the control word are reserved for future modifications.

Coefficient multiplexers
While mapping, different set of fixed coefficients are applied on RUs. For example, DCT and DWT forces their respective coefficients on RU# 19-30 as discussed in Sect. 5. The RUs where multiple coefficients are forced including the fixed coefficients are indicated in Fig. 5. For example, the DCT line is selected while mapping the DCT and is connected to DCT multiplexer that provides DCT transform matrix coefficients. The transform matrix has seven distinct coefficients that appear in either positive or negative form. Wavelet transform coefficients can be applied as external coefficients as the hardware supports multiple wavelets. A detailed discussion on coefficient multiplexers is provided in [25] and [26].

Architecture topologies
The configurable datapath of the SAC architecture exhibits various topologies as following:

Systolic array structure
The RUs can be arranged in a continuous chain of nine, eighteen, twenty-seven or thirty-six RUs as shown in Fig. 4. The most straight forward structure of thirty-six RUs forming a chain is formed with RUs connected one after the other. In such a case, the bypass multiplexer selects are set and configuration multiplexers pass the previous RU output. For eighteen RU, BP19 forced to '0' . A chain of nine RU is obtained by selecting tile# 1 as it consists of RU #1,2,3,10,11,12,13,14,15 by forcing '0' and '1' on CM10B and CM13, respectively. Additionally, the select line of all bypass multiplexers is set to '0' except for BP10 and BP13. Similarly, a twenty-seven RU chain is formed by combining the eighteen RU chain and tile #3 and bypassing tile #4.

Four tile structure
The four tile structure is obtained by dividing the RUs into four groups of nine RU blocks. This configuration enables parallel execution of functions as each tile forwards its partial products to their respective CU subsection that computes the results irrespective of the ongoing computation in other tile or CU subsection. The configuration multiplexers CMxB select line are forced to '0' thereby passing forward the data from the RU above it. For instance, Dataout #3, 9, 21 and 27 are fed to RU #10, 16, 28 and 34, respectively. The bypass multiplexers are set when all the four tiles are in use and a tile is isolated from the datapath by clearing the bypass multiplexer of RU leading the tile. Tiles can be connected in a chain structure that is beneficial for function chain realization where each function can be mapped on a separate tile. This can be achieved using the feedback multiplexers. Feedback multiplexers are used to direct the output (or combined output) of tile (or tiles) to input of other tiles. For instance, the four tile chain can be realized by connecting Tile #1 -> Tile #2 -> Tile # 3 -> Tile #4 by setting multiplexers FB4_mux, FB19_mux and FB22_mux to "00", "01" and "10", respectively. Following this topology, multiple functions are executed simultaneously while reducing the reconfiguration overhead.
Additionally, the architecture also supports mixed topologies. For instance, one of the possible combinations of systolic tile and array structures can result in datapath from Tile1 -> Tile2 -> RU #19-36. These topologies can be obtained by setting the necessary CMx and BPx bits in the control word.

Mapping of different functions
The configurability of the developed SAC architecture can be leveraged to realize multiple biomedical signal processing operations on it, because the majority of the functions include multiplication and addition. Mapping methodologies for these commonly used functions are presented in this section. The target functions, also optimized for the proposed hardware, are classified broadly into two categories: Variable and Fixed coefficient functions. The coefficients of variable functions depend on the characteristic parameters of the function. For example, bandwidth, gain and number of taps determine the coefficients of FIR filter. In contrast, the fixed coefficient functions have constant coefficients and they undergo fixed steps resulting in constant coefficients. Therefore, it is redundant for the user to explicitly provide constant coefficients to the hardware and is eliminated by means of simple state machines. The architecture contains state machines for all the target functions. Based on the control word, one of the state machines becomes active and controls the function execution.

Variable coefficient functions
The multiplication, squaring, addition and multiplyaccumulate (MAC) functions are based on the common equation: ∑ 11 i=0 a i ⋅ b i . A special case of MAC, where a i = 1, represents the addition operation. For multiplication a i (and/or b i ) is 0 for i>0 and for squaring, a i = b i is done in the multiplication. The 12 RUs, marked #1, 4, 7, 10, 13, 16, 19, 21, 24, 27, 30 and 33, receive external data and computation takes eight clock cycles. It is possible to map this operation by choosing a different set of 12 RUS anywhere in the architecture but with a higher latency.
The mathematical equation for FIR filtering, differentiation and moving average can be readily expressed as a MAC operation. However, the 2-D convolution includes the product of data and coefficient matrices as The convolution consist of multiply-accumulate with spatially shifted input samples. The spatial shifts in the input matrix can be handled by careful interpretation of input rows in the memory [12]. By doing so, the 2-D convolution yields to 1-D convolution operation of length N 1 × N 2 . The generalized equation of FIR filtering, Differentiation, Moving Average and 2-D convolution thus becomes This output is based on the previous inputs and require memory elements. The architecture supports thirty-six taps for these set of functions. The external data (x[n]) is supplied on RU #1 and RUs can be connected in chain structure using the configuration and bypass multiplexers. The 2-D convolution requires input ports equal to the number of rows in the transform mask. The mask size of up to (6× 6) can be mapped on the proposed hardware.

Fixed coefficient functions
Functions grouped in this category perform computations using a state machine. The state machine transforms the algorithm into a series of sequential steps. The fixed coefficients are stored in register files.

COrdinate Rotation DIgital Computer (CORDIC) [27]
It is derived from the general rotation transform and provides an iterative method of performing vector rotations by arbitrary angles using only shift and add operations. Generalized version of the algorithm is given by Eq. 4 [28] and includes hyperbolic and nonlinear functions, logarithm, square root computations in addition to the standard trigonometric functions.
The top level hardware architecture for the CORDIC is shown in Fig. 6a. The 'X and Y Computing Block' computes the x and y variables, whereas the 'Angle Accumulator Computing Block' computes the z variable. The computation uses an adder or a subtractor and result from the previous iteration. The right-shift operation ( >> ) can be viewed as multiplication with 2 −i . The SAC architecture performs multiplication where the 9-b coefficient register is loaded with the content 2 −i in the i th iteration. The data register is loaded with the x, y or z variable (or value) and the subsequent additions are carried out in the CU. [25] provides the seed value for various functions along with m and .
Tile #1 and tile #3 are used for CORDIC mapping (Fig. 6b). The RUs are not connected with each other in both the tiles except for RU #19-21-⟩28-30 forming a 6 RU chain. The seed values x 0 , y 0 , z 0 are fed into RU #1, 10 and 13 from the three read ports of circular memory (Fig. 6c). The y variable is computed first followed by z and x throughout the iterations. As y or z needs to undergo direction decision logic for next iteration, this process is overlapped with x computation so that no clock cycles are wasted. y 1 is computed using y 0 and x 0 from RU #1 and RU #13, respectively, and the result stored in RU #19. At the same time, all the RUs pushes forward its data register content to the next RU in the chain. The arctan data is forced on RU #10 through internal multiplexer (CM10_int). Variable z 1 is computed using RU #11 (z 0 ) and RU #10 (arctan) and z 1 is stored in RU #19, whereas y 1 shifts to RU #20. Using RU #15 (x 0 ) and 3 (y 0 ), variable x 1 is computed and stored in RU #19. At this stage, y 1 , z 1 and x 1 are in RU #21, RU #20 and RU #19. y 2 is computed using RU #21 and RU #19, z 2 using RU #21 and RU #10 and x 2 using RU #21 and RU #29. It should be noted that, previous value is present in RU #21 for x , y and z. The second variable of the equation (x or y) lies in RU #19, RU #10 and RU #29. Same sequence is followed with a state machine that follows a defined pattern. The algorithmic state machine shown in Fig. 7 consists of six states and depicts the RU coefficients loaded in each state during CORDIC computation. The coefficients for RUs used in CORDIC algorithm mapping are applied through the CORDIC input of coefficient multiplexers. A CORDIC multiplexer shown in Fig. 5 is connected to the coefficient multiplexer CORDIC input of RU #19 and RU #29. The multiplexer has 2 i terms in binary as inputs and its select line is controlled by the state machine and takes 24 cycles for each iteration. It stores results internally in RUs and does not use additional memory.

The 2-D Discrete Wavelet Transform
2-D DWT translates to two consecutive matrix multiplication between the input and DWT mask coefficients. The results of matrix multiplication can be directly realized as multiply-accumulate operation and thus can be supported by the SAC architecture. The DWT decomposes a signal into four frequency sub-bands, namely LL (approximation (cA) matrix), LH (horizontal (cH) matrix), HL (vertical (cV) matrix) and HH (diagonal (cD) matrix). A generic mapping methodology is discussed that supports twenty-eight wavelets with filter size <8 [26]. The DWT algorithm first computes low pass filtering followed by down sampling by a factor of 2. The resultant vector is convolved with high pass filter and the LL DWT matrix is retained (with 75% compression) after another down sampling by a factor of 2. Additionally, periodic padding [29] is done on either side of the input samples and intermediate results to ensure spatial continuity on boundaries. The DWT computation steps are shown in Fig. 8a wherein the row wise convolution is carried out on a 14× 8 padded image and an intermediate matrix, Y is generated. Columns 0-3 of padded image are same as columns 5-8 of the actual image. Thus, the convolution filter mask is modified by swapping the first four terms with the last four terms. This eliminates the use of excess memory elements required to store the padding. Furthermore, the mask advances by two steps as opposed to one step in the conventional convolution performing down sampling along with the intermediate matrix generation. The second set of convolution is performed on transposed padded Y matrix.
The proposed architecture supports 8 × 8 block 2-D DWT operation with the mapping scheme are indicated in Fig. 8b. The five states of the DWT state machine are shown in Fig. 9a  Similarly, the rest of the seven elements of first row of Y matrix are generated in 64 cycles. RU #19-RU #26 holds the first row which requires padding on either sides. This is done by means of convolution mask rearrangement. The convolution filter is rearranged by moving the first four coefficients towards the end. By doing this, the padded elements which are at the bottom of unpadded image gets multiplied with the respective coefficients yielding the same result as padded image with the unmodified filter mask (Fig. 9b). The mask manipulation enables computing cA elements without external padding on the intermediate matrix which further permits storing the intermediate matrix. The circular memory write pointer is used to write 8 × 8 (= 64) input data which is stored column-wise in different sections of the memory in an interleaved manner (Fig. 9c) for simplifying the reading operation. It also ensures moving the convolution mask by 2 steps, a scheme adopted in the mapping to eliminate down sampling later. Consequently, multiple read ports ensure reading all columns simultaneously resulting in computation of Y matrix to be computed in eight clock cycles.

The 2-D discrete cosine transform (DCT)
Matrix computation (of an image) requires two matrix multiplication operations : DCT = DCT mat * Image * (DCT mat ) T , where transform matrix (DCT mat ) becomes constant for a fixed mask size. Matrix multiplication can be realized on the proposed architecture by loading row and column elements of the two matrices in place of coefficients and data registers, respectively. The architecture supports 8 × 8 block wise DCT and provides a compression of ≈ 84% (10 of 64 output samples are retained). The DCT mapping datapath is similar to the DWT mapping methodology shown in Fig. 8b. The first matrix multiplication or the intermediate matrix Int mat = DCT mat * Image, row wise image data and first row of DCTmat are forced on RU #1, RU #4, RU #7, RU #10, RU #13, RU #16, RU #31 and RU #34 as data and coefficient, respectively. The results form the first row of Int mat that is periodically stored from RU #19-RU #26 and has to be further multiplied with DCT mat T . The row coefficients of the transform matrix are forced on coefficient registers of RU #19-RU #26 and the resultant DCT(0,0) is obtained. On forcing subsequent DCT mat rows in the coefficient of RU #19-RU #30 chain, the remaining elements of the first row of DCT are obtained. It should be noted that, Int mat stored in RU #19-RU #26 shifts down the RU chain every 8 clock cycles. This is incorporated by the state machine (Fig. 10) that consists of 5 states traversed four times, fully or partially to obtain Int mat and DCT in successive computations. The state sequence differs with the iteration#, because the number of elements computed in the DCT matrix decreases by one with every row resulting in an upper triangular matrix. The Int mat matrix is computed in state #1 and DCT matrix is computed in states #2-5. The image data is stored column wise in an interleaved manner in circular memory similar to data storage in DWT mapping discussed earlier.

Results and discussion
To demonstrate the feasibility of the proposed architecture, the design was targeted onto a Virtex-II FPGA board (XUPV2P). The FPGA development board speeds up the verification process by providing suitable interfaces in VGA and RS-232 ports among many other peripheral interfaces. The proposed architecture uses the sign-magnitude convention to represent 9-b signed numbers. The code, config fields and topologies for targeted functions are presented in Table 3, where the code acts as target function identifier and controls the data read write operations of circular memory. The feedback gives rise to erroneous outputs as fixed-point binary numbers are used. This is addressed by making the resolution configurable that detects the correct output slice fed back to the RUs.

Hardware configuration
The architecture is configured by loading the control word, external coefficients and data in their respective memories. These values are provided to the architecture through a 9-b input bus in a time multiplexed manner. The 54-b control word is applied first serially in 9-b slices. The external coefficients are loaded next and comprise of 36 9-b binary numbers. Following this, input bus carries the data designated for the 9-b wide 64 position deep data memory. The input bus is controlled by mutually exclusive control signals that are used to demarcate the control word, coefficients and data.
The system level block diagram of the proposed architecture is shown in Fig. 11. The architecture has multiple memory modules for configuration, control word, coefficient, data and output. The configuration memory is initialized using a coefficient file (.coe) that can be generated using MATLAB for large data sets. Data memory has an adaptable structure to support varied requirements of the target functions. A UART serial port (bit rate set at 115200 bps) is interfaced with the hardware to transfer the output from FPGA to the PC for post processing.

Pan-Tompkins algorithm signal chain
To verify the usefulness and functionality of the proposed hardware, we have targeted a popular algorithm known as the Pan-Tompkins algorithm (PTA). This is a widely used algorithm for QRS detection in Electrocardiogram (ECG) signals that detect QRS complexes with accuracy up to 99.3% [16]. The ECG signal from the MIT-BIH arrhythmia database is given as input to SAC architecture after converting it to .coe format. The algorithm is based on analysing the amplitude, slope and width of QRS complexes and is modelled with steps shown in Fig. 12. The transfer function of the steps is shown in Table 4. Careful observation of the transfer functions of QRS signal chain indicates that LPF and HPF (collectively BPF), differentiation, squaring and MA functions can be realized by a series of shift-accumulate operations resulting in multiplier-less implementation of these functions. The control word and datapath for PTA signal chain is tabulated in Table 4. The proposed architecture is first configured as a LPF and the result is computed in 8 clock cycles. Following this, the architecture is configured to perform HPF. It should be noted that the HPF operates upon the LPF output. Therefore, the LPF output is loaded into the circular memory which further acts as the input data to the proposed architecture. The HPF output computation takes 8 clock cycles. The architecture is further configured to emulate differentiation, squaring and MA functions. The control word and the coefficients of these functions are stored in the configuration memory that is accessed using a state machine after each operation.
The LPF block has a window of 13 inputs with feedback of 2 output samples. RU #1 accepts the input data from Dataext1 port and a chain of 13 RUs (RU #1 -#13) is formed. The configuration multiplexers are configured to forward the data from previous RU and bypass multiplexers are set. The output samples are loaded into RU #34 and #35 through the feed1234 dataline. Control word bit 0 (fb34) is set as RU #34 carries feedback data. The remaining RUs are disconnected from the datapath using bypass multiplexers. The computed hardware result for each block is shown in Fig. 13. The peaks obtained in the moving average output represent the QRS complex that can be detected further using a peak detector circuit. It can be seen that minor peaks are flattened in hardware results due to limited resolution in feedback. The QRS peaks  detected on the hardware processed data coincides with the analytically detected peaks. The architecture takes 485 clock cycles/sample (= 10.34 s or throughput 96.7e3 samples/second at 46.9 MHz operating frequency), including configuration, data loading and processing, while emulating the PTA signal chain. Biomedical signal processing applications pose moderate processing requirements on the computing platform primarily due to low data rates (up to 320 Kb/s) [30] of physiological signals. Additionally, according to a study in [31], the operating frequency of the processing platform are limited to 10's of kHz and (8-12)-bit processing for biomedical devices. This leaves sufficient clock cycles to tolerate the latency of the serial SAC architecture of the ECG application mentioned above that has been demonstrated to take 10.34 s to process a single data sample. Also, SAC architecture supports 8-b signed arithmetic that has been adopted in various matured applications in the domain [31]. However, there is scope for incorporating the applications requiring increased precision by increasing the number of bits in the data/coefficient registers and scaling up the adder units accordingly. It should be noted that, we have not considered Electromagnetic Interference of ECG signals in our setup. The work is primarily focused on developing a configurable lightweight processing platform for biomedical signal processing algorithms. The EMI and other artefacts are assumed to be taken care of by the acquisition system which is beyond the scope of this work. We used the ECG signal from the MIT-BIH arrhythmia database is given as input to SAC architecture. The ECG signal is collected from the patients in normal clinical setting and are not filtered. The type of noises in the database is not explicitly mentioned. Table 5 notes the range and resolution supported by the hardware for CORDIC. The results obtained from the hardware agrees well with the ideal (MATLAB) results for functions such as sin(cos); sinh(cosh); e x and ln(x) and are shown over their respective range shown in Fig. 14a-d.

CORDIC, DCT and DWT results
The architecture computes cosine with accuracy ranging from 83 to 98% and the resolution is 2 −1 (= 1 bit) when the decimal point is between 1 st and 0 th bit position. Thus, the smallest measurable degree is 0.5 • while the remaining 7 bits denotes range of the data. The test setup for the DCT is shown in Fig. 15a. It uses a 128×128 image as input. The 8 × 8 sub-images undergo the DCT operation on the proposed architecture and results are processed in MATLAB which reconstructs the image by performing offline block wise IDCT. The reconstructed image obtained from the hardware and MATLAB are shown in Fig. 15b and c, respectively. The L2 norm is reported as 15.77 and 15.75 for hardware and MATLAB reconstructed images, respectively, with respect to the actual image.
To demonstrate the feasibility of the proposed approach, the 2-D DWT with two standard test images is taken. A 16× 16 checker board image with Haar wavelet and a 128×128 Lena image with bior2.2 wavelet is computed on the hardware. The mean square error between the actual and hardware reconstructed image for the Lena intensity image is 2.2m. The actual and reconstructed Lena images both from MATLAB simulation and hardware implementation are shown in Fig. 16. The image metrics are reported in Table 6. The MSE for reconstructed images

Clock profiling
The clock profile can be broadly divided into configuration and computation phases. The phase wise clock profile of the architecture along with the throughput of target functions is provided in Table 7. Configuration phase includes control word, coefficient and data loading through the 9-bit input bus. The 54-bit control word is loaded serially and takes 6 clock cycles.

Coefficient Load Latency
The coefficient load latency depends on the nature/amount of processing involved in computing the function. The FIR computation, addition and MAC require coefficients equal to the number of filter taps/terms involved in computation. Therefore, the coefficient load latency is indicated as 9-36 and 3-12 clock cycles, respectively. In case of DWT, eight filter coefficients are loaded in 8 clock cycles. The CORDIC and DCT coefficients (8× 8 mask size) are fixed and stored within the architecture, thus are applied through their respective state machines in 1 clock cycle. The 2-D convolution latency is indicated as 36 clock cycles for the 6 × 6 mask, i.e 36 coefficients.

Data Load Latency
The data load latency depends on the input data requirement of the target functions. The data is loaded into the architecture from the circular memory and take 1 clock cycle for each load operation. The addition and MAC require data equal to the number of terms in the expression and thus takes 3-12 clock cycles. The entire circular memory is loaded in case of sliding window functions (FIR, MA, differentiation, 2-D Convolution) resulting in a latency of 64 clock cycles. For CORDIC computation, the three variables require seed values to initiate the computation and thus the data load latency is 3 clock cycles. DCT and DWT support operation on 8 × 8 sub-blocks and thus takes 64 clock cycles while loading 64 data.
Compute Latency Target functions include functions readily expressed as multiply-accumulate (FIR, Convolution, Multiply, Addition, Multiply-accumulate, Moving Average) and functions that are interpreted as multiply-accumulate after data/algorithm manipulations by means of state machine (CORDIC, DCT, DWT). The compute latency for the former set of functions is 8 clock cycles. In the later set of functions, CORDIC takes 24 cycles, DCT takes 336 cycles and DWT takes 476 cycles for computations following the state machines shown in Figs. 6d, 10 and 9a. The CORDIC computation

Comparison with similar works
The architecture developed in this work acts as a multiple hardware accelerator because of its configurable datapath and adopts effective mapping methodologies translating to gate and clock savings. The comparison of the proposed architecture with the state of the art is provided in Table 8. The work described in [8] advocates the use of accelerators dedicated for commonly used signal processing operations like FIR, CORDIC, moving average integration, etc.
The biomedical signal processor in [8] makes use of the aforementioned functions and the gate count reported for them is 11k, 9.3k and 37k, respectively, with operating frequency of 1MHz for individual execution of these functions. [1] implements ECG signal processing algorithm on a custom microprocessor and reports 36k gates at an operating frequency of 600kHz.
A similar system operating at 1MHz in [32] reports a gate count of 195k gates The proposed work reports a gate count of 24280 gates providing ≈1.5× to 8 × gate (and area) savings with respect to [1] and [32], respectively.
The latency comparison indicates that accelerators in [8] take 32-88 cycles for computing various target functions, whereas the SAC architecture takes 16-574 cycles (0.34-12 ns 46.9M Hz). In both cases, the latency includes the cycles taken in supplying data, transferring output and configuring the architecture. The increased latency in this work is primarily due to the serial nature of the proposed architecture; however, cycle optimizing techniques focused on restricting the data movements are adopted while developing the mapping methodologies. In general, biomedical signal processing applications expect moderate processing rates from the processing platforms due to the nature of biomedical signals (data rates up to 320Kb/s [30]) as well as the data manipulation techniques. This ensures that the upper bound of operating frequency of SAC architecture was sufficient to support a variety of biomedical algorithm processing.

Conclusion
A shift-accumulate-based architecture is developed with configurable datapath that supports dominant DSP functions for biomedical signal processing. The architecture exhibits different topologies for efficient realization of functions. Efficient mapping methodologies are developed for FIR, CORDIC, DWT, DCT, etc., by exploiting algorithm regularities that ensure cycle efficient calculations by eliminating redundant computations and limiting memory requirements. Additionally, SAC architecture contains a configurable datapath that can be leveraged to support algorithm advancements. Additionally, the architecture supports modularity, i.e. multiple SAC architecture units can be connected to exercise various topologies and functionalities.
The architecture is targeted on a state-of-the-art FPGA and demonstrates multiple functions like PTA signal chain, CORDIC, DWT, DCT, etc. The maximum operating frequency supported by the architecture is 46.9MHz. The architecture computes cosine with accuracy ranging from 83 to 98% and a mean square error of ≈0.3 is reported for Lena image as compared to the uncompressed image. The architecture takes 16-574 cycles for target functions. In addition, the gate count of the architecture is estimated as 24280 gates and offers 8 × area advantage making it suitable to be deployed in biomedical applications. The FFT mapping scheme on the architecture is planned as future work. Additional benefits of further improvements in power consumption utilizing dedicated power domains have been proposed in [32,33] yielding energy advantages. The applicability of multiple power domain to the proposed architecture is left as future work. Furthermore, the proposed hardware can benefit from energy efficiency ensured by approximate computing and is also left as future work.

Conflict of interest The authors declare that they have no conflict of interest
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons. org/licenses/by/4.0/.