1 Introduction

Visual object tracking has gained significant research interest in recent years. The purpose of visual tracking is to identify the updated location of the target object in an incoming video sequence, given an initial target location in one frame. It finds application in several exciting scenarios, including, but not limited to, smart video surveillance, robotics, and automation. Real-time object tracking is a challenging task whose performance is influenced by various factors, including camera motion, background variations of the scene, and complex motion of the object. To deal with these challenges, sophisticated algorithms with an optimal set of parameters are required to achieve a good degree of accuracy. Moreover, the use of high-resolution cameras further increases the computations required for successful tracking.

At present, visual tracking is mostly performed on software-based platforms, including PCs and embedded processors. However, the frame rate of software systems is often insufficient for real-time, mission-critical applications such as tracking for accident prevention, security and defence. Moreover, scale variations and the requirements of multi-target tracking limit the use of a serial approach for data-centric applications. Therefore, significant improvement is required at both the algorithmic and implementation levels to build real-time, stand-alone object tracking devices for mission-critical systems. The field programmable gate array (FPGA) is a kind of hardware inherently suited for such applications, thanks to its parallel processing structure, large data throughput interfaces, and integration capability.

Existing works on object tracking are based on either discriminative [1,2,3] or generative [4, 5] approaches. Discriminative approaches use machine learning methods to learn a filter, which is later used to estimate the target location. Generative approaches build a statistical model of the target. Studies have shown that discriminative approaches achieve better performance and require less computation [6,7,8]. An emerging direction is multi-aspect detection, which considers both the size and the location of the target. In this context, two techniques have been proposed. The first, named joint scale space tracking, utilizes a 3D correlation filter. The second, named multi-resolution tracking, utilizes a 2D filter at multiple resolutions, creating a 3D pyramid for detection. Both approaches are computationally intensive and not suitable for efficient hardware implementation. Recently, in [9], the authors proposed a technique named discriminative scale space tracking (DSST), which demonstrates good performance with reasonable complexity. The DSST algorithm achieves better performance by using separate filters for translation and scale estimation [9, 10]. First, the change in the target location is estimated using a translation filter. Next, the updated location is fed to the scale filter to estimate the target size. Finally, both filters are updated for the current image frame. The method continues iteratively over the video sequence. The high degree of parallelism of the DSST algorithm makes it a suitable candidate for hardware implementation. The major mathematical operations involved in DSST are singular value decomposition (SVD), QR factorization, two-dimensional discrete Fourier transform (DFT2), and the histogram of gradients (HOG). The filters are applied to an image by performing pointwise multiplication with all the pixels; this process, called windowing, is the most resource-critical minor operation.

The majority of existing works are based on software implementations of these mathematical operators, mainly focusing on performance rather than on hardware resources for higher dimensions. Therefore, there is a substantial need for hardware implementations of these operations targeting a complete visual tracking system on a stand-alone device. A survey on hardware implementations of visual object trackers is provided in [11]. This work deals with the FPGA implementation of the major blocks of a real-time DSST algorithm. We propose suitable implementation strategies for the core operations of the DSST algorithm. The implementation of the DSST is carried out using the Xilinx Vivado HLS 2016.1 tool, with a ZedBoard featuring a Xilinx Zynq xc7z020clg484-1 System on Chip as the target device. The proposed architecture operates at a 100 MHz clock frequency for an image of dimensions \(320 \times 240\) pixels and achieves an estimated mean frame rate of 25.38 fps. The proposed system is fully scalable to higher image dimensions.

The rest of this paper is organized as follows. Section 2 deals with the description of the algorithm and state of the art regarding mathematical operations. Section 3 describes the proposed architecture of these operations. Section 4 deals with the implementation strategies and results. Finally, Sect. 5 concludes the paper.

2 Introduction to discriminative scale space tracking (DSST) algorithm

This section deals with the details of the DSST algorithm [9] and discusses state-of-the-art implementations of the mathematical operations involved. Algorithm 1 demonstrates the main computation steps of fast DSST (fDSST). The algorithm receives an image and the initial target location as input. An image patch f centered around the initial target location is extracted from the image I using a HOG extractor. These image features are utilized for learning the translation discriminative correlation filter (DCF). As the domain dimension of f is arbitrary, the same formulation can also be used for the scale DCF. \(H^{l}\) refers to the filter that minimizes the correlation error between the extracted image patch and the desired output. The detailed derivation of (1) is provided in [9]. The filter equations are given as

$$\begin{aligned} { H^{l}} = \frac{\overline{G} F^{l}}{\sum _{k=1}^{d} \overline{F^{k}} F^{k} + \lambda }, \quad l = 1,\dots ,d \end{aligned}$$
(1)
$$\begin{aligned} {Y_{t}} =\frac{\sum _{l=1}^{\tilde{d}} \overline{\tilde{A}_{t-1}^{l}} \circ \tilde{Z}_{t}^{l}}{\tilde{B}_{t-1} + \lambda },\quad \forall t \end{aligned}$$
(2)
$$\begin{aligned} {\tilde{A}}_{t}^{l} = \overline{G} \circ {\tilde{U}}_{t}^{l},\quad l = 1,\dots ,{\tilde{d}} \end{aligned}$$
(3)
$$\begin{aligned} \tilde{B}_{t} = (1 - \eta ) \tilde{B}_{t-1} + \eta \sum \nolimits_{k=1}^{{\tilde{d}}} \overline{\tilde{F}_{t}^{k}} \circ \tilde{F}_{t}^{k},\quad \forall t. \end{aligned}$$
(4)

The capital letters denote the Fourier transforms of the corresponding quantities. All quantities are described in Table 1. Equation (2) is the final equation of the correlation filter. As the approach is iterative, the filter is updated for each new frame: the numerator \({\tilde{A}}_{t}^{l}\) in (2) is updated according to (3), while the denominator \({\tilde{B}}_{t}\) in (2) is updated according to (4). The feature dimensions are compressed using standard principal component analysis (PCA) to reduce the size of the DFTs, following the compression suggested in [9] to realize a fast DSST. Mathematically, this is performed by maintaining a template \(u_{t} = (1 - \eta ) u_{t-1} + \eta f_{t}\). The tilde terms are obtained by projecting the quantities onto P, a low-dimensional subspace of the features. The compressed dimensions are obtained using the eigenvalue decomposition of the autocorrelation matrix of \(u_{t}\). As shown in Algorithm 1, the compression step is performed with SVD and QR decomposition. This is the training or learning step. The filter application, or detection, part is implemented by (2), which calculates the correlation scores. \(z_{t}\) is a new sample extracted from the estimated 2D target location using the HOG extractor. By taking the inverse DFT of \(Y_{t}\) and maximizing the correlation scores, the new target estimate is obtained. This completes the translation estimation. The scale estimation is obtained by repeating the above steps for the one-dimensional scale filter, using the updated target location from the translation estimation. Searching for the scale only at the updated location saves a large amount of computation. In summary, the algorithm performs the translation and scale filter estimation and update process described by (1), (2), (3) and (4). As mentioned earlier, the core mathematical operations of the DSST algorithm are DFT2, QR factorization, SVD and HOG extraction. These operations are discussed below.
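As a concrete illustration of the learning-rate update above, the following minimal sketch applies the running-template rule \(u_{t} = (1 - \eta ) u_{t-1} + \eta f_{t}\) element-wise. The flat std::vector layout is an illustrative assumption and not the storage format of the HLS design.

```cpp
#include <cstddef>
#include <vector>

// Element-wise running-template update u_t = (1 - eta) * u_{t-1} + eta * f_t,
// applied before the PCA compression step. Illustrative software model only.
void update_template(std::vector<float> &u, const std::vector<float> &f, float eta) {
    for (std::size_t i = 0; i < u.size() && i < f.size(); ++i)
        u[i] = (1.0f - eta) * u[i] + eta * f[i];
}
```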

2.1 DFT and DFT2

The analysis (forward) and synthesis (inverse) equations of the discrete Fourier transform (DFT) are given, respectively, as

$$\begin{aligned} X_{k} = \sum \nolimits_{n=0}^{N-1} x_{n} e^{-\frac{j2\pi }{N}kn}, \qquad x_{k} = \frac{1}{N} \sum \nolimits_{n=0}^{N-1} X_{n} e^{\frac{j2\pi }{N}kn}, \end{aligned}$$
(5)

where \(W_{N}^{nk} = e^{-\frac{j2\pi }{N}kn}\) is the twiddle factor and \(X_{n}\) and \(x_{n}\) are complex numbers. Using Euler's identity, the twiddle factor can also be expressed as \(W_{N}^{nk} = \cos (\frac{2\pi nk}{N} )-j\cdot \sin (\frac{2\pi nk}{N})\).

Table 1 Symbols in DSST [9]
$$\begin{aligned} X_{k} = \sum \nolimits_{n=0}^{N-1} x_{n} \left[ \cos \left( \frac{2\pi nk}{N} \right) -j \cdot \sin \left( \frac{2\pi nk}{N}\right) \right] \end{aligned}$$
(6)

A straightforward implementation of an N-point DFT has a complexity of O(\(N^{2}\)) operations. The fast Fourier transform (FFT) is a hardware-friendly algorithm which reduces the DFT computations to O(\(N {\mathrm{log}}_{2}N\)), using the well-known decimation-in-time (DIT) and decimation-in-frequency (DIF) techniques [12]. A number of hardware implementations of the FFT exist, including parallel [13] and serial [14] approaches. However, the FFT requires \(N=2^{n}\), which is not guaranteed in this case, so the DFT is implemented directly instead. An implementation of the DFT is provided in [15], which uses the Coordinate Rotation Digital Computer (CORDIC) algorithm to calculate the twiddle factors. Our approach uses precalculated twiddle factors and thus improves performance.
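A minimal sketch of this choice is shown below: the real and imaginary parts of \(W_{N}^{nk}\) for one output index k are tabulated once and then only read during the accumulation. The function name and float precision are illustrative assumptions; in the actual design the tables reside in BRAM.

```cpp
#include <cmath>
#include <vector>

// Precompute the twiddle factors W_N^{nk} = cos(2*pi*n*k/N) - j*sin(2*pi*n*k/N)
// for a fixed output index k. cos_t holds the real parts and sin_t the
// imaginary parts, so no CORDIC evaluation is needed at run time.
void twiddle_row(int N, int k, std::vector<float> &cos_t, std::vector<float> &sin_t) {
    const double PI = 3.14159265358979323846;
    cos_t.resize(N);
    sin_t.resize(N);
    for (int n = 0; n < N; ++n) {
        double angle = 2.0 * PI * double(n) * double(k) / double(N);
        cos_t[n] = static_cast<float>(std::cos(angle));    // Re{W_N^{nk}}
        sin_t[n] = static_cast<float>(-std::sin(angle));   // Im{W_N^{nk}}
    }
}
```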

The DFT2 is the two-dimensional Fourier transform applied to a matrix. To compute the DFT2, the DFT is first applied along the rows of the matrix, and the result is transposed. Then the DFT is applied along the columns. Finally, the result is transposed again and assigned to the output. In most published works, the DFT2 is realized using the two-dimensional fast Fourier transform (FFT2). Instead, this work adopts the approach of [16], based on a modified row–column decomposition.
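The following self-contained reference model sketches the row–column decomposition just described. It uses a naive 1D DFT for readability, whereas the hardware replaces it with the 8-parallel unit of Sect. 3.1, and the data types are illustrative assumptions.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using cvec = std::vector<std::complex<float>>;
using cmat = std::vector<cvec>;

// Naive 1D DFT, used only as a stand-in for the 8-parallel hardware unit.
cvec dft_1d(const cvec &x) {
    const double PI = 3.14159265358979323846;
    cvec X(x.size());
    for (std::size_t k = 0; k < x.size(); ++k) {
        std::complex<double> acc(0.0, 0.0);
        for (std::size_t n = 0; n < x.size(); ++n) {
            double ang = -2.0 * PI * double(k) * double(n) / double(x.size());
            acc += std::complex<double>(x[n]) * std::polar(1.0, ang);
        }
        X[k] = std::complex<float>(acc);
    }
    return X;
}

// Matrix transpose (input assumed non-empty and rectangular).
cmat transpose(const cmat &a) {
    cmat t(a[0].size(), cvec(a.size()));
    for (std::size_t r = 0; r < a.size(); ++r)
        for (std::size_t c = 0; c < a[r].size(); ++c)
            t[c][r] = a[r][c];
    return t;
}

// Row-column decomposition: DFT along rows, transpose, DFT along the
// original columns, transpose back to the original orientation.
cmat dft2(cmat a) {
    for (auto &row : a) row = dft_1d(row);
    a = transpose(a);
    for (auto &row : a) row = dft_1d(row);
    return transpose(a);
}
```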

2.2 QR factorization

QR factorization is the decomposition of a matrix A of dimensions \(m\times n\) into two matrices, i.e. a matrix Q of dimensions \(m\times m\) and a matrix R of dimensions \(m \times n\). Q is an orthogonal matrix, while R is upper triangular. The literature proposes three main approaches for QR factorization: the Gram-Schmidt approach [17], the Householder transformation [18] and the approach based on Givens rotations [19, 20]. This work adopts the third approach, i.e. Givens rotations. The idea is to apply Givens rotations to the elements of the lower triangle of A to turn them to zero. When all lower-triangle elements are zeroed, the matrix R is obtained. Applying the same transformations in parallel to an identity matrix gives the matrix \(Q^{\rm T}\). The Givens rotation matrix is of the form

\(G = \begin{bmatrix} c & s \\ -s & c \\ \end{bmatrix}\), where \(c=\frac{a}{\sqrt{a^{2}+b^{2}}}\), \(s=\frac{b}{\sqrt{a^{2}+b^{2}}}\), a is the first element of the row pair and b is the element below a which has to be turned to zero. A systolic array-based implementation is proposed in [20]. Our approach is similar, but we focus on resource optimization rather than performance. We also employ a row-parallel approach, which improves performance by offering more parallelism.

2.3 SVD

SVD decomposes a matrix A of dimensions \(m \times n\) into three matrices U, S and V of dimensions \(m \times m\), \(m \times n\) and \(n \times n\), respectively. U and V contain the left and right singular vectors of A, while S is a diagonal matrix containing the real singular values. In hardware, the SVD is mostly implemented using the two-sided Jacobi method [21]. The idea is to divide the matrix into small \(2 \times 2\) matrices. Jacobi rotations are applied to the elements of the matrix A from the left and the right, hence the name two-sided Jacobi. These multiplications turn the non-diagonal elements to zero, giving the matrix S. Applying the same transformations to identity matrices gives the matrices U and V. The Jacobi rotation takes the form \(\begin{bmatrix} c & -s \\ s & c \\ \end{bmatrix}\) \(\begin{bmatrix} a & b \\ c & d \\ \end{bmatrix}\) \(\begin{bmatrix} c & s \\ -s & c \\ \end{bmatrix}\) = \(\begin{bmatrix} a1 & 0 \\ 0 & a2 \\ \end{bmatrix}\), where \(c=\cos (\theta )\), \(s=\sin (\theta )\) and \(\theta\) is given by \(\theta = \frac{1}{2} \arctan \frac{c+b}{d-a}\). Fixed-point implementations are given in [21, 22]. Our approach is similar, but we focus on time optimization because this unit operates on small dimensions in the algorithm and is therefore parallelized.

2.4 Histogram of gradients (HOG)

HOG is a feature descriptor used for target recognition. It is constructed with the help of image gradients and extracts the features contained in the image pixels. HOG is a computationally intensive operation, and a number of hardware implementations have been proposed [23,24,25]. The authors of [23] also provide a comparative study of different implementations. The implementation in [24] utilizes the least amount of resources but operates below 100 MHz. In [25], an approach is proposed which achieves a frame rate of 60 fps while requiring a small amount of hardware resources. It avoids the expensive angle calculation by using integer multiplications and inequality comparisons.

As a first step, the magnitude and orientation of the image gradient are computed for each pixel. The gradient magnitude is assigned to a suitable bin among the nine available. This assignment is based on the gradient orientation, computed over (0–180°) for the contrast-insensitive bins and over (0–360°) for the contrast-sensitive ones. The detection window is composed of non-overlapping cells of \(8\times 8\) pixels. Afterwards, an aggregation module sums the contributions of the 64 pixels of each cell into the bins to create the histogram. Finally, the histograms are passed to a normalization phase. The details of the algorithm are given in [25].

3 Proposed architectures

This section deals with the architectural details of the mathematical operations described in the previous section. The flowchart is depicted in Fig. 1. The proposed DSST implementation has four major steps. The translation search, i.e. estimating the 2D position of the target, and the translation filter update involve HOG extraction and the DFT2 of 3D matrices. The scale search and the scale filter update involve HOG extraction and the DFT of a 2D matrix. SVD and QR are involved in the translation filter update and the scale filter update steps, respectively. The DFT unit is the most critical block in terms of performance because it operates on 3D data with \(320 \times 320\) slices. Vivado HLS is used as the base tool for synthesis and simulation.

Fig. 1
figure 1

Flowchart of the DSST algorithm [9]

3.1 Discrete Fourier transform and DFT2

The vector DFT acts as a building block for matrix DFT, i.e. DFT2. The architectures for both are discussed as follows.

3.1.1 DFT

Figure 2 demonstrates the proposed architecture to compute the one-dimensional DFT. In the first clock cycle, with input loop counter i, the samples \({X\_{\mathrm{real}}}[i]\) and \({X\_{\mathrm{imag}}}[i]\) are received along with the twiddle factors cos[i] and sin[i]. Four multipliers and two adders/subtractors are required to complete the complex multiplication, which forms the basic DFT block. Its result goes to the accumulator (at first, the other accumulator input is 0 because the register is in reset). This produces the real part of the complex multiplication, i.e. \({X\_{\mathrm{real}}}[i] \cdot\) cos[i] − \({X\_{\mathrm{imag}}}[i] \cdot\) sin[i]. Similarly, the other accumulator output produces \({X\_{\mathrm{real}}}[i] \cdot\) sin[i] + \({X\_{\mathrm{imag}}}[i] \cdot\) cos[i]. In the following clock cycles, the basic DFT block receives the inputs \({X\_{\mathrm{real}}}[i+1]\) and \({X\_{\mathrm{imag}}}[i+1]\) along with their twiddle factors, performs the complex multiplication, and the results are accumulated. When the last input is processed, j (the output loop counter, which is 0 at first) assigns the first output \(Y\_{\mathrm{real}}[j]\), \(Y\_{\mathrm{imag}}[j]\) via demultiplexers. Finally, j is incremented and the register is reset. This whole process is repeated until all N outputs are produced. The approach is serial, as it consumes one input per cycle and takes N cycles to produce one output.

Fig. 2
figure 2

RTL diagram of DFT 1D unit

Using the basic blocks discussed above, the architecture is now parallelized to support a fast DFT for higher image dimensions. This parallelization is shown in the lower part of Fig. 2. By operating on eight elements in parallel, Eq. (6) can be rewritten as

$$\begin{aligned} \begin{aligned} X_{k}&= \sum _{m=0}^{N/8-1} \Bigg ( x_{n} \left[ \cos \left( \frac{2\pi nk}{N} \right) -j\cdot \sin \left( \frac{2\pi nk}{N}\right) \right] + \dots + \\&\quad x_{n+7} \left[ \cos \left( \frac{2\pi (n+7)k}{N} \right) -j\cdot \sin \left( \frac{2\pi (n+7)k}{N}\right) \right] \Bigg ), \quad n = 8m. \end{aligned} \end{aligned}$$
(7)

Here \(x_{n}\) is complex. Each complex multiplication, i.e. \(C_{n} = x_{n}\cdot \left[ \cos \left(\frac{2\pi nk}{N} \right) -j\cdot \sin \left(\frac{2\pi nk}{N}\right) \right]\), is performed by a basic DFT block. The L1 adders add the outputs of the basic DFT blocks in the pairs \(\left\{ C_{n}, C_{n+1} \right\}\), \(\left\{ C_{n+2}, C_{n+3} \right\}\), \(\left\{ C_{n+4},C_{n+5} \right\}\), \(\left\{ C_{n+6}, C_{n+7} \right\}\), producing four partial sums \(M_{0},\dots ,M_{3}\). The L2 adders then add the pairs \(\left\{ M_{0}, M_{1}\right\}\) and \(\left\{ M_{2}, M_{3}\right\}\), and the L3 adder sums the two L2 outputs to produce the contribution of eight inputs to the current DFT output. The accumulator sums these contributions until all N inputs are processed. As in the serial approach, when the last input is processed the demultiplexers assign the output, j is incremented and the accumulators are reset. Thus, with an adder tree between the basic units and the accumulator, an 8-parallel DFT is performed. All the DFT coefficients are pre-computed and stored in memory. The HLS compiler mostly handles the intermediate computation results; however, in some cases we instantiate and partition BRAMs for intermediate values so that the elements can be processed in parallel. For this unit, the input and output arrays are partitioned into 8 BRAMs, so eight elements can be accessed in parallel. The general DFT has a delay of \(O(N^{2})\). This architecture improves it to \({\mathrm{delay}} = O(\frac{N^{2}}{8}) + {\text {pipeline stages}}\) \(\approx\) \(O(\frac{N^{2}}{8})\). The architecture is implemented for a maximum of \(N=320\). If N < 320, a comparator (not shown in the figure for simplicity) limits the number of passes through the block. To improve performance further, task-level parallelism is used with the help of the Vivado HLS DATAFLOW pragma.
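A behavioural, HLS-style sketch of this 8-parallel accumulation is given below. The function and array names, the MAX_N bound and the exact pragma placement are illustrative assumptions rather than the synthesized source; cos_t and sin_t hold the real and imaginary parts of the precomputed twiddle factors, and the modulo index is written out for clarity where the RTL uses a simple counter.

```cpp
// 8-parallel N-point DFT sketch: eight complex multiplications per iteration,
// an adder tree (modelled by the unrolled inner loop), and an accumulator.
// n is assumed to be a multiple of 8 (the design uses up to N = 320).
#define MAX_N 320

void dft_8par(const float x_real[MAX_N], const float x_imag[MAX_N],
              const float cos_t[MAX_N],  const float sin_t[MAX_N],
              float y_real[MAX_N], float y_imag[MAX_N], int n) {
#pragma HLS ARRAY_PARTITION variable=x_real cyclic factor=8
#pragma HLS ARRAY_PARTITION variable=x_imag cyclic factor=8
#pragma HLS ARRAY_PARTITION variable=cos_t  cyclic factor=8
#pragma HLS ARRAY_PARTITION variable=sin_t  cyclic factor=8
OUT_LOOP:
    for (int k = 0; k < n; ++k) {
        float acc_re = 0.0f, acc_im = 0.0f;
IN_LOOP:
        for (int i = 0; i < n; i += 8) {
#pragma HLS PIPELINE II=1
            float p_re = 0.0f, p_im = 0.0f;
PAR_LOOP:
            for (int j = 0; j < 8; ++j) {    // eight basic DFT blocks + adder tree
#pragma HLS UNROLL
                int idx = ((i + j) * k) % n; // cos_t[m] = Re{W_N^m}, sin_t[m] = Im{W_N^m}
                p_re += x_real[i + j] * cos_t[idx] - x_imag[i + j] * sin_t[idx];
                p_im += x_real[i + j] * sin_t[idx] + x_imag[i + j] * cos_t[idx];
            }
            acc_re += p_re;                  // accumulator
            acc_im += p_im;
        }
        y_real[k] = acc_re;                  // demultiplexed output assignment
        y_imag[k] = acc_im;
    }
}
```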

3.1.2 DFT2

The proposed DFT2 architecture is shown in Fig. 3 and uses the row–column decomposition approach. The real and imaginary parts are kept separate so that their operations run in parallel. The 2D matrix is taken as input. Rows are selected and passed to the DFT calculation blocks first, while the coefficients are precomputed and stored in BRAM blocks. The 8-parallel DFT is used as the basic block for this unit. The output goes to a transpose unit and afterwards to memory. In the next steps, the relevant outputs are fed, along with the coefficients, to the 8-parallel DFT block to compute the column-wise DFT.

Fig. 3
figure 3

RTL diagram of DFT2 unit

The transpose unit has a delay that grows with the matrix size, given by \(O\left( N^{2} - \sum _{i=1}^{N-1} i\right)\); parallelized, it becomes \(O\left( \frac{N^{2}}{8} - \sum _{i=1}^{\frac{N}{8}-1} i\right)\). The total delay is \(2M \times (T_{\mathrm{DFT}} + T_{\mathrm{Trans}})\), where M is the number of rows of the matrix; this is nearly equal to \(2M \times T_{\mathrm{DFT}}\). As this unit lies in the critical path, the maximum delay can be reduced to half, i.e. \(M \times T_{\mathrm{DFT}}\), at the cost of twice the hardware. The unit was synthesized for a maximum size of \(320 \times 320\). To save BRAM resources, the input matrix is reused for storing the outputs. For the 3D case, the DFT2 is used as the base unit, and the maximum delay becomes \(P \times T_{\mathrm{DFT2}}\), where P is the third dimension, i.e. the number of 2D matrices; the maximum value of P is 18 for fDSST. The same DFT2 unit also serves for the inverse transform, with a divisor placed in the accumulator before the delay element to divide by N.

3.2 QR factorization

The proposed DSST architecture performs the QR factorization using the Givens rotation method [20]. The Vivado HLS QR factorization library [26] is used as a reference unit and modified for our architecture. The implementation is for real numbers and is shown in Fig. 4. Algorithm 2 highlights the main computational steps of the QR factorization. The QR factorization unit consists of Givens matrix generation and rotation units. The Givens matrix generation takes as input two elements from two rows of matrix A and provides the Givens matrix as output. The input 2D matrix A and the Givens matrix G are given by

$$\begin{aligned} A = \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ a_{41} & a_{42} & a_{43} & a_{44} \\ \end{bmatrix} \quad {\mathrm{and}} \quad G = \begin{bmatrix} c & s \\ -s & c \\ \end{bmatrix}, \end{aligned}$$
(8)

where \(c = \frac{a}{M}\) and \(s = \frac{b}{M}\). Magnitude M is calculated as

$$\begin{aligned} M = \sqrt{a^{2} + b^{2}} = x \times \sqrt{1 + y \times y}, \end{aligned}$$
(9)

where \(x = \max (a,b)\) and \(y = \frac{ \min (a,b)}{x}\).
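A small software model of Eq. (9), as it might be checked against the RTL, is sketched below. The absolute values and the zero guard are additions needed for a self-contained C++ function, while the hardware resolves these cases with the comparator/multiplexer network of Fig. 5.

```cpp
#include <algorithm>
#include <cmath>

// Overflow-friendly magnitude of Eq. (9): M = sqrt(a^2 + b^2) = x*sqrt(1 + y*y),
// with x the larger and y the ratio of the smaller to the larger magnitude.
float givens_magnitude(float a, float b) {
    float x = std::max(std::fabs(a), std::fabs(b));
    if (x == 0.0f) return 0.0f;                        // both inputs zero
    float y = std::min(std::fabs(a), std::fabs(b)) / x;
    return x * std::sqrt(1.0f + y * y);
}
```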

Fig. 4
figure 4

RTL diagram of QR factorization

Let us take \(a = a_{31}\) and \(b = a_{41}\). With the help of (9), the matrix G is calculated. Equation (9) is implemented by the magnitude calculation unit shown in Fig. 5. In this unit, the maximum of the two inputs is determined using a comparator and two multiplexers. Afterwards, the divider calculates y. Finally, the magnitude M is obtained by a combination of an adder, a multiplier and a square root unit. To obtain the matrix G, c and s are determined. As the magnitude appears in the denominator, division by zero is checked using a comparator and a multiplexer. Matrix R is obtained by turning the lower triangular elements of matrix A to zero. For this purpose, the Givens rotation is applied in the following manner:

$$\begin{aligned} G \cdot \begin{bmatrix} a_{31}&a_{41} \end{bmatrix}^{\rm T} = \begin{bmatrix} a_{31}^{*}&0 \end{bmatrix}^{\rm T}. \end{aligned}$$
(10)

The Givens rotation block is shown on the left of Fig. 5. It is simply a \(2\times 2\) matrix-to-vector multiplication. If the second element b is already zero, then the Givens rotation reduces to \(a = M\) and \(b = 0\); this assignment is implemented with multiplexers. The Givens rotation unit is accompanied by a comparator, shown in the top left of Fig. 5, to avoid assigning wrong values to already zeroed elements; in that case, the first element is simply set equal to the magnitude. To complete the matrix R, now select \(a = a_{21}\) and \(b = a_{31}^{*}\) and repeat until all lower triangular elements are turned to zero. The matrix Q is obtained by applying the same rotations to an identity matrix. A small address generation unit is used for row selection. Row pairs are selected to be operated on in parallel, and the rotations are applied to all the columns of the selected rows.
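The following sketch models one such Givens step in software: it zeroes A[rb][col] against the pivot A[ra][col] and applies the same rotation to every column of the selected row pair. The flat-pointer interface and the explicit zero forcing are illustrative choices; the row-pair scheduling of the address generation unit is omitted.

```cpp
#include <cmath>

// One Givens step of Eq. (10): build G from (a, b) and rotate the full row pair.
// In the complete design the same rotation is also applied to the rows of Q^T.
void givens_step(float *row_a, float *row_b, int ncols, int col) {
    float a = row_a[col], b = row_b[col];
    float m = std::sqrt(a * a + b * b);
    if (m == 0.0f) return;                    // nothing to rotate
    float c = a / m, s = b / m;               // Givens coefficients of Eq. (8)
    for (int j = 0; j < ncols; ++j) {         // rotate all columns of the row pair
        float u = row_a[j], v = row_b[j];
        row_a[j] =  c * u + s * v;
        row_b[j] = -s * u + c * v;
    }
    row_b[col] = 0.0f;                        // force an exact zero below the pivot
}
```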

Fig. 5
figure 5

RTL diagram of magnitude calculator for QR

A single matrix is used for input A and output R to reduce the number of resources. The critical path lies in the Givens generation block because of the magnitude unit, which contains a division and a square root operator. The generated Verilog code can be modified at the RTL level to pipeline the architecture. The number of parallel rotations also impacts resources and performance. Since this unit is not on the critical path, resource optimization is employed: the generation and rotation blocks are parallelized by a factor of two and pipelined with an initiation interval of 4.

3.3 SVD

Fig. 6
figure 6

RTL diagram of SVD

This work uses the two-sided Jacobi method [21] to perform the SVD. The Vivado HLS library [26] is adopted as a reference unit and optimized for the target application. The implementation is for real numbers and is shown in Fig. 6. Algorithm 3 demonstrates the execution of the SVD. The SVD unit consists of the diagonal processor (DP) and the non-diagonal processor. The DP takes as input the 2D matrix A, divided into N/2 diagonal \(2\times 2\) submatrices. Matrix A is the same as in Eq. (8), and the Jacobi left and right matrices are given by \(u = \begin{bmatrix} c1 & s1 \\ -s1 & c1 \\ \end{bmatrix}\) and \(v = \begin{bmatrix} c2 & s2 \\ -s2 & c2 \\ \end{bmatrix}\), where \(c = \cos (\theta )\) and \(s= \sin (\theta )\) with \(\theta = \frac{1}{2} \arctan \frac{b+c}{d-a}\). To avoid the explicit calculation of the arctan, consider

$$\begin{aligned} \tan (2 \theta ) = \frac{2b}{({d-a})}. \end{aligned}$$
(11)

The implementation of the angle calculation is shown in Fig. 7. A divisor, depicted in the figure, generates \(\tan (2 \theta )\). Using trigonometric identities, the cosine and sine are then derived as follows:

$$\begin{aligned} \begin{aligned} \cos (\theta )& = \frac{1}{\sqrt{1+\tan ^{2}( \theta )}},\\ \sin (\theta )& = \cos ( \theta ) \cdot \tan ( \theta ),\\ \tan \left(\frac{\theta }{2}\right)& = \frac{1-\cos (\theta )}{\sin (\theta )}.\\ \end{aligned} \end{aligned}$$
(12)

The computation of the half-angle identities is shown on the left of Fig. 7. The angle calculator is shown on the right of Fig. 7; it consists of a tree of multiplexers that selects the correct angle based on whether the input is real, imaginary or complex.
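A minimal software model of this arctan-free path is sketched below: starting from \(\tan (2\theta )\) of Eq. (11), the identities of Eq. (12) recover \(\cos (\theta )\) and \(\sin (\theta )\) through one halving step. Degenerate cases (for example \(d = a\) or a zero off-diagonal element), which the multiplexer tree of Fig. 7 resolves, are deliberately not modelled.

```cpp
#include <cmath>

// Arctan-free Jacobi angle: from tan(2*theta) = 2b/(d - a) to (cos, sin) of theta.
void jacobi_angle(float a, float b, float d, float &c, float &s) {
    float t2 = 2.0f * b / (d - a);                  // tan(2*theta), Eq. (11)
    float c2 = 1.0f / std::sqrt(1.0f + t2 * t2);    // cos(2*theta)
    float s2 = c2 * t2;                             // sin(2*theta)
    float t  = (1.0f - c2) / s2;                    // tan(theta), half-angle identity
    c = 1.0f / std::sqrt(1.0f + t * t);             // cos(theta)
    s = c * t;                                      // sin(theta)
}
```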

Fig. 7
figure 7

RTL diagram of angle calculator

The Jacobi matrix generation is shown in Fig. 8. The numerator and denominator of (11) are computed via an adder and a subtractor and passed to the angle calculator unit. The Jacobi matrices are then generated using the \(c = \cos (\alpha \mp \beta )\) and \(s = \sin (\alpha \pm \beta )\) identities, where \(\alpha\) and \(\beta\) are the half angles calculated previously. The identities are implemented with simple vector multipliers. The Jacobi rotations are given by

$$\begin{aligned} \begin{bmatrix} c1 & -s1 \\ s1 & c1 \\ \end{bmatrix} \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ \end{bmatrix} \begin{bmatrix} c2 & s2 \\ -s2 & c2 \\ \end{bmatrix} = \begin{bmatrix} a_{11}' & 0 \\ 0 & a_{22}' \\ \end{bmatrix}. \end{aligned}$$
(13)

Thus, the Jacobi rotations are simple matrix multiplications and are implemented by the vector multipliers in Fig. 8.
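For reference, a plain software model of the 2×2 two-sided rotation of Eq. (13) is shown below; the two matrix products correspond to the two vector multipliers of Fig. 8, and the in-place array interface is an illustrative choice.

```cpp
// Apply the left rotation [c1 -s1; s1 c1] and the right rotation [c2 s2; -s2 c2]
// to the 2x2 block m, as in Eq. (13). With angles from the Jacobi angle unit,
// the off-diagonal entries of the result become (numerically) zero.
void jacobi_rotate2x2(float m[2][2], float c1, float s1, float c2, float s2) {
    // left rotation: r = L * m
    float r00 = c1 * m[0][0] - s1 * m[1][0];
    float r01 = c1 * m[0][1] - s1 * m[1][1];
    float r10 = s1 * m[0][0] + c1 * m[1][0];
    float r11 = s1 * m[0][1] + c1 * m[1][1];
    // right rotation: m' = r * V
    m[0][0] = r00 * c2 - r01 * s2;
    m[0][1] = r00 * s2 + r01 * c2;
    m[1][0] = r10 * c2 - r11 * s2;
    m[1][1] = r10 * s2 + r11 * c2;
}
```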

With all the building blocks described, we now show how the three matrices U, S and V are generated. As shown in Fig. 6, the SVD consists of the DP and the non-DP. The quantities used are defined as follows: A_s denotes a diagonal submatrix of A, terms prefixed with DP denote newly updated submatrices from the DP unit, and terms prefixed with I indicate that they come from an identity matrix. DP\(\_c\) and col represent the \(2\times 2\) submatrix of the current iteration and the currently selected column pair, respectively. The algorithm is repeated for a minimum number of iterations to achieve convergence; the literature suggests that 6–10 iterations are enough. In our case, for a dimension of 32, the iteration factor is 6, determined from the table given in [21].

The diagonal processor receives a \(2\times 2\) main-diagonal submatrix A_s of matrix A. It has a swap operator implemented with a multiplexer, which swaps the columns to arrange the diagonal elements in ascending order, as required by the SVD. After the swap, the DP generates the Jacobi matrices and applies the rotations: A_s is first rotated by DP\(\_v\) and then post-multiplied by DP\(\_u\), which turns the off-diagonal elements of A_s to zero. The DP then outputs the new matrices DP\(\_s\), DP\(\_u\) and DP\(\_v\) to the non-DP. The non-DP in Fig. 6 receives two identity matrices U and V and the matrix A, together with DP\(\_u\), DP\(\_s\) and DP\(\_v\) from the DP. Two subcases arise: the current \(2\times 2\) submatrix index DP\(\_c\) is either less than or greater than the current column index col. In the first case, A_s is pre-rotated by the Hermitian transpose of DP\(\_u\) and then post-rotated by DP\(\_v\); in the second case, A_s is pre-rotated by DP\(\_v\) and then post-multiplied by the Hermitian transpose of DP\(\_u\). If the submatrix overlaps with DP\(\_s\), then the DP\(\_s\) values from the DP are used; otherwise, the values from matrix A are used. This process is repeated until the matrix A is diagonalized. The diagonalized matrix is the matrix S containing the singular values. Matrix U (left singular vectors) and matrix V (right singular vectors) are obtained by applying the same DP\(\_u\) and DP\(\_v\) rotations to the identity matrices I_u and I_v, respectively.

Fig. 8
figure 8

RTL diagram of SVD \(2\times 2\) unit


A single matrix is used for input A and output U to reduce the number of resources. The critical path lies in the DP block because of the SVD \(2\times 2\) unit, which contains the angle calculator. The generated Verilog code can be modified at the RTL level to pipeline the architecture, as the critical path is too long. Frequency and latency optimization is employed at the cost of resources: two generation and rotation blocks are used in parallel and pipelined with an initiation interval of 8.

In summary, the left and right Jacobi rotations are first generated and applied to the original matrix A to obtain S. The left rotations applied to an identity matrix produce the matrix U, while the right rotations produce the matrix \(V^{\mathrm{T}}\). This approach, however, applies directly only to symmetric matrices; if the matrix is not symmetric, a further step is needed to symmetrize it, which can be done with the help of Givens rotations.

3.4 Histogram of gradients (HOG)

The architecture suggested relies on the implementation provided by the authors of [25]. The 2D grayscale image is taken as input. The gradients px and py are calculated for each pixel in the x and y directions, respectively, using two subtractors. The gradient magnitude and direction are given by

$$\begin{aligned} M =\sqrt{p_{x}^{2}+p_{y}^{2}}, \\ \tan (\theta ) = \frac{p_{y}}{p_{x}}. \end{aligned}$$

The angle (0–180°) is used to assign the gradient magnitude to one of 9 different bins [24]. The histogram is built from these bins. To save resources, the angle calculation, i.e. the orientation of the gradients, is performed using integer multiplications instead of divisions. This improvement is achieved by performing the angle calculation and the bin assignment together: if the angle lies within one of the nine available slots, the gradient magnitude is assigned to the corresponding bin. As demonstrated by the authors of [25], the gradient magnitude is assigned to bin one if the following inequality holds:

$$\begin{aligned} 0 \le {p_{y}}/{p_{x}}\le \tan (20^{\circ }) \rightarrow 0 \le p_{y} \le p_{x} \tan (20^{\circ }). \end{aligned}$$

This test involves only a multiplication by an integer constant, which saves resources because it avoids the computation of \(\tan (\theta )\) and the associated divisions. Based on the orientation, the gradients are assigned to 9 different bins; these are the contrast-insensitive bins. Afterwards, they are aggregated for smoothing among all bins. Finally, they are normalized using the L1-norm, which, compared to the L2-norm, avoids squaring. Also, instead of dividing, a multiplication by the reciprocal is used. These improvements further save hardware resources without sacrificing much accuracy. To obtain Felzenszwalb's HOG, two extra steps are performed. First, the bin assignment step is repeated, this time assigning the gradients to 18 different bins based on the orientation over (0–360°); these are the contrast-sensitive bins. These 18 bins are averaged for each block, and the previously calculated 9 bins are also averaged for each block. For 4 blocks, the 9 normalized bin elements are averaged as well. The 18 contrast-sensitive, 9 contrast-insensitive and 4 normalization bins form the 31 features of the third dimension of fHOG. After this step, the output is assigned. The implementation diagram is given in [25].
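The division-free binning can be sketched as below for the first quadrant (\(p_{x} > 0\), \(p_{y} \ge 0\)). The floating-point boundary constants and the quadrant restriction are illustrative simplifications, whereas [25] uses scaled integer constants and symmetric comparisons for the remaining sign combinations.

```cpp
// Assign a gradient to one of the 20-degree orientation slots without a
// division or arctan: py/px <= tan(boundary)  <=>  py <= px * tan(boundary).
int hog_bin_q1(float px, float py) {
    // tangents of the upper boundaries of bins 0..3 (20, 40, 60, 80 degrees)
    static const float T[4] = {0.36397f, 0.83910f, 1.73205f, 5.67128f};
    for (int k = 0; k < 4; ++k)
        if (py <= px * T[k]) return k;   // orientation within [20k, 20(k+1)) degrees
    return 4;                            // steeper than 80 degrees
}
```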

4 Implementation results and discussion

The data path of the proposed DSST system is specified using the Vivado HLS tool. Prototyping is performed on a ZedBoard with a Xilinx Zynq xc7z020clg484-1 System-on-Chip. The block-wise implementation results are discussed as follows.

4.1 Discrete Fourier transform

The 1D DFT has a maximum size of \(N=320\). For the DFT, the results are compared with [15]. The Vivado HLS post-synthesis timing and resource results, referred to as HLS, are shown in Tables 2, 3 and 4. The HLS timing results in Table 2 outperform those in [15] by 92%. This improvement comes from using precalculated twiddle factors; in addition, in contrast to the implementation in [15], our implementation is extensively unrolled. Two different implementations are provided here: serial (S_HLS) and parallel (P_HLS). The resources in Table 3 indicate that our serial DFT uses fewer LUTs than the one in [15] but twice as many DSP48E units. The higher number of resources is due to separate hardware for the real and imaginary parts. Also, for the 8-parallel DFT, our resource consumption is much higher than that of [15]. The higher resource consumption is justifiable because the optimization targets performance for higher dimensions. The maximum operating frequency for the 8-parallel DFT is 112 MHz, while [15] operates at 50 MHz. The frequency can be further improved by pipelining the Verilog code generated at the RTL level. It cannot be improved in Vivado HLS, as the pipeline directive unrolls the architecture within the available scope, so only the inner loops are pipelined.

Table 2 Timing results for the FPGA-based DFT in µs
Table 3 Area results for the DFT implementation in FPGA
Table 4 Area and timing results of DFT and DFT2

The simulation results were compared against MATLAB-generated golden matrices. The relative percentage error is computed by \(E = \frac{(R_{\mathrm{MATLAB}}-R_{\mathrm{HLS}})}{R_{\mathrm{MATLAB}}}\cdot 100\), where E is the error. The golden matrices were generated from intermediate values of the algorithm at the input of each block; these inputs were then applied to the HLS-based units and the outputs of both were compared. The average error values in this case were 6.52 and 6.9 for the real and imaginary matrices, respectively. This discrepancy arises because MATLAB uses double-precision values. Our results are still sufficiently accurate, because double precision in hardware would consume four times more resources and the latency would be much higher. The DFT2 is implemented for maximum dimensions of \(320\times 320\). The results for DFT and DFT2 are reported in Table 4. It has a maximum frequency of 112 MHz.

4.2 QR factorization

This unit has maximum dimensions of \(800\times 17\). For QR, the results are compared with [20], where the implementation is for a \(4\times 4\) matrix. The QR in [20] is fully unrolled and uses fixed-point iterations, while our approach is based on floating point; the comparison is therefore made with their 32-bit fixed-point version. A comparison of the timing results from Table 5 indicates that the HLS-based approach takes four times more clock cycles than the one in [20], but the HLS-based area results are significantly better. Our approach uses 2.3 times fewer DSP48E resources; since this is not the critical block, resource optimization was the target. The operating frequency is similar to that of the non-pipelined version. Our approach is generic, while the one in [20] has a fixed size. The worst-case delay for an input of size \(800\times 17\) is 337 µs for the economy-size QR, while consuming 26 DSP48E resources. The maximum frequency is 116 MHz. The error is calculated using the same procedure as in Sect. 4.1. The average error value for matrix Q, the only matrix needed by the DSST algorithm, is 0.034. Our results are sufficiently accurate, because the double precision used by MATLAB would consume four times more resources in hardware.

Table 5 Timing and area results for the FPGA-based \(4\times 4\) QR
Table 6 Timing results for the SVD implementation in FPGA
Table 7 Area results for the SVD implementation in FPGA

4.3 SVD

This unit is implemented for maximum dimensions of \(32\times 32\). For the SVD, the results are compared with [21, 22]: with [21] for low values of N and with [22] for higher values. The authors of [21, 22] use fixed-point iterations, while we have a floating-point implementation. A comparison of the timing results in Table 6 indicates that the HLS-based approach performs better than both implementations for all dimensions except \(N=4\) against [21]. Timing is improved by nearly a factor of 3.7 and 4.8 compared to [22] and [21], respectively. However, the HLS-based area consumption is two times higher, as shown in Table 7, because our angle calculation unit uses floating-point division and square root, while [21, 22] use the CORDIC algorithm. The worst-case delay for an input of size \(32\times 32\) is 4 ms, while consuming 30 DSP48E resources. The maximum frequency is 108 MHz. The error is calculated as in Sect. 4.1. The average error value for matrix U, the only matrix needed by the DSST algorithm, is 7.156. Our results are sufficiently accurate, because the double precision used by MATLAB would consume more resources in hardware and increase the latency.

Table 8 Area, fps and power results of proposed architecture for DSST algorithm

4.4 Histogram of gradients (HOG)

The maximum dimensions of this unit are \(320 \times 240\). The results of [25] are reported here: the maximum operating frequency is 270 MHz, it achieves 60 fps for an image size of \(1920\times 1080\), and it consumes only 12 DSP blocks.

4.5 Overall resources and speed

Table 8 shows the maximum resources used by the units in terms of DSP48E, BRAM, FF and LUTs, together with the maximum power for each unit. Power is reported from the post place-and-route power reports of Vivado HLS. The values are for the maximum dimensions of each unit. The speed in frames per second (fps) is calculated for each unit separately; for each block, an average fps is considered using a range of input sizes. The fps is calculated as \({\mathrm{fps}} = \frac{F_{\mathrm{max}}}{{\mathrm{Cycles}}}\), where Cycles is the number of clock cycles required to process one frame. Table 8 shows the mean fps of each unit. The whole architecture comprises four stages, namely Scale Search (SS), Scale Filter (SF), Translation Search (TS), and Translation Filter (TF). Table 9 shows the mean fps of each stage. The total fps of a stage is computed by adding the reciprocal fps values of the units involved in that stage. The critical stages are TS and TF, as they involve DFT2 units. Using a 2-parallel architecture for DFT2, the fps is improved; this is shown as TS2 and TF2 in Table 9. The most critical stage (the stage with the minimum fps) dictates the mean fps of the whole architecture. Thus, for an image size of \(320\times 240\), the design fits in the Zynq ZedBoard with a frame rate of 25.38 fps. The maximum frequency is 108 MHz. Increasing the image size on the same FPGA would decrease the fps: if the frame size doubles, the fps is lowered by a factor of 1.5. With a different FPGA, the same fps can be achieved at the cost of more resources. For the input and output images, the hls::Mat type from HLS is used; they are transformed into an 8-bit integer matrix for processing.
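Written compactly, and treating the units within a stage as operating sequentially, this combination rule reads

$$\begin{aligned} \frac{1}{{\mathrm{fps}}_{\mathrm{stage}}} = \sum _{u \in {\mathrm{stage}}} \frac{1}{{\mathrm{fps}}_{u}}, \qquad {\mathrm{fps}}_{\mathrm{overall}} \approx \min _{\mathrm{stage}} \, {\mathrm{fps}}_{\mathrm{stage}}. \end{aligned}$$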

Table 9 FPS results for proposed DSST architecture for 240 \(\times\) 320 image
Table 10 Comparison against other implementations

Table 10 shows the comparison of our whole HLS architecture against other tracker implementations. Our frame rate is equal to that of the original DSST [9] but lower than that of the fDSST tracker. In terms of power, the HLS-based solution is much better, as fDSST runs on a Xeon CPU and uses more resources. Another correlation filter-based tracker [27], reported for IoT applications, is implemented for edge devices; it relies on a server for computations, which makes it faster than our HLS-based approach, but again our solution is much better in terms of power. The authors of the rotation-aware DSST tracker (RADSST) [28] use DSST but also integrate rotation awareness for accurate scale estimation; our solution outperforms RADSST in terms of both power and speed. MOSSE [29] is not a DSST-based tracker but also relies on programmable logic. It achieves a higher speed at 4K resolution and is reported here for the power comparison; again, the HLS solution is better in terms of power. The best fps is achieved by the moving target tracker (MTT) [30], which uses a SIMD-based DSP platform and 4 GB of RAM, so in terms of power, resources and portability our solution is better. In comparison to these trackers, the HLS-based solution consumes less power; its speed is comparable to some of the trackers, but the lower resource usage makes it feasible to operate in the field.

5 Conclusions

In this paper, the RTL-level implementation of the major blocks of the Discriminative Scale Space Tracker (DSST) [9] is presented. The implementation covers the major mathematical operations involved, including SVD, QR factorization, DFT2 and the HOG extractor. The DFT is implemented with an 8-parallel architecture, which forms the base of the DFT2 unit; this approach improves the timing by 92% at the cost of increased resources. For QR, the resource consumption is improved by a factor of 2.3 compared to [20]. For SVD, the timing is improved by a factor of nearly 3.8 compared to [22]. For an image size of \(320\times 240\), the design fits in the Zynq ZedBoard with a mean frame rate of 25.38 fps and can thus operate as a stand-alone unit. Future research can focus on further optimizing the operations involved. Effort can also be made to integrate the complete algorithm and interface it with a real camera for field testing. Finally, the overall accuracy can be evaluated on the available tracking databases.