Real-time implementation of fast discriminative scale space tracking algorithm

Real-time object tracking is an important step of many modern image processing applications. The efficient hardware design of real-time object tracker must achieve the desired accuracy while satisfying the frame rate requirements for a variety of image sizes. The existing methods of visual tracking employ sophisticated algorithms and challenge the capabilities of most embedded architectures. Discriminative scale space tracking is one algorithm that is capable of demonstrating good performance with affordable complexity. It has a high degree of parallelism which can be exploited for efficient implementation of reconfigurable hardware architectures. This paper proposes a real-time implementation of the discriminative scale-space tracker on FPGA for the major blocks. A careful design exploration of core mathematical operations of the tracking algorithm is performed to improve their hardware utilization and timing performance. Among the core functional units optimized in this work, the discrete Fourier transform achieves a computational time improvement of 92% relative to existing works, QR factorization achieves a 2.3×\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document} reduction in resource utilization, and singular value decomposition yields a 3.8×\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document} improvement in processing time. The proposed data path architecture is designed using Vivado HLS tool set and implemented for Zync Zed Board (xc7z020clg484-1). For an input image size of 320 ×\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\times$$\end{document} 240, the proposed architecture achieves a mean 25.38 fps.


Introduction
Visual object tracking has got significant research interest in recent years. The purpose of visual tracking is to identify the updated location of the target object in the incoming video sequence, given an initial target location in one frame. It finds its application in several exciting scenarios, including, but not limited to, computer vision, smart video surveillance, robotics, automation. Real-time object tracking is a challenging task whose performance is influenced by various factors, including camera motion, background variations of the scene, and complex motion of the object. To deal with these challenges, sophisticated algorithms with an optimal set of parameters are required to achieve a good degree of accuracy. Moreover, the use of high-resolution cameras further increases the computations required for successful tracking.
At present, visual tracking is mostly performed using software-based platforms, including PCs and embedded processors. However, the frame rate performance of software systems is mostly not enough to support several real-time, mission-critical applications such as tracking for accident prevention, security and defence, etc. Moreover, scale variations and requirements of multi-target tracking limit the use of the serial approach for data-centric applications. Therefore, significant improvement is required at algorithmic and implementation levels to build a real-time, stand-alone object tracking devices for most mission-critical systems. Field programmable gate array (FPGA) is a kind of hardware inherently suited for such applications, thanks to its parallel processing structure, large data throughput interfaces, and integration capability. Existing works on object tracking 1 3 are either based on discriminative [1][2][3] or generative [4,5] approaches. The discriminative approaches use machine learning methods to learn the target location employing a filter, which is later used to estimate the target location. The generative approaches deal with creating the statistical model of the target. Studies have shown that discriminative approaches show better performance and require less computation [6][7][8]. An emerging approach is a multi-aspect detection, which considers both the size and location of the target. In this context, two techniques are proposed. The first approach is named as joint scale space tracking and utilizes a 3D correlation filter. The second approach, named multiresolution tracking, utilizes a 2D filter at multiple resolutions, creating a 3D pyramid for detection. Both approaches are computationally intensive and not suitable for efficient hardware implementation. Recently, in [9], the authors proposed a technique named as discriminative scale space tracking (DSST), which demonstrates a good performance with reasonable complexity. The DSST algorithm achieves better performance using separate filters for translation and scale estimation [9,10]. At first, the change in the target location is estimated using a translation filter. Next, the updated location is fed to the scale filter to estimate the target size. Finally, both filters are updated for the image frame. The method continues iteratively for the video sequence. The high degree of parallelism of the DSST algorithm makes it a suitable candidate for hardware implementation. The major mathematical operations involved in DSST are singular value decomposition (SVD), QR factorization, twodimensional discrete Fourier transform (DFT2), and histogram of gradients (HOG). The filter is applied to an image by performing pointwise multiplication with all the pixels. This process is called windowing, which is the most critical minor operation in terms of resources.
The majority of existing works are based on software implementation of these mathematical operators, mainly focusing on the performance rather than hardware resources for higher dimensions. Therefore, there is a substantial requirement for hardware implementation of these operations targeting a complete visual tracking system on a standalone device. A survey on hardware implementations on visual object trackers is also provided in [11]. This work deals with the FPGA implementation of major blocks of a real-time DSST algorithm. We propose suitable implementation strategies for the core operations of the DSST algorithm. The implementation of the DSST is carried out using Xilinx Vivado HLS 2016.1 tool with Zedboard with Xilinx Zynq xc7z020clg484-1 System on Chip as target device. The proposed architecture is able to operate at 100 MHz clock frequency for an image of dimensions 320 × 240 pixels. It is able to achieve an estimated mean frame rate of 25.38 fps. The proposed system is fully scalable for higher image dimensions.
The rest of this paper is organized as follows. Section 2 deals with the description of the algorithm and state of the art regarding mathematical operations. Section 3 describes the proposed architecture of these operations. Section 4 deals with the implementation strategies and results. Finally, Sect. 5 concludes the paper.

Introduction to discriminative scale space tracking (DSST) algorithm
This section deals with the details of the DSST algorithm [9] and discusses the state-of-the-art implementations for the mathematical operations involved. Algorithm 1 demonstrates the main computation steps of fast DSST (FDSST). The algorithm receives an image and the initial target location as input. An image patch f centered around an initial target location I is extracted using a HOG extractor. These image features are utilized for learning the target translation Discriminative Correlation Filter (DCF). As the domain dimension of f is arbitrary, this can also be used for the scale translation DCF. H l refers to a filter under which the correlation error between the extracted image patch and the desired output is minimum. The detailed derivation of (1) is provided in [9]. The filter equations are given as The capital letters denote the Fourier transform of the quantities. All quantities are described in Table 1. Equation (2) is the final equation of the correlation filter. As the approach is iterative, for each new frame the filter is updated. The numerator Ã l t in (2) is updated according to (3) while the denominator B t in (2) is updated according to (4). The dimensions are compressed using standard principal component analysis (PCA) to reduce the size of DFTs considering the compression suggested in [9] to realize a fast DSST. Mathematically, it is performed by exploiting a template u t = (1 − )u t + f t . The tilde terms are obtained by windowing the quantities with P, which is a low-dimension subspace of the features. The compressed dimensions are obtained using the eigenvalue decomposition of the autocorrelation matrix of u t . As shown in Algorithm 1, the compression step is achieved with SVD and QR decomposition. This step is the training or learning step. The filter application or detection part is implemented by (2), which calculates the correlation scores. z t is a new extracted sample from the estimated 2D target location using HOG extractor. By taking the inverse DFT of Y t and maximizing the correlation scores, the new target estimation is obtained. This step completes the translation estimation. The scale estimation is obtained by repeating the above steps for scale one-dimensional filter, using the updated target location from translation estimation. This way, searching the scale in the updated location saves a lot of computations. The algorithm performs the scaling, translation, filter estimation and update process, as described in (1), (2), (3) and (4). They involve the scale and translation filter estimation and update equations. As mentioned earlier, the core mathematical operations of DSST algorithm involve DFT2, QR, SVD and HOGs. These operations are discussed as follows.

DFT and DFT2
The synthesis and analysis equations of discrete Fourier transform (DFT) are given, respectively, as respectively, where W nk N = e − j2 N kn is the twiddle factor, X n and x n are complex numbers. The twiddle factor can also be expressed as W nk N = cos( 2 n N ) − j ⋅ sin( 2 n N ) using Euler's identity. if t = 1 then 3: Translation estimation: 4: Extract z t,trans ← I t at {p t−1 , s t−1 } using HOG 5: Z t,trans ← z t,trans using DFT2 and compute correlation scores y t,trans using (2) 6: Set p t = max{y t,trans } 7: Scale estimation: 8: Extract z t,scale ← I t at {p t , s t−1 } using HOG 9: Z t,scale ← z t,scale using DFT2 and compute correlation scores y t,scale using (2) 10: Set s t = max{y t,scale } 11: end if 12: Model update: 13: A straightforward implementation of N-Point DFT has a complexity of O(N 2 ) operations. Fast Fourier transform (FFT) is a hardware friendly algorithm which reduces the DFT computations to O(Nlog 2 N ), using well-known decimation in time (DIT) and decimation in frequency (DIF) techniques [12]. A number of hardware implementations of FFT exist including parallel [13] and serial approaches [14]. FFT requires N = 2 n which is not fixed in this case. So, instead DFT is implemented. An implementation for DFT is provided in [15], which uses the Coordinate Rotation Digital Computer (CORDIC) algorithm to calculate the twiddle factor. Our approach uses precalculated twiddle factors and thus improves performance.
The DFT2 is the two-dimensional Fourier transform applied to a matrix. To compute DFT2, first DFT is applied along the rows of the matrix, and the result is transposed. Then DFT is applied along with the columns. Finally, results are transposed and assigned to the output. In most of the published works, the DFT2 is approximated by two dimensional fast Fourier transform (FFT2). Instead, this work x n cos 2 n N − j ⋅ sin 2 n N

QR factorization
QR factorization is the decomposition of a matrix A of dimensions m × n into two matrices, i.e. Q matrix of dimensions m × m and R matrix of dimensions m × n . Q is the orthogonal matrix, while R is the upper triangular matrix. The literature proposes three main approaches for QR factorization, namely, Gram-Schmidt approach [17], Householder transformation [18] and approach based on Givens rotation [19,20]. This work adopts the third approach, i.e. Givens rotation. The idea is to apply Givens rotations to elements of the lower triangle of A and turn them to zero. When all the lower triangle elements are zeroed, matrix R is obtained. Applying the same transformation to an identity matrix in parallel gives the matrix Q T . The Givens rotation matrix is of the form, a is the first element of the row pair and b is the element which has to be turned to zero below a. A systolic array-based implementation is proposed in [20]. Our approach is similar, but we focus on resource optimization rather than performance. We also employ row-parallel approach, which improves performance by offering more parallelism.

SVD
SVD is the eigenvalue decomposition of a matrix A of dimensions m × n into three matrices U, S and V of dimensions m × m , m × n and n × n , respectively. U and V contain the left and right eigenvectors of A, while S is a diagonal matrix containing real eigenvalues. SVD in hardware is mostly implemented using two-sided Jacobi method [21]. The idea is to divide the matrix into 2 × 2 small matrices. Jacobi rotations to elements of the matrix A are applied from left and right, hence the name two-sided Jacobi. This multiplication turns non-diagonal elements to zero giving the matrix S. Similar transformations to the identity matrix gives matrix U and V. The Jacobi rotation matrix is of the form and is given by = 1 2 arctan c+b d−a . Fixed point based implementation are given in [21,22]. Our approach is similar but we focus on time optimization because this unit has small dimensions in the algorithm, so it is operated in parallel.

Histogram of gradients (HOG)
HOG is a classifier used for target recognition. It is constructed with the help of image gradients and extracts the features contained in image pixels. HOG is a computationally intensive operation, and its implementation is proposed by a number of approaches [23][24][25]. The authors of [23] provide a comparative study of different implementations as well. The implementation in [24] utilizes the least amount of resources but operates below 100 MHz. In [25], an approach is proposed which achieves a frame rate of 60 and requires less amount of hardware resources. It avoids expensive angle calculation using integer multiplication and inequality comparisons.
As a first step, the magnitude and orientation of an image gradient are computed. The magnitude of the gradient is assigned to suitable bins among the nine available. This assignment is done based on the gradient orientation (0-180) and for (0-360). The window for detection is composed of 8 × 8 pixels of non-overlapping cells. Afterwards, an aggregate module is utilized for summing the 64 pixels of each bin to create the histogram. Finally, they are passed to a normalization phase. The details of the algorithm are given in [25].

Proposed architectures
This section deals with the details of the architecture of the mathematical operations described in the previous section. The flowchart is depicted in Fig. 1. The proposed DSST algorithm has four major steps. The translation search, i.e. the 2D position of target and translation filter update steps involve HOG extraction and DFT2 of 3D matrices. The scale search and scale filter update involve HOG extraction and DFT of a 2D matrix. SVD and QR are involved in translation filter update and scale filter update steps respectively. The DFT unit is the most critical block in terms of performance because of operation on 3D 320 × 320 matrix. Vivado HLS is used as the base tool for synthesizing and simulation.

Discrete Fourier transform and DFT2
The vector DFT acts as a building block for matrix DFT, i.e. DFT2. The architectures for both are discussed as follows.

DFT
). Now, in the next clock cycles, the basic DFT receives the inputs X_[i + 1] along with the twiddle factors. After it performs the complex multiplication, the results are accumulated. When the last input is processed, j (the output loop counter which is 0 at first) assigns the first output Y_real[i] , Y_imag[i] via demultiplexers. In the end, j is incremented and the register is reset. This whole process is repeated until all the N outputs are produced. This approach is serial as it takes N cycles to produce one output and consumes 1 input per cycle.
Using the basic blocks discussed above, now the architecture is parallelized to support fast DFT for higher image dimensions. This parallelization is shown in the lower part of Fig. 2. By operating eight elements in parallel, Equation (6) can be modified as ) . The architecture is implemented for maximum N = 320 . If N < 320 then a comparator (not shown in figure for simplicity) limits the number of traverses through the block. For improving performance, task-level parallelism is used with the help of Vivado HLS Dataflow pragma.

DFT2
The proposed DFT2 architecture is shown in Fig. 3 and uses the row-column decomposition approach. The real and imaginary parts are kept separate so that operations run in parallel. The 2D matrix is taken as input. Rows are selected and passed to the DFT calculation blocks first, while the coefficients are precomputed and saved in BRAM blocks. 8-parallel unit is used as a basic block for this unit. The output goes to a transpose unit and afterwards to the memory.
In the next steps, relevant outputs are fed along with the weights to 8-parallel DFT block, to compute the columnwise DFT.
The transpose unit has a delay proportional to N. It is where M is the number of rows of the matrix. This is nearly equal to 2M × (T DFT ) . As this unit lies in the critical path, at the cost of twice the hardware, the maximum delay can be reduced to half that is M × (T DFT ) . It was synthesized for a maximum size of 320 × 320 . To save the BRAM resources, the same input matrix is used for saving the outputs. For DFT2 (3D), DFT2 is used as a base unit. The maximum delay will be P × (T DFT2 ) , where P is the third dimension, i.e. the number of 2D matrices. The maximum value of P is 18 for fDSST. The same DFT2 unit is used with a divisor in the accumulator before the delay element to divide by N.

QR factorization
The proposed DSST architecture performs the QR factorization using the Givens Rotation Method [20]. The Vivado HLS QR factorization library [26] is used as a reference unit and modified for our architecture. This implementation is for real numbers and is shown in Fig. 4. x . Lets take a = a 31 and b = a 41 . With the help of (9) matrix G is calculated. Equation (9) is implemented by magnitude calculation unit shown in Fig. 5. In this unit, the maximum of the two inputs is determined using a comparator and two multiplexers. Afterwards, the divider calculates y. In the end, magnitude M is obtained by a combination of adder, multiplier and square root unit. To obtain matrix G, c and s are determined. As the magnitude appears in the denominator, divide by zero is checked using a comparator and a multiplexer. Matrix R is obtained by turning the lower triangular elements of matrix A to zero. For this purpose, the Givens rotation is applied in the following manner:   It is simply a 2 × 2 matrix to vector multiplication. If the second element b is already zero, then Givens rotation is simply a = M and b = 0 . This assignment is implemented with multiplexers. Givens rotation unit is accompanied by a comparator to avoid assigning wrong values to zeroed positioned elements shown in the top left of Fig. 5. In that case, the first element is equal to the magnitude. For complete matrix R now select a = a 21 and b = a 31 * and repeat until all the lower triangular elements are turned to zero. The matrix Q is obtained by performing the same generated rotations to an identity matrix. A small address generation unit is used for row selection. Row pairs are selected to be operated in parallel. The rotations are applied to all the columns of the selected rows.
A single matrix is used for input A and output R to reduce the number of resources. The critical path is in the Givens generation block because of the magnitude unit. This block has a division and square root operator. The generated Verilog code can be modified at RTL level to pipeline the architecture. Also, the number of parallel rotations impact resources and performance. Resource optimization is employed since this unit is not on the critical path. The generation and rotation blocks are parallelized by a factor of two and pipelined with an initiation interval of 4.

SVD
This work uses the two-sided Jacobi method [21] to perform SVD. The Vivado HLS library [26] is adopted as a reference unit and optimized for the target application. This implementation is for real numbers. The implementation is shown in Fig. 6. Algorithm 3 demonstrates the execution of SVD. The SVD unit consists of the diagonal processor (DP) and non-diagonal processor. The DP takes as input the 2D matrix A, divided into N/2 2 × 2 submatrices. Matrix A is the same as Eq. (8)  The implementation of angle calculation is shown in Fig. 7. A divisor depicted in the figure generates tan(2 ) . By the use of trigonometric identities, cos and sine are derived. They are as under .  The computation of half-angle identities is shown in the left of Fig. 7. Angle calculator is shown in the right of Fig. 7, which consists of a tree of multiplexers to select the correct angle based on whether the number is real, imaginary or complex. The Jacobi matrix generation is shown in Fig. 8. The numerator and denominator of (11) are computed via adder and subtractor and passed to the angle calculator unit. The Jacobi matrices are generated using c = cos( ∓ ) and s = cos( ± ) identities, where and are half angles calculated previously. The identities are implemented with simple vector multiplier. The Jacobi rotations are given by Thus Jacobi rotations are simple matrix multiplications and are implemented by vector multipliers in Fig. 8.
As all the building blocks are described, next is the demonstration of how the three U, S and V matrices are generated. As shown in Fig. 6, SVD consists of DP and non-DP. The quantities used are defined as follows: A_s denotes diagonal submatrix of A, terms with DP mean newly updated submatrix from DP unit and terms with I indicate they are from identity matrix. DP_c and col represent 2 × 2 submatrix the current iteration and currently selected column pairs, respectively. The algorithm is repeated for a minimum number of iterations to achieve convergence. Literature suggests that 6-10 iterations are enough for it. In our case, for a dimension of 32, the iteration factor is 6. The iteration factor is determined from the table given in [21].
The diagonal processor receives a 2 × 2 main diagonal submatrix A_s of matrix A. It has a swap operator (12) cos( ) = 1 √ 1 + tan 2 ( ) , sin( ) = cos( ). tan( ), implemented with a mux. The swap operator swaps the columns to arrange the diagonal elements in ascending order as SVD requires it. After the swap, DP generates the Jacobi matrices and applies the rotations. A_s is rotated by DP_v first and then it is post multiplied by DP_u . Now the nondiagonal elements of A_s are turned to zero. DP now outputs the new matrices DP_s , DP_u and DP_v to the non-DP. Non-DP in Fig. 6  if c diag1 < c diag2 then 6: Swap the columns. 7: end if 8: Jacobi matrices generation: 9: Compute half angles cos α, cos β and sin α, sin β using (11) and (12)  10: G enerate the matrix DP u and DP v by using c = cos α ∓ β and s = sin α ± β identities 11: Jacobi two sided rotation: 12: M atrix S diagonal is computed by applying the Jacobi rotations on A using (13)  A single matrix is used for input A and output U to reduce the number of resources. The critical path is in the DP block because of the SVD 2 × 2 unit, which has the angle calculator. The generated Verilog code can be modified at the RTL level to pipeline the architecture as the critical path is too long. At the cost of resources, frequency and latency optimization is employed. The 2 generations and rotation blocks are used in parallel and pipelined with the initiation interval of 8.
In summary, initially, Jacobi's left and right rotations are generated. They are applied to original matrix A to obtain S. Left rotations on identity matrix produce Matrix U while the right rotations produce matrix V T . But this approach can only be applied to symmetric matrices. If the matrix is not symmetric, then a further step is needed to symmetrize the matrix. The symmetrization can be done with the help of Givens rotations.

Histogram of gradients (HOG)
The architecture suggested relies on the implementation provided by the authors of [25]. The 2D grayscale image is taken as input. Gradients p x and p y are calculated for each pixel in both x and y directions, respectively, using two subtractors. The gradient magnitude and direction are given by The angle (0-180) is used to assign a gradient magnitude to 9 different bins [24]. The histogram is built from these bins. To save resources angle calculation, i.e. the orientation of the gradients is computed using integer multiplications instead of division. This improvement is achieved by performing angle calculation and bin assignments together. If the angle lies between any one of the nine available slots, then that gradient magnitude is assigned to the corresponding bin. As demonstrated by authors of [25], the gradient magnitude is assigned to bin one if the following inequality holds: This step simply involves a multiplication of integer constant to satisfy the inequality. This step helps save resources because it avoids a calculation of tan( ) , which involves divisions. Based on the orientation, the gradients are assigned to 9 different bins. They are the contrast insensitive bins. Afterwards, they are aggregated for smoothening among all bins. At last, they are normalized using L1-norm. As compared to the L2-norm, L1-norm avoids squaring. Also, instead of dividing the reciprocal is multiplied. This improvement further saves hardware resources without sacrificing accuracy too much. To obtain Felzenszwalb's HOG, two extra steps need to be performed. First bin assignment step is performed again. This time the gradients are assigned to 18 different bins based on orientation (0-360). They are the contrastsensitive bins. These 18 bins from each block are averaged together. Also, the previously calculated 9 bins are averaged for each block. For 4 blocks, the 9 normalized bin elements are also averaged. These 18 directional, 9 non-directional and 4 normalization bins form the 31 third dimensional features of fHOG. After this step, the output is assigned. The implementation diagram is demonstrated in [25].

Implementation results and discussion
The data path of the proposed DSST system is specified using the Vivado HLS tool. The prototyping is performed on Zedboard with Xilinx Zynq xc7z020clg484-1 (20). System-on-Chip. The block-wise implementation results are discussed as follows.

Discrete Fourier transform
DFT 1D has maximum size N = 320 . For DFT, the results are compared with [15]. Vivado HLS post-synthesis timing and resources results, referred to as HLS, are shown in Tables 2, 3 and 4. HLS timing results in Table 2 outperform the one in [15] by factor of 92%. This improvement is because of using precalculated twiddle factors. Also, in contrast to the implementation in [15], our implementation is extensively unrolled. Two different implementations are provided here, serial S_HLS and parallel P_HLS. Resources in Table 3 indicate that our serial DFT uses less LUTs than the one in [15] but uses twice the DSP48E units. The higher number of resources is because of separate hardware for imaginary and real parts. Also, for 8 parallel DFT, our resource consumption is much higher than [15]. The higher resource consumption is justifiable because optimization is done for performance in the case of higher dimensions. The maximum operating frequency for 8 parallel DFT is 112 MHz while [15] operates at 50 MHz. The frequency can be further improved by pipelining the Verilog code generated at the RTL level. It cannot be improved in Vivado HLS as the pipeline commands unroll the architecture in the scope available. So only inner loops are pipelined. The simulation results were compared against MATLABgenerated golden matrices. The relative percentage error is computed by E = , where E is the error. The golden matrices were generated using intermediate values of the algorithm at the input of each block. This is then applied to HLS-based units. Afterward, the outputs of both are compared. The average error values, in this case, were 6.52 and 6.9 for real and imaginary matrices, respectively. This is because MATLAB uses double-precision values. Our results are still approximate enough because double precision in hardware will consume four times more resources. Also, the latency will be much slower. DFT2 is implemented for maximum dimensions of 320 × 320 . The results for DFT and DFT2 are reported in Table 4. It has a maximum frequency of 112 MHz.

QR factorization
This unit has the maximum dimensions of 800 × 17 . For QR, the results are compared with [20] where implementation is for a 4 × 4 matrix. QR [20] is fully unrolled and uses fixed-point iterations while the approach is based on floating-point. The comparison can be made from the 32-bit fixed-point version. A comparison of the timing results from Table 5 indicates that HLS-based approach takes 4 times more clock cycles than the one in [20]. But HLS based area results are significantly better. This approach takes 2.3 times less DSP48E resources as this is not the critical block, so resource optimization was our target. The operating frequency is almost similar to the non-pipelined version. Our approach is generic while the one in [20] is a fixed size. The worst-case delay for an input of size 800 × 17 is 337 µs for QR economy. At the same time, it consumes 26 DSP48Es resources. The maximum frequency is 116 MHz. The error is calculated using the same procedure as in Sect. 4.1. The average error value is 0.034 for matrix Q as only matrix Q is needed for the DSST algorithm. Our results are approximate enough because double precision used by MATLAB in hardware will consume four times more resources.

SVD
This unit is implemented for maximum dimensions of 32 × 32 . For SVD, the results are compared with [21,22]. The results are compared for low values of N with [21] and for higher values with [22]. Authors of [21,22] use fixed-point iterations while we have a floating-point implementation. Comparison of the timing results from Table 6 indicates HLS-based approach performs better than both implementations for all dimensions except for N = 4 with [21]. Timing is improved by nearly a factor of 3.7 and 4.8 as compared to [22] and [21], respectively. But HLS based area consumption is 2 times higher, as shown in Table 7. The reason being our angle calculation unit uses division and the square root of floating-point numbers, while [21,22]

Histogram of gradients (HOG)
The maximum dimensions of this unit are 320 × 240 . The results of [25] are reported here. The maximum operating frequency is 270 MHz. 60 fps for an image size of 1920 × 1080 . It consumes only 12 DSP blocks. Table 8 shows the maximum resources used by the units in terms of DSP48E, BRAM, FF and LUTs. Also, the maximum power for each unit is displayed. Power is reported using power reports of post place-and-route from VIVADO HLS. The values are for the maximum dimensions of each unit. The speed/frames per second (fps) is calculated for each unit separately. For each block, an average fps is considered using a range of sizes as input. The fps is calculated as fps = F max Cycles , where cycles is the number of clock cycles required to process one frame. Table 8 shows the mean fps of each unit. As for the whole architecture, there are four stages, namely, Scale Search (SS), Scale Filter (SF), Translation Search (TS), and Translation Filter (TF). Table 9 shows the mean fps of each stage.Total fps for a stage is computed by adding reciprocal fps of units involved in the  For the input and output images, (hls::Mat) from HLS is used. They are transformed into an 8-bit integer matrix for processing. Table 10 shows the comparison of our whole HLS architecture against other tracker implementations. Our frame rate is equal to the original DSST [9], but it is less than the fDSST tracker. In terms of power HLS-based solution is much better as the fDSST runs on Xeon CPU and uses more resources. Another correlation filter-based tracker [27] reported is for IoT applications and is implemented for edge devices. It relies on the server for computations, which helps it speed up then our HLS based approach. Again in terms of power our solution is much better. The authors of Rotation Aware DSST tracker [28] use DSST but also integrate rotation awareness for accurate scale estimation. Our solution dominates both in terms of power and speed the RADSST [28]. MOSSE [29] is not DSST-based tracker but relies on programmable logic. It has a higher speed at 4K resolution, but it is reported to compare power. Again for power HLS solution is better. The best fps is of Moving Target Tracker MTT [30] which uses SIMD based DSP platform. It also uses 4GB of RAM. So in terms of power, resources and portability our solution is better. In comparison to these trackers, HLS based solution consumes less power. The speed is comparable to some of the trackers, but the fewer resources make it feasible to operate on the field.

Conclusions
In this paper, the RTL level implementation of the major blocks of Discriminative Scale Space Tracker (DSST) [9] is presented. The implementation is given in terms of the major mathematical operations involved including SVD, QR, DFT2 and HOG extractor. DFT is implemented by 8-parallel architecture; this is the base for DFT2 unit. This approach improves the timing by 92% with increased resources. For QR, the resource consumption is improved by a factor of 2.3 compared to [20]. For SVD, the timing is improved by a factor of nearly 3.8 compared to [22]. For an image size of 320 × 240 it is able to fit in Zync Zed board with a mean frame rate of 25.38 fps, thus can be operated as a stand-alone unit. Future research can be performed at further optimizing the operations involved. The effort can also be made to integrate the whole algorithm implementation and interface with a real camera to test it on the field. Finally, the overall accuracy can be compared to the databases available.
Funding Open access funding provided by Politecnico di Torino within the CRUI-CARE Agreement.

Conflict of interest
The authors have no conflicts of interest to declare and no funding was received for conducting this study.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.