An optimization of the FPGA based wavelet trigger in radio detection of cosmic rays

For the observation of ultra high-energy cosmic rays (UHECRs) by the detection of their coherent radio emission an FPGA based wavelet trigger is being developed. Using radio detection, the electromagnetic part of an air shower in the atmosphere may be studied in detail, thus providing information complementary to that obtained by water Cherenkov detectors which are predominantly sensitive to the muonic content of an air shower at ground. For an extensive radio detector array, due to the limited communication data rate, a sophisticated self trigger is necessary. The wavelet trigger investigating online a power of signals is promising, however its implementation requires some optimizations. The digitized signals are converted from the time to frequency domain by a 32-point FFT procedure, then multiplied by wavelet transforms. Altera® FFT routines convert ADC data as blocks of 2N samples. FFT coefficients are provided in a serial stream in 2 time bins. An estimated signals power strongly depends on relatively positions of the FFT(data) and the wavelet transforms in a frequency domain. Additional procedure has to calculate a most efficient selection of the sample block to reach a response corresponding to a maximal signal power. If a set of FFT coefficients were available in each clock cycle, the signal power could be estimated also in each clock cycle and additional tuning procedure would not be necessary. The paper describes an implementation of the 32-point FFT algorithm into Altera® FPGA providing all 32 complex DFT coefficients for the wavelet trigger in each clock cycle as well as a resource occupancy, timing and a power consumption for several variants implementing up to 12 wavelet engines. Measurements on an Altera®'s development kit fully confirmed our expectation based on simulated configurations. The presented results give a green light for a development of the Front-End Board prototype based on the newest Cyclone® V FPGA with the wavelet trigger for radio detection of cosmic rays.


Zbigniew Szadkowski,
Abstract-For the observation of ultra high-energy cosmic rays (UHECRs) by the detection of their coherent radio emission an FPGA based wavelet trigger is being developed. Using radio detection, the electromagnetic part of an air shower in the atmosphere may be studied in detail, thus providing information complementary to that obtained by water Cherenkov detectors which are predominantly sensitive to the muonic content of an air shower at ground. For an extensive radio detector array, due to the limited communication data rate, a sophisticated self trigger is necessary. The wavelet trigger investigating online a power of signals is promising, however its implementation requires some optimizations. The digitized signals are converted from the time to frequency domain by a 32-point FFT procedure, then multiplied by wavelet transforms. Altera R FFT routines convert ADC data as blocks of 2 N samples. FFT coefficients are provided in a serial stream in 2 N time bins. An estimated signals power strongly depends on relatively positions of the FFT(data) and the wavelet transforms in a frequency domain. Additional procedure has to calculate a most efficient selection of the sample block to reach a response corresponding to a maximal signal power. If a set of FFT coefficients were available in each clock cycle, the signal power could be estimated also in each clock cycle and additional tuning procedure would not be necessary. The paper describes an implementation of the 32-point FFT algorithm into Altera R FPGA providing all 32 complex DFT coefficients for the wavelet trigger in each clock cycle as well as a resource occupancy, timing and a power consumption for several variants implementing up to 12 wavelet engines. Measurements on an Altera R 's development kit fully confirmed our expectation based on simulated configurations. The presented results give a green light for a development of the Front-End Board prototype based on the newest Cyclone R V FPGA with the wavelet trigger for radio detection of cosmic rays.

I. INTRODUCTION
IR ESULTS from various cosmic rays experiments located on the ground level, point to the need for very large aperture detection systems for ultra-high energy cosmic rays. With its nearly 100% duty cycle, its high angular resolution, and its sensitivity to the longitudinal air-shower evolution, the radio technique is particularly well-suited for detection of Ultra High-Energy Cosmic Rays (UHECRs) in large-scale arrays. The present challenges are to understand the emission mechanisms and the features of the radio signal, and to develop an adequate measuring instrument. Electronpositron pairs generated in the shower development are separated and deflected by the Earths magnetic field, hence introduce an electromagnetic emission [1], [2]. During shower development, charged particles are concentrated in a shower disk of a few meters thickness. This results in a coherent radio emission up to about 100 MHz. Short but coherent radio pulses of 10 ns up to a few 100 ns duration are generated with an electric field strength increasing approximately linearly with the energy of the primary cosmic particle inducing the extended air showers (EAS), i.e. a quadratic dependence of the radio pulse energy vs. primary particle energy. In contrast to the fluorescence technique with a duty cycle of about 12% (fluorescence detectors can work only during moonless nights), the radio technique allows nearly full-time measurements and long range observations due to the high transparency of the air to radio signals in the investigated frequency range.
The radio detection technique will be complementary to the water Cherenkov detectors and allows a more precise study of the electromagnetic part of air showers in the atmosphere. In addition to a strong physics motivation, many technical aspects relating to the efficiency, saturation effects and dynamic range, the precision for timing, the stability of the hardware developed, deployed and used, as well as the data collecting and system-health monitoring processes will be studied and optimized [3].
One of the currently developed technique is an radio signals power estimation based on the wavelet transforms.

II. WAVELETS
Fourier transform is useful to analysis the frequency component over the whole time, but it can not catch the change in frequency response with respect to time. Short-Time Frequency transform (STFT) uses a window function to catch the frequency component in a time interval. Although STFT can be use to observe the change in frequency response with respect to time, there is still a problem that the fixed width of the window function lead to the fixed resolution. On the other hand, from the Uncertainty Principle, we know that the product of the time domain resolution and the frequency resolution is constant, so we cannot obtain high resolution at both the time domain and frequency domain at the same time. The wavelet transform is one of the solutions to the above problem: by changing the location and scaling of the mother wavelet, we can introduce a multi-resolution concept. By an implementation of several windows with variable width, the wavelet transform can capture both the short duration, high frequency and the long duration, low frequency information simultaneously. It is more flexible than STFT and particularly useful for the analysis of transients, aperiodicity and other non-stationary signal feature.
Let us investigate a time series X, with values of x n , at time index n. Each value is separated in time by a constant time where the asterisk (*) denotes complex conjugate. The above sum can be evaluated for various values of the scale s (usually taken to be multiples of the lowest possible frequency), as well as all values of n between the start and end dates. It is possible to compute the wavelet transform in the time domain according to (2). However, it is much simpler to use the fact that the wavelet transform is the convolution between the two functions X and ψ, and to carry out the wavelet transform in Fourier space using the Fast Fourier Transform (FFT). In the Fourier domain, the wavelet transform is : Unlike the convolution, the FFT method allows the computation of all n points simultaneously, and can be efficiently coded using any standard FFT package. Wavelets coefficients allow an estimation of signal power. The global wavelet spectrum, defined as the time average over a series of p-wavelet powers, can be expressed as [10]: A sum of products of Fourier coefficients calculated in a FFT32 routine for ADC data data (x n ) in each clock cycle with pre-calculated Fourier coefficients of reference wavelet gives an estimation of power for selected type of the wavelet. Only a single FFT32 routine for on-line calculation of Fourier coefficients for data is needed. Fourier coefficients for various wavelets can be calculated earlier and be available for final power estimation as constants.
A fundamental limitation for wavelet analysis on-line in the FPGA is an amount of embedded DSP multipliers. Multiplication by an utilization of logic elements is rather inefficients.
There are many functions chosen to be the mother wavelet: Haar, Mexican Hat, Morlet, Daubechies, Meyer and others [4], [5]. A selection of the mother wavelet should be correlated with a most frequent shapes of radio signals to get the maximal efficiency of the wavelet trigger approach. A preliminary analysis suggests that the potentially most promising wavelet families are Morlet (4) and Mexican Hat (5) ones.

III. ALTERA R FFT LIBRARY FUNCTIONS
Commercially offered FFT processors for FPGA applications require several clock cycles to accomplish calculation of all complex DFT coefficients. Each stage of the decomposition typically shares the same hardware, with the data being read from memory, passed through the FFT processor and written back to memory. Each pass through the FFT processor is required to be performed log r N times. Popular choices of the Radix are r = 2, 4, and 16. Increasing the Radix of the decomposition leads to a reduction in the number of passes required through the FFT processor at the expense of device resources. Such an approach is very widely useful for many applications, where timing is not crucial. However, there are areas, where the FFT coefficients (based on a new set of samples) have to be known in each clock cycle. Commercial FFT processors, unfortunately, cannot be used. This approach requires special algorithms optimized for a particular solution [6], [7], [8].
Quartus R II environment for an Altera R FPGA programming provides parametrized FFT routines with various architectures: streaming, variable streaming, burst and buffered burst. For the variable streaming provides also fixed or floating-point algorithms with natural or bit-reversed order. However, all routines deliver the FFT coefficients in a serial form. No any Altera R routine allows calculating all FFT coefficients simultaneously.
If FFT coefficients are spread in time, the wavelet transform can be also calculated in a serial way (in a single clock cycle only a single pair ofX n is multiplied by a single pair of ψ * ), however, a product will strongly depend on a relative position ofX n and ψ * . If the variables are shifted between themselves, even strong signal may give a negligible final contribution. Some additional procedure is needed, which could tune a wavelet transform regarding to the Fourier transform of ADC samples.
This problem can be automatically solved if all Fourier coefficients were provided simultaneously in each clock cycle. A synchronous multiplication with Fourier coefficients of wavelets would give required power estimation independently of any relatively configurations of these variables. The Fourier coefficients of selected wavelets are fixed, a sliding window of N ADC samples gives all Fourier coefficients in each clock cycle. This assures that for some set of samples (if a signal appears) the product of both transforms may give a significant contribution and may be used as a trigger.
The radio signal is spread in time interval of an order of couple hundred nanoseconds, most of registered samples gave a time interval below 200 ns. The frequency window in the atmosphere, where a signal suppression is on an acceptable level (the atmosphere is relatively transparent) is ca. 30-80 MHz. According to the Nyquist theorem the sampling frequency should be twice higher than the maximal frequency in an investigated spectrum. The anti-aliasing filter should have the cut-off frequency of ca. 85 MHz. Taking into account some width of the transition range for the filter (from pass- x n e −iπkn/16 where x n as samples from an ADC chip are real. The formula (6) can be split on two or more parts by rearranging of the sum and indices. The standard approach of a formula simplification is a Radix-2 Decimation-in-Time (DiT) (Fig. 1) or Decimation-in-Frequency algorithm (DiF) (Fig. 2) one. For Radix-2 DiT, we get : x n e −2iπkn/N = (7) N-point DFT can be easily split on two N/2-point transforms. Outputs from DFT procedures are complex. So, a calculation of final DFT coefficients by using DiT algorithm requires complex multiplication for final merging data from parallel DFT procedures with lower order (i.e. multiplication of twiddle factors W k N : by G[k] and H[k] in Fig. 1. Altera R provides a library routine of a complex multiplication in the FPGA (Fig. 3), however, for i.e. 16x16 bits operation requires 6 DSP embedded 9x9 multipliers even in most economical (canonical) mode. Generally, a complex multiplication in the FPGA is rather resource-spendthrift and if possible it should be replaced by a multiplication of real variables. For Radix-2 DiF, we get : The standard Radix-2 Decimation-in-Frequency algorithm (DiF) rearranges the DFT equation (6) into two parts: computation of the even-numbered discrete-frequency indicesX (k) for k=[0,2,4,. . .,30] and computation of the odd-numbered indices k=[1,3,5,. . .,31]. This corresponds to a splitting N-point DFT into two k = N/2-point routines. The first corresponding twiddle factor is e −i 2π N N 2 = −1. The first operations are simple sums and subtractions of real variables. Each operation related to the consecutive twiddle factor will be performed in a single clock cycle. The algorithm of Decimation in Frequency used for the 32-point DFT allows splitting eq. 6 as follows: where for n = 0,1,...,15 The next twiddle factors are: where γ = cos(π/4) (18) α = cos(π/8) β = sin(π/8) (19) ξ = cos(π/16) η = sin(π/16) (20) σ = cos(3π/16) ρ = sin(3π/16) The scheme developed on a pure Radix-2 Decimation in Frequency algorithm is presented on Fig. 5. The algorithm takes into account only FFT coefficients with indices k = 0,...,15. Due to real input data (x 0,...31 ) the higher FFT coefficients have well known symmetry : ReX 32−n = ReX n and ImX 32−n = −ImX n (n > 0). Calculation ofX 0,...15 according the pure Radix-2 DiF algorithm requires 8 pipeline stages. ForX 0,4,8,12,16 2 pipeline stages are necessary only for a synchronization.
According to the eq. (10) allX 0,2,4,...,14 with even indices could be calculated by the algorithm presented in [6]. Variables x n in Fig. 2 in [6] were replaced by variable of A n according to eq. (12). An application of a modified algorithm reduces an amount of 9 × 9 multipliers from 12 to 10 only and shorten a pipeline chain on stages (the last 2 stages are simple registers for synchronization) .
Let us notice that for the odd indices stages B and C for k=16,...,19 and k = 24,...27 are pure delay lines, while for neighboring indices k=20,...,23 and k = 28,...31 mathematical operation are performed in a cascade. Let us multiply A 16,...19 and A 24,...27 by the factor λ = γ −1 . Then to adjust variables in the C stage for odd FFT coefficients (for k = 20,21,22,23 and k = 28,29,30,31) So, by such a redefinition C stage for odd FFT indices is a pure pipeline stage. It can be removed with one of pipeline stage for the even FFT indices. In order to come back to the correct values coefficients in F stage can be simple redefined but for indices k = 16,20,24,28 we have to use additional 4 multipliers. Nevertheless, at this cost we save one pipeline stage and depending on a width of buses in final FFT coefficients we should save ca. 640 logic elements (32 registers with 20-bit width). In order not to lose an accuracy in calculation, the width of the variables (and registers) increase by one in successively pipeline stages. Assuming 12-bit ADC data, we get 12-bit data bus in shift registers x k (Figures 5 -6). The bus width increases to 13, 14,..., 20 in A, B,.., last routine, respectively. We can save a next pipeline stage and more logic elements but again at the cost of additional utilized multipliers. The algorithm used for indices k = 2,6,10,14 is neither Decimation in Time nor Decimation in Frequency. The eq. (11) can be rewritten as follows: Development of the algorithm according to eq. (22) would allow a reduction of the next pipeline stage, but unfortunately at the cost of additional 16 ALTMULT ADD routines (64 DSP blocks) (see Fig. 4).
If the speed is not a factor, sums of products in the E bin routine can be performed in a single clock cycle instead of two cycles as shown on Fig. 6. Thus, D 16,20,24,28 shift registers are not necessary and can be removed. A shorter chain for the odd indices allows removing also the last pipeline chain for Fig. 4. The ALTMULT ADD procedure provided by Altera. For a calculation of |W k | 2 , dataa 0 = datab 0 and dataa 1 = datab 1. This routine requires 4 DSP 9 × 9 multipliers. It is used in E bin pipeline stage for odd FFT indices (Fig. 6). Inputs dataa 0, 1 are used for C k , datab 0, 1 for constants α, β, ξ, η, σ and ρ. The routine requires two clock cycles. Sub-products are registered in MULT0 and MULT1 DSP blocks, respectively. Thus, the sum in the next register stage.  even indices and saving totally more than 1000 logic elements without the cost of additional multipliers. However, we should be aware, that a registered performance significantly decreases from ca. 220 MHz to only 158 MHz for EP3C120F780C7.

V. WAVELET POWER CALCULATION AND MEASUREMENTS
The reference wavelets are real, however, their Fourier transform are already complex. Elementary product from eq. (3) is a product of two complex numbers: Fourier coefficients of data and Fourier coefficient of a reference wavelet. The simplest way is to use the Altera R routine from Fig. 3. However, due to a fact that the wavelet Fourier coefficients are predefined constant and finally we are going to calculate a module of a complex product as well as |W × Ψ| 2 = |W | 2 × |Ψ| 2 , we can calculate only |W | 2 and next as real number multiply by a next real |Ψ| 2 .
The biggest FPGAs from the Cyclone R III (EP3C120F780C7) and Cyclone R IV (EP4CE115F29C7) families with 576 and 532 DSP multipliers, respectively, allow implementation FFT32 routine (96 DSP blocks) + "Module" block (60 DSP blocks) + 14 or 11 "engines" (30 DSP blocks each) simultaneously for a power estimation of 14 or 11 various reference wavelets, respectively. Table I shows results calculated and measured in the Altera R 's development kit DK-DSP-3C120N for various variants for Cyclone III EP3C120F780C7 (a heart of this development kit). Results do not fully agree with our expectations.
A reduction of a single pipeline stage decreases a resource occupation on ca. 410 (not 640) logic elements. This may be due to optimization processes performed by the Quartus R II compiler to achieve the maximal registered performance. Nevertheless, for all comparisons the speed in "optimized" design is higher than for the "pure DiF". For a development of wavelet engines the "optimized variant has been selected as potentially faster.
The Quartus R II compiler estimated a power consumption for the core, a static mode and for the I/O sector. As possible, the output of registers were multiplexed to reduce an amount of output pins (all pins were achieved to HSMC connectors on the development board). According to expectation the power for I/O increase ca. linear with number of used pins. The static power consumption is on a level ca. 100 mW. It is a reasonable level. In comparison the Stratix R III chips have a huge power consumption in a static mode of ca. 600 mW, which significantly limited their application in systems supplied from solar panels. The power consumption for the "optimized" variant is ca. 35 mW higher than for the "pure DiF" solution. The additional 35 mW is not a factor, if it allows an improvement of the safety margin for the register performance. The EP3C120F780C7 allows implementation of 14 wavelet engines. A design with 12 engines has been tested. The power consumption is on a level of ca. 100-110 mW per the wavelet engine. It gives ca 2 W for 12 engines. This may be a challenge for an autonomous system supplied from solar panels.
Measurements of the power consumption for all considered variants show some discrepancies with simulations. Measures power consumption for the core increases slower with new wavelet engines than simulations show. Almost 300 mW lower Fig. 5. An internal structure of the FFT32 FPGA procedure. The algorithm uses 14 single clock-cycle multipliers (i.e. F 7 = γD 7 -each utilizes two 9x9 DSP multipliers) and 16 two clock-cycles multipliers (i.e. N 7 = βG 7 − αH 7 -each utilizes four 9x9 DSP multipliers). Totally, the algorithm needs 92 9x9 DSP multipliers. power taken by the FPGA (in comparison to simulations) for 12 engines gives optimistic predictions for the future applications. Power consumption for the core seems to be ca 15% overestimated in simulations. On the other hand, power consumption for the I/O section is unpredictable much higher than for simulations. However, differences decrease with higher amount of active pins. This, actually, is not a problem, I/O pins have been attached for test only. In real applications almost all variables are utilized as internal nodes. A power optimization is highly recommended. Designs have been also implemented into EP4CE115F29C7 from the Cyclone R IV family of Altera R used in a development kit DE2-115 (Terasic). According to the Altera R 's specification, the power consumption for the Cyclone R IV family is 30% less than for the Cyclone R III one. However, the Terasic's development kit does not contain any system allowing a measurement of the power consumption on the board.
For the Cyclone R IV EP4CE115F29C7 timing shows a pretty good safety margin.

VI. SPECTRAL LEAKAGE
For serial FFT processing the input data have to be chopped into blocks to be processed by the FFT routine. If signal pulses are located close to the border of a block, aliasing occurs. It manifests by a spurious contribution in the opposite border of the block and in the neighboring block as well. This effect may cause spurious pulses and has to be eliminated. The problem can only be solved, without introducing dead time between the blocks, by using an overlapping routine. Therefore the FFT engines have to be overclocked. Practically for 1024length blocks aliasing is reduced to a negligible level, when two blocks are overlapped during 64 time bins [3]. For parallel data processing, when all set of Fourier coefficients is available for each clock cycle aliasing can be eliminated by a selection of a set of these coefficients not significantly affected.
If a reduced set of Fourier coefficients is taken for data analysis, there is a possibility to increase an amount of wavelet engines for simultaneously analysis of more reference wavelets.

VII. DESIGN IMPROVEMENT
The new Altera's FPGA family -Cyclone R V provides the industry's lowest system cost and power, along with performance levels that make the device family ideal for differentiating your high-volume applications. A total power consumption compared with the previous generation (Cyclone R IV) is reduced up to 40%.
The biggest FPGA from the Cyclone R V E family 5CEA9 (with logic only without ARM-based hard processor system (HPS) contains 684 DSP 18 × 18 multipliers + 342 variableprecision DSP blocks (DSP blocks include three 9x9, two 18x19, and one 27x27 multiplier). Assuming roughly a single 18 × 18 multiplier is equivalent to two 9 × 9 ones, 5CEA9 could implement FFT32 + 18 engines for various 18 reference wavelets. However, the 5CEA9 FPGA is not yet available even for compilation (latest Quartus R II version 12.0). An estimation for 12 wavelet engines for 5CGXFC7 FPGA shows the scarcity of DSP blocks. Fast multipliers are replaced by logic elements, which significantly reduced the register performance for slow models, below our requirements. Nevertheless, if all multiplication all implemented in the fast DSP blocks (see Table II Cyclone R V for 4 wavelet engines only), timing is perfect. This allows anticipating also a perfect timing for the 5CEA9 chip. Expected total 58% less power consumption (30% and next 40% of reduction of power consumption from Cyclone R III to Cyclone R V) gives an estimation of 840 mW for 12 and 1260 mW for 18 wavelet engines, respectively. It is acceptable level of the power consumption for currently used supply systems in cosmic rays experiments.

VIII. CONCLUSION
The FFT32 routine has been successfully and costeffectively implemented into the powerful FPGA EP3C120F780C7 from the Cyclone R III family used in a development kit DK-DSP-3C120N (Altera R ) [11] and EP4CE115F29C7 from the Cyclone R IV family of Altera R used in a development kit DE2-115 (Terasic) [12], [13] .
Nevertheless, both FPGAs from Cyclone R III and IV families were treated as an engineering test platform for a development of the algorithm and a timing verification. The prototype targeted for real detection of radio signals coming from air showers developing in the atmosphere will be built on a basis of Cyclone R V family.