# High Performance FPGA-oriented Mersenne Twister Uniform Random Number Generator

## Abstract

Mersenne Twister (MT) uniform random number generators are key cores for hardware acceleration of Monte Carlo simulations. In this work, two different architectures are studied: besides the classical table-based architecture, a different architecture based on a circular buffer and especially targeting FPGAs is proposed. A 30% performance improvement has been obtained when compared to the fastest previous work. The applicability of the proposed MT architectures has been proven in a high performance Gaussian RNG.

### Keywords

Mersenne-Twister, URNG, FPGA, Monte Carlo

## 1 Introduction

The growing integration densities of current technologies allow configurable logic to be used as a powerful platform for hardware acceleration. This is the case for numerous Monte Carlo simulations where the models of the system under study are very complex. Parallelism is intrinsic to Monte Carlo methods, as they are based on the replication of the same model with different underlying variables, which are simulated with random numbers. This makes such simulations ideal for parallel architectures.

In this context, uniform random number generators (URNGs) play a key role when developing hardware accelerators for Monte Carlo simulations. URNGs are the base element for any random number generation, such as the Gaussian RNGs (GRNGs) that are widely used in this type of simulation. Any URNG used as a basic core in a hardware accelerator must provide random samples with a throughput that does not limit the frequency of the whole Monte Carlo simulation. Furthermore, it should require as few resources as possible, because resources may be needed by the model to simulate or for replicating the whole Monte Carlo system. Finally, it should provide high-quality random samples, as the quality of the random numbers directly impacts the accuracy of the results.

Previous works on URNGs using FPGAs can be split into two fields: the development of specific URNGs for FPGAs, and the adaptation of software URNGs to FPGAs. In the first field we can highlight several works from Thomas et al., such as [1, 2], where different URNGs are proposed, studied and developed exploiting the specific resources and architecture of FPGAs.

Here we concentrate on the second field, the adaptation of software URNGs, due to practical issues like compatibility and application debugging. When developing a hardware accelerator a desired feature is the complete compatibility between the original software application and the accelerated one. Furthermore, the complete compatibility eases the hardware application debugging. Compatibility implies obtaining the same sequence of random numbers. In this work we focus on a very well known generator, the Mersenne Twister [3] which is broadly used in computational science based on Monte Carlo simulations due to its very high quality, superb period and high performance [3].

Previous works in this field [4, 5, 6] are based on the most used Mersenne Twister configuration for 32 bit samples, the MT19937, due to its high quality and the simplifications introduced by its set of parameters. However, none of these previous implementations fulfills the desired characteristics mentioned above. Sriram and Kearney [4] and Chandrasekaran and Amira [5] present a slow clock rate, while in [6] the first part of the algorithm, the initialization, is not implemented in hardware, being thus an incomplete generator. It is an important issue because if the initialization phase has to be carried out in software it complicates the interface with the hardware accelerator (the whole initialization table must be transferred to hardware instead of just the seed).

The architectures proposed in this work have the following characteristics:

1. All in hardware.
2. Capable of generating one sample per cycle.
3. Highly efficient in area and performance.

The proposed architectures take advantage of the FPGA structure and resources. Following this idea, in addition to the classical memory table implementation a new architecture is proposed, specifically designed to avoid the use of FPGA internal RAMs as these resources become essential for some simulation models.

Finally, the proposed MT architectures have been used in a high performance GRNG to validate their applicability and study the area cost they imply.

The paper is structured as follows: Section 2 briefly summarizes the Mersenne Twister algorithm. In Section 3 the proposed hardware architectures are presented, while their experimental results and integration within a GRNG are analyzed in Section 4. Finally, some conclusions are drawn.

## 2 Mersenne Twister Algorithm

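The Mersenne Twister generates its samples through a linear recurrence over vectors of *w* bits. The algorithm is split into three different tasks:

1. Initialization: generation of the first *n* vectors of the recurrence from a seed.
2. Obtaining the linear recurrence.
3. The *tempering* of the generated variables from the linear recurrence.

The initialization task fills a work area of *n* variables of *w* bits. This initialization takes the seed as the first element of the recurrence, \(x_0\), while the other *n* − 1 variables are generated iteratively, each one from the previous element.

Once the work area holds the first *n* variables, the next random numbers are calculated following equation:

\[ x_{k+n} = x_{k+m} \oplus \left( x_k^u \mid x_{k+1}^l \right) A \tag{1} \]

with *k* = 1, 2, ..., where *m* points to a position in the working area (1 ≤ *m* ≤ *n*), *A* is a matrix, \(\mid\) denotes concatenation, \(x_k^u\) stands for the (*w* − *r*) most significant bits of \(x_k\) and \(x_{k+1}^l\) means the *r* least significant bits of \(x_{k+1}\).

Finally, every variable obtained from the linear recurrence is *tempered*, with a bitwise multiplication by a *w* × *w* binary matrix (*T*):

\[ z = x_{k+n} \, T \tag{2} \]

where *z* is the output of the generator.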

### 2.1 MT19937 Algorithm Simplifications

The complexity of the algorithm and its computational load are mainly due to the two matrix multiplications in Eqs. 1 and 2. However, both multiplications are greatly simplified with a suitable selection of the elements of the matrices. This is what happens with the MT19937 set of parameters.

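For the linear recurrence, matrix *A* follows the form:

\[ A = \begin{pmatrix} 0 & 1 & & & \\ & 0 & 1 & & \\ & & \ddots & \ddots & \\ & & & 0 & 1 \\ a_{w-1} & a_{w-2} & \cdots & a_1 & a_0 \end{pmatrix} \]

so that the multiplication by *A* reduces to a shift and a conditional XOR:

\[ xA = \begin{cases} x \gg 1 & \text{if the least significant bit of } x \text{ is } 0 \\ (x \gg 1) \oplus a & \text{if the least significant bit of } x \text{ is } 1 \end{cases} \]

where *a* is the *w*^{th} row of matrix *A*.

Regarding the *tempering* matrix multiplication, again the matrix *T* is selected in such a way that this multiplication is simplified into several logical bitwise operations:

\[ \begin{aligned} y &= x \oplus (x \gg u) \\ y &= y \oplus \left( (y \ll s) \mathbin{\&} b \right) \\ y &= y \oplus \left( (y \ll t) \mathbin{\&} c \right) \\ z &= y \oplus (y \gg l) \end{aligned} \]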

As just seen, the MT URNG depends on multiple parameters (w, n, m, r, a, u, s, b, t, c, l). MT19937 corresponds to the set (32, 624, 397, 31, 9908B0DF, 11, 7, 9D2C5680, 15, EFC60000, 18), where *a*, *b* and *c* are given in hexadecimal.
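The simplifications above can be checked with a small software model. The following Python sketch is only illustrative and is not the hardware design: it applies the linear recurrence and the tempering to a work area supplied by the caller, using the MT19937 parameter set. The function names (`times_a`, `temper`, `mt_outputs`) are hypothetical, and the work area is indexed from 0.

```python
# Software sketch of the MT19937 recurrence and tempering (illustrative only).
W, N, M, R = 32, 624, 397, 31
A = 0x9908B0DF
U, S, B, T, C, L = 11, 7, 0x9D2C5680, 15, 0xEFC60000, 18

WORD_MASK = (1 << W) - 1
LOWER_MASK = (1 << R) - 1             # r least significant bits
UPPER_MASK = WORD_MASK ^ LOWER_MASK   # (w - r) most significant bits


def times_a(x):
    """Multiplication by matrix A, reduced to a shift and a conditional XOR."""
    shifted = x >> 1
    return shifted ^ A if x & 1 else shifted


def temper(x):
    """Multiplication by matrix T, reduced to shifts, ANDs and XORs."""
    y = x ^ (x >> U)
    y ^= (y << S) & B
    y ^= (y << T) & C
    return y ^ (y >> L)


def mt_outputs(work_area):
    """Yield tempered samples forever, starting from an initialized work area."""
    x = list(work_area)
    if len(x) != N:
        raise ValueError("the work area must hold exactly n words")
    k = 0
    while True:
        # Concatenate the upper bits of x_k with the lower bits of x_{k+1}.
        concat = (x[k] & UPPER_MASK) | (x[(k + 1) % N] & LOWER_MASK)
        # x_{k+n} overwrites x_k in the work area.
        x[k] = x[(k + M) % N] ^ times_a(concat)
        yield temper(x[k])
        k = (k + 1) % N
```

Each iteration of `mt_outputs` corresponds to one generated sample, which mirrors the one-sample-per-cycle goal; the modulo indexing plays the role of the table index logic.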

## 3 Hardware Architecture

The architecture comprises the logic of the three tasks of the algorithm together with the *n* × *w* work area of the linear recurrence. Not depicted in the figure, a small control logic is also needed to handle the two possible scenarios: the initialization, and the linear recurrence with the tempering.

The hardware must be able to generate one sample per cycle at a high clock rate. The linear recurrence and the tempering can both easily fulfill the criterion of obtaining one sample per cycle while achieving a high clock rate. The reduction of the matrix multiplications to bitwise operations ensures fast datapaths, as these operations perfectly suit FPGA technology. Furthermore, due to the depth of the work area and the dependencies among samples, the logic of both tasks can be pipelined to increase the clock rate.

The initialization task, besides its bitwise operations, also requires a 32 bit multiplication and a 32 bit addition. These two operations compose a slow datapath with much more logic than the other two tasks. Although initialization stops working once the first *n* samples are generated, the clock rate of the MT generator is also determined by this logic and therefore, a second pipeline level is required.
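As an illustration of the operations involved, the sketch below models the initialization in Python. It assumes the seeding routine of the 2002 reference MT19937 implementation (`init_genrand`), whose multiplier 1812433253 and shift of *w* − 2 bits do not appear in the text above. Splitting the datapath into a multiply stage and an add stage is also an assumption, made only to suggest where a pipeline register could sit; the function names are hypothetical.

```python
# Software sketch of the MT19937 initialization (assumed reference seeding).
W, N = 32, 624
INIT_MULTIPLIER = 1812433253  # assumption: constant of the reference init_genrand
WORD_MASK = (1 << W) - 1


def init_stage_multiply(prev):
    """First stage: bitwise operations followed by the 32-bit multiplication."""
    return (INIT_MULTIPLIER * (prev ^ (prev >> (W - 2)))) & WORD_MASK


def init_stage_add(product, index):
    """Second stage: the 32-bit addition of the element index."""
    return (product + index) & WORD_MASK


def initialize(seed):
    """Return the n-word work area generated from a 32-bit seed."""
    work_area = [seed & WORD_MASK]
    for i in range(1, N):
        product = init_stage_multiply(work_area[-1])
        work_area.append(init_stage_add(product, i))
    return work_area
```

Because each element depends on the previous one, the two stages cannot overlap for consecutive elements in this model; how the actual hardware schedules them is not described here.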

The remaining design decision is how to implement the *n* × *w* work area of the linear recurrence. There are two suitable options: a storage table, Fig. 2a, which is the solution adopted in previous works, and a circular buffer, the solution proposed in this work, see Fig. 2b.

In the first case, a three port storage table with two read and one write ports is needed (3P_Table from now on). In an FPGA, this three port table has a direct translation into two dual port tables implemented by embedded Block-RAMs plus the logic required for updating the indexes for the table addresses.

A second option is the use of a circular buffer (CB) of registers, taking advantage of the fixed relationship between the indexes of the words and of the fact that, at each step of the recurrence, *x*_{k} is replaced by *x*_{k + n} in the work area. This way, the linear recurrence (L. R. in Fig. 2b) and the buffer of registers can be considered as a circular buffer where the linear recurrence is carried out by some combinational logic between the input and the output of the buffer. Hence the architecture is simplified, as no logic for the table indexes is needed.
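The equivalence between both storage options can be sketched in software. In the following Python model, which is an illustration and not the hardware description, the table version needs explicit index arithmetic, whereas the circular buffer reads three fixed taps (positions 0, 1 and *m*) and shifts by one word per step. Tempering is omitted, and all names are hypothetical.

```python
# Table-based versus circular-buffer model of the linear recurrence.
from collections import deque
from itertools import islice

N, M, A = 624, 397, 0x9908B0DF
UPPER_MASK, LOWER_MASK = 0x80000000, 0x7FFFFFFF


def recurrence(x_k, x_k1, x_km):
    """Combinational logic of the linear recurrence for MT19937."""
    y = (x_k & UPPER_MASK) | (x_k1 & LOWER_MASK)
    return x_km ^ (y >> 1) ^ (A if y & 1 else 0)


def table_stream(state):
    """Storage table: two reads and one write per step, with index updates."""
    table = list(state)
    k = 0
    while True:
        table[k] = recurrence(table[k], table[(k + 1) % N], table[(k + M) % N])
        yield table[k]
        k = (k + 1) % N


def circular_buffer_stream(state):
    """Circular buffer: fixed taps, no index logic."""
    buf = deque(state, maxlen=N)
    while True:
        new_word = recurrence(buf[0], buf[1], buf[M])
        buf.append(new_word)  # a full deque drops its oldest word, x_k
        yield new_word


# Usage: both models produce the same sequence from the same work area.
start = [(i * 2654435761) & 0xFFFFFFFF for i in range(1, N + 1)]
same = (list(islice(table_stream(start), 2000))
        == list(islice(circular_buffer_stream(start), 2000)))
```

The fixed tap positions are what allow the buffer to be built from shift registers in hardware, since no word is ever addressed by a changing index.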

## 4 Implementation Results

Outstanding maximum speeds have been achieved: 418.6 MHz for Virtex5 and 345.9 MHz for Virtex4. In more detail, and for the same FPGA (Virtex4 FX100), the 3P_Table architecture outperforms the clock rate of the fastest previous work [6] by 30.5%, even though that work did not implement the initialization task. It can also be seen that performance does not come at the cost of resources: the increase in slices (128 to 183) and DSPs (0 to 3) is mainly due to the initialization stage and not to the performance improvement.

### 4.1 Architectures Comparison

The main reason to select between storage table or circular buffer of registers is the resource usage, as the Block RAMs required in the 3P_Table architecture become logic slices for the CB implementation. However, this effect is highly related to the FPGA family and the model selected, as will be analyzed next.

Regarding Virtex4 devices, four BRAMs are required (1.06% of the total BRAMs), whose replacement increases the required slices by 682 (1.61% of the total). For Virtex5, the CB architecture benefits from the increase of LUT inputs (six, compared with four in Virtex4), which allows the implementation of 32-bit SRLs instead of the 16-bit SRLs of Virtex4. Thereby, the implementation of the work area in logic is drastically more compact in Virtex5. Meanwhile, the 3P_Table implementation also benefits from the increase of Block RAM capacity, now requiring just two BRAMs, which represent 0.44% of the total BRAMs. In this case the increase of slices (87) represents just 0.28% of the total, a smaller percentage than the 0.44% of the total BRAMs taken by the two BRAMs.
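The BRAM counts quoted above are consistent with a simple capacity estimate. The sketch below assumes block RAM sizes of 18 Kb for Virtex4 and 36 Kb for Virtex5, figures taken from the device families and not stated in this text, and one full copy of the work area per read port, in line with the two dual-port tables of the 3P_Table architecture. The function name is hypothetical.

```python
# Capacity estimate for the 3P_Table work area (assumed block RAM sizes).
from math import ceil

N_WORDS, WORD_BITS = 624, 32
WORK_AREA_BITS = N_WORDS * WORD_BITS  # 19,968 bits

VIRTEX4_BRAM_BITS = 18 * 1024  # assumption: 18 Kb blocks
VIRTEX5_BRAM_BITS = 36 * 1024  # assumption: 36 Kb blocks


def brams_for_3p_table(bram_bits, read_ports=2):
    """BRAMs needed when the work area is replicated once per read port."""
    brams_per_copy = ceil(WORK_AREA_BITS / bram_bits)
    return read_ports * brams_per_copy


virtex4_brams = brams_for_3p_table(VIRTEX4_BRAM_BITS)  # 2 copies x 2 blocks
virtex5_brams = brams_for_3p_table(VIRTEX5_BRAM_BITS)  # 2 copies x 1 block
```

Under these assumptions the work area slightly exceeds one 18 Kb block, which would explain why Virtex4 needs twice as many BRAMs as Virtex5.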

### 4.2 Using an MT URNG in a Gaussian RNG

As URNGs are base elements for obtaining generators from other distributions, it is desirable to study the impact of the designed MT URNGs when used on other hardware generators providing commonly used distributions in Monte Carlo simulations. This is the case of the Gaussian RNGs (GRNG), so we have included our MT URNGs in a GRNG [7] that previously used a Tausworthe combined generator (Taus88) [8]. The Taus88 URNG presents high performance and very low use of resources, although its quality is not very good [1].

The Gaussian generation method of [7] is the inversion method. This method ensures that the transformed (Gaussian) distribution inherits the statistical properties of the base (uniform) RNG, and therefore a high-quality URNG is necessary to achieve a high-quality GRNG.
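The inversion method can be illustrated in a few lines of Python. This sketch uses the exact inverse CDF from the standard library as a stand-in; it is not the quintic Hermite interpolation of [7], and the mapping of a 32-bit word to the open interval (0, 1) is an assumption. The function name is hypothetical.

```python
# Inversion method sketch: uniform 32-bit word -> standard Gaussian sample.
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def uniform_word_to_gaussian(word, bits=32):
    """Map an unsigned `bits`-wide uniform word through the inverse normal CDF."""
    # The half-step offset keeps u strictly inside (0, 1), where inv_cdf is defined.
    u = (word + 0.5) / (1 << bits)
    return STANDARD_NORMAL.inv_cdf(u)
```

Since the transformation is monotonic, any structure in the uniform samples carries over directly to the Gaussian ones, which is why the quality of the URNG matters.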

The table below summarizes the implementation results. The first row corresponds to the inversion (\(CDF^{-1}\)) method without URNG. The next rows show information for the whole GRNGs including the three different URNGs. As seen in the table, the impact of using a much better URNG (MT) in the GRNG is not very significant, taking into account that most GRNG logic is devoted to the inverse CDF function, which in turn requires a small percentage of the resources of the FPGA. Furthermore, the GRNG working frequency is not limited by the URNG.

Virtex5 FX200: GRNG results.

| | Slices | BRAM | DSP | MHz |
|---|---|---|---|---|
| \(CDF^{-1}\) (no URNG) | 946 (3.1%) | 5 (0.5%) | 10 (2.6%) | 280.5 |
| Taus88 | 1024 | 5 | 10 | 280.9 |
| MT 3P_T | 966 | 7 | 13 | 281.1 |
| MT CB | 1125 | 5 | 13 | 281.4 |

## 5 Conclusions

The Mersenne Twister URNG is an ideal core for Monte Carlo simulation due to its high quality, superb period and high performance.

This work provides two efficient implementations of the MT URNG specifically designed for FPGAs. The two implementations differ in the storage element selected for the linear recurrence work area. With a careful design of the logic, and a pipelined implementation of the initialization task, our architectures present a 30% performance improvement with respect to the fastest previous work. The proposed MT architectures have been used in a high performance GRNG to prove their applicability.

## Notes

### Acknowledgement

This work has been funded by BBVA contract P060920579 and Cicyt project TEC2009-08589.

### References

1. Thomas, D. B., & Luk, W. (2007). High quality uniform random number generation using LUT optimised state-transition matrices. *Journal of VLSI Signal Processing, 47*, 77–92.
2. Thomas, D. B., & Luk, W. (2010). FPGA-optimised uniform random number generator using LUTs and shift registers. In *Intl. conf. on field programmable logic and applications* (pp. 77–82).
3. Matsumoto, M., & Nishimura, T. (1998). Mersenne twister: A 623-dimensionally equidistributed uniform pseudo-random number generator. *ACM Transactions on Modeling and Computer Simulation, 8*(1), 3–30.
4. Sriram, V., & Kearney, D. (2006). An area time efficient field programmable Mersenne twister uniform random generator. In *Intl. conf. engineering of reconfigurable system and algorithms* (pp. 244–246).
5. Chandrasekaran, S., & Amira, A. (2008). High performance FPGA implementation of the Mersenne twister. In *International symposium on electronic design, test & applications* (pp. 482–485).
6. Xiang, T., & Benkrid, K. (2009). Mersenne twister random number generation on FPGA, CPU and GPU. In *NASA/ESA conf. on adaptative hardware and systems* (pp. 460–463).
7. Echeverría, P., & López-Vallejo, M. (2007). FPGA Gaussian random number generator based on quintic Hermite interpolation inversion. In *IEEE intl. midwest symposium on circuits and systems* (pp. 871–974).
8. L'Ecuyer, P. (1996). Maximally equidistributed combined Tausworthe generators. *Mathematics of Computation, 65*(213), 203–213.