Bringing Flexibility to FPGA Based Pricing Systems

  • Christian Brugger
  • Christian De Schryver
  • Norbert Wehn

Abstract

High-speed and energy-efficient computations are mandatory in the financial and insurance industry to survive in competition and meet the federal reporting requirements. While FPGA based systems have been shown to provide huge speedups, they are perceived to be much harder to adapt to new products. In this chapter we introduce HyPER, a novel methodology for designing Monte Carlo based pricing engines for hybrid CPU/FPGA systems. Following this approach, we derive a high-performance and flexible system for exotic option pricing in the state-of-the-art Heston market model. As an example, we show how to find an efficient implementation for barrier option pricing on the Xilinx Zynq 7020 All Programmable SoC with HyPER. The constructed system is nearly two orders of magnitude faster than high-end Intel CPUs, while consuming the same power.

8.1 Introduction

The recent advances in financial market models and products with ever increasing complexity, as well as more stringent regulations on risk assessment from federal agencies, have led to a steadily growing demand for computational power. Additionally, increasing energy costs force finance and insurance institutes to consider new technologies for executing their computations. Graphics Processing Units (GPUs) have already demonstrated their benefit for speeding up financial simulations and are state of the art in the finance business nowadays [2, 21].

However, Field Programmable Gate Arrays (FPGAs) have been shown to outperform GPUs with respect to speed and energy efficiency by far for those tasks [6, 15, 17]. They are currently starting to emerge in finance institutes such as J.P. Morgan [1, 7] or Deutsche Bank [12]. Nevertheless, most problems cannot be efficiently ported to pure data path architectures, since they contain algorithmic steps that are executed best on a Central Processing Unit (CPU).

Hybrid devices as shown in Fig. 8.1 combine standard CPU cores with a reconfigurable FPGA area, connected over multiple high-bandwidth channels. They allow running an Operating System (OS) that is able to (re-)configure the FPGA part at runtime, e.g. for instantiating problem specific accelerators. A prominent example is the recent Xilinx Zynq All Programmable System on Chip (SoC).
Fig. 8.1

In this work we target reconfigurable hybrid systems, i.e. heterogeneous FPGA/CPU platforms. With this setup we can exploit the efficiency of reconfigurable logic and the flexibility of a processor, combining the best of both worlds

In addition to the technological improvements, there are advances in the algorithmic domain as well. Although classical Monte Carlo (MC) methods still prevail, Multilevel Monte Carlo (MLMC) methods, for example, are more and more called into action [8, 10]. They can reduce the total computational effort, but demand more complex control and therefore a more flexible execution platform.

In this chapter, we illustrate how we can combine the benefits of dedicated hardware accelerators with high flexibility as required by many practical applications on hybrid CPU/FPGA systems. For this purpose we have combined the current trends both from technology and computational stochastics to an option pricing platform for reconfigurable hybrid architectures. The proposed HyPER framework can handle a wide range of option types, is based on the state-of-the-art Heston model, and extensively uses dynamic runtime reconfiguration during the simulations. To derive the architecture, we have applied a platform based design methodology including Hardware/Software (HW/SW) split and dynamic reconfiguration.

In particular, we focus on the following points:
  • We propose an energy-efficient and modular option pricing framework called HyPER that is generically applicable to all kinds of hybrid CPU/FPGA platforms.

  • We show how the special characteristics arising from reconfigurable hybrid systems can be included in a platform based design methodology.

  • We have implemented a HyPER configuration on the Xilinx Zynq-7000 All Programmable SoC for a setup relevant to practitioners. For this implementation we give detailed area, performance, and energy numbers.

8.2 Background and Related Work

The use of FPGAs for accelerating financial simulations has been attractive since the first devices became available. Many papers propose efficient random number generation and path generation methods [5, 13, 14, 18, 19, 20, 22]. Although most focus on the Black-Scholes market model, there are a few publications on non-constant volatility models as well. Benkrid [20] and Thomas, Tse, and Luk [18, 22] have thoroughly investigated the potential of FPGAs and heterogeneous platforms for the Generalized Autoregressive Conditional Heteroskedasticity (GARCH) setting in particular. Thomas came up with a Domain-Specific Language (DSL) for reconfigurable path-based MC simulations in 2007 [18] that supports GARCH as well. It allows describing various path generation mechanisms and payoffs and can generate software and hardware implementations. In that respect, Thomas' DSL is similar to our proposed framework. However, it incorporates neither MLMC simulations nor automatic HW/SW splitting.

For the Heston setting, Delivorias has demonstrated the enormous speedup potential of FPGAs for classical MC simulations compared to CPUs and GPUs in 2012 [6]. The results are included in Chap.  3, but contain neither energy nor synthesis numbers.

De Schryver et al. have shown in 2011 that Xilinx Virtex-5 FPGAs can save around 60 % of energy compared to a Tesla C2050 GPU [15]. Sridharan et al. have extended this work to multi-asset options in 2012 [17], showing speedups of up to 350× for one FPGA device compared to an SSE reference model on a multi-core CPU. De Schryver et al. have enhanced their architecture further to support modern MLMC methods in 2013 [16]. Their architecture is the basis for the implementation proposed in this chapter.

Our HyPER platform was first presented in [3]. A hardware prototype was exhibited at the ReConFig 2013 and the FPL 2014 conferences.

8.2.1 Heston Model

The Heston model is a mathematical model used to price products on the stock market [9]. Nowadays, it is widely used in the financial industry. One main reason is that the Heston model is complex enough to describe important market features, especially volatility clustering [10]. At the same time, closed-form solutions for simple products are available. This is crucial to enable calibrating the model against the market in realistic time.

In the Heston model the price \(S\) and the volatility \(\nu\) of an economic resource are modeled as stochastic differential equations:
$$\displaystyle\begin{array}{rcl} dS_{t}& =& S_{t}\,r\,dt + S_{t}\sqrt{\nu _{t}}\,dW_{t}^{S}, \\ d\nu _{t}& =& \kappa \left (\theta -\nu _{t}\right )dt + \sigma \sqrt{\nu _{t}}\,dW_{t}^{\nu }.{}\end{array}$$
(8.1)
The price \(S\) can reflect any economic resource like assets or indices such as the S&P 500 or the Dow Jones Industrial Average. \(S\) can also be the stock price of a company. The volatility \(\nu\) is a measure of the observable fluctuations of the price \(S\). The fair price of a derivative today can be calculated as \(P =\mathrm{ \mathbb{E}}\left [g\left (S_{t}\right )\right ]\), where g is a corresponding discounted payoff function. Although closed-form solutions for simple payoffs like vanilla European call or put options exist, so-called exotic derivatives like barrier, lookback, or Asian options must be priced with compute-intensive numerical methods in the Heston model [10]. A very common and universal choice are the Monte Carlo (MC) methods that we consider in this chapter.

8.2.2 Monte Carlo Methods for the Heston Model

Simulating the Heston model in Eq. (8.1) requires the application of an appropriate discretization scheme. In this work we have applied Euler discretization, which has been shown to work well in the MLMC Heston setting [11]. Discretizing Eq. (8.1) into k steps with equal step size \(\varDelta t = \frac{T} {k}\) leads to the discrete Heston equations given by:
$$\displaystyle\begin{array}{rcl} \hat{S}_{t_{ i+1}}& =& \hat{S}_{t_{i}} + r \hat{S}_{t_{i}}\varDelta t + \hat{S}_{t_{i}}\sqrt{\hat{\nu} _{t_{i }}}\sqrt{\varDelta t}\left (\sqrt{1 - \varrho ^{2}}\,Z_{i}^{S} + \varrho \,Z_{i}^{\nu }\right ), \\ \hat{\nu} _{t_{i+1}}& =&\hat{\nu} _{t_{i}} + \kappa (\theta -\hat{ \nu} _{t_{i}})\varDelta t + \sigma \sqrt{\hat{\nu} _{t_{i }}}\sqrt{\varDelta t}\;Z_{i}^{\nu }. {}\end{array}$$
(8.2)

While the initial asset price \(S_{0} = \hat{S}_{t_{0}}\) and r can be observed directly at the market, the five Heston parameters κ, θ, σ, ϱ, and \(\nu _{0} =\hat{ \nu} _{t_{0}}\) are obtained through calibration, compare Chaps.  2 and 10. \(Z_{i}^{S}\) and \(Z_{i}^{\nu }\) are two independent normally distributed random variables with mean zero and variance one. With this method an approximated solution \(\hat{S}_{t}\) can be obtained by linearly interpolating \(\hat{S}_{t_{0}},\ldots, \hat{S}_{t_{k}}\).

The classic MC algorithm estimates the price \(P =\mathrm{ \mathbb{E}}\left [g\left (S_{t}\right )\right ]\) of a European derivative with a final payoff function \(g(S)\) as the sample mean of simulated instances of the discounted payoff values \(g( \hat{S})\), i.e.,
$$\displaystyle\begin{array}{rcl} \mathcal{A}^{\mathrm{std}} = \frac{1} {N}\sum _{i=1}^{N}g( \hat{S}_{ i}),& & {}\\ \end{array}$$
where \(\hat{S}_{1},\ldots, \hat{S}_{N}\) are independent identically distributed copies of \(\hat{S}\).

For the implementation, we have used the same algorithmic refinements as in the data path presented in [16] (antithetic variates, full truncation, log price simulation).
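To make the discretization concrete, the following C++ sketch simulates one Euler path in the sense of Eq. (8.2), using the log-price and full-truncation refinements mentioned above. It is a minimal software illustration with assumed parameter names, not the HyPER hardware data path; antithetic variates are omitted for brevity.

```cpp
#include <algorithm>
#include <cmath>
#include <random>

// Heston parameters as in Eq. (8.1); member names are illustrative.
struct HestonParams { double r, kappa, theta, sigma, rho, v0, s0; };

// Simulate one Euler path with k steps over [0, T] and return S_T.
// Uses log-price simulation and full truncation (max(v, 0) under the square
// root and in the drift), mirroring the refinements listed above.
double euler_path(const HestonParams& p, double T, int k, std::mt19937& rng) {
    std::normal_distribution<double> n01(0.0, 1.0);
    const double dt = T / k, sqdt = std::sqrt(dt);
    double x = std::log(p.s0);   // log price
    double v = p.v0;             // variance process
    for (int i = 0; i < k; ++i) {
        const double zv = n01(rng);                                   // Z_i^nu
        const double zs = std::sqrt(1.0 - p.rho * p.rho) * n01(rng)
                        + p.rho * zv;                                 // correlated Gaussian for the price
        const double vp = std::max(v, 0.0);                           // full truncation
        x += (p.r - 0.5 * vp) * dt + std::sqrt(vp) * sqdt * zs;       // log-price update
        v += p.kappa * (p.theta - vp) * dt + p.sigma * std::sqrt(vp) * sqdt * zv;
    }
    return std::exp(x);
}
```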

8.2.3 The Multilevel Monte Carlo Method

The MLMC method as proposed by Giles in 2008 uses different discretization levels within one MC simulation [8]. It is based on an iterative result refinement strategy, starting from low levels with coarse discretizations and adding corrections from simulations on higher levels with finer discretizations. Figure 8.2 illustrates a continuous stock path with two different discretizations (4 and 8 steps). It is obvious that the computational effort required to compute one path increases for higher levels. For a predefined accuracy of the result, the MLMC method tries to balance the computational effort over all levels; therefore many more paths are computed on lower levels (with coarser discretizations). Since the variances decrease for finer discretizations, it is sufficient to simulate fewer paths on higher levels. In total, this leads to an asymptotically lower computational effort for the complete simulation [8]. For our investigated financial product, European barrier options, MLMC has explicitly been shown to provide benefits also for practical constellations [11].
Fig. 8.2

MLMC approximates the real stock path, which carries infinite information, with multiple levels of discretization. In this case the path is approximated with four and eight discretization points

Let us formalize the idea. Without loss of generality one can assume that \(k = M^{L-1}\) discretization points for a fixed M and some integer L are sufficient to obtain the desired accuracy. We define \(\hat{S}^{(l)}\) for \(l = l_{0},\ldots,L\) as the approximated solution of Eq. (8.1) with \(M^{l-1}\) discretization points. Then, in contrast to the classic MC estimate, where only the “single” approximation \(\hat{S}^{(L)}\) is used, one considers the sequence of approximations \(\hat{S}^{(l_{0})},\ldots, \hat{S}^{(L)}\). With the telescoping sum
$$\displaystyle\begin{array}{rcl} \hat{P} =\mathrm{ \mathbb{E}}\left [g\left (\hat{S}^{(L)}\right )\right ]& =& \mathrm{\mathbb{E}}\Big[\mathop{\underbrace{g\left (\hat{S}^{\left (l_{0}\right )}\right )}}\limits _{\hat{ D}_{l_{0}}}\Big] +\sum _{ l=l_{0}+1}^{L}\mathrm{\mathbb{E}}\Big[\mathop{\underbrace{g\left (\hat{S}^{\left (l\right )}\right ) - g\left (\hat{S}^{\left (l-1\right )}\right )}}\limits _{\hat{ D}_{l}}\Big]{}\end{array}$$
(8.3)
the single expected value is replaced by expected values of differences. Each of the expectations on the right-hand side is called a level. The MLMC algorithm approximates each of these levels with an independent classic MC algorithm. To obtain a convergent and efficient MLMC algorithm, it is important that the variances of the levels
$$\displaystyle\begin{array}{rcl} V _{l} =\mathrm{ \mathbb{V}ar}\left [\hat{D}_{l}\right ]& & {}\\ \end{array}$$
decay to zero fast enough. One way to achieve a sufficiently fast convergence of \(V_{l}\) is to choose a suitable discretization scheme and to let \(\hat{S}^{(l)}\) and \(\hat{S}^{(l-1)}\) depend on the same Brownian path. In the end the MLMC algorithm aims at reducing the overall computational cost by optimally distributing the workload over all levels [8].
In our setup it has been explicitly shown that Euler discretization is sufficient [11]. Using the discretized Heston model, Eq. (8.2), the price P can be calculated according to Eq. (8.3) with L individual MC algorithms. To reach the target accuracy ɛ, \(N_{l}\) paths are evaluated on each level l, given by:
$$\displaystyle\begin{array}{rcl} N_{l} =\bigg \lceil \varepsilon ^{-2}\sqrt{V _{ l}}\sum _{k=l_{0}}^{L}\sqrt{V _{ k}}\bigg\rceil.& &{}\end{array}$$
(8.4)
The level variances are estimated with an initial \(N_{l} = 10^{4}\) samples. To let \(\hat{S}^{(l)}\) and \(\hat{S}^{(l-1)}\) depend on the same Brownian path, the random numbers of the fine path are reused for the coarse path by adding up the M corresponding random numbers of the fine path.
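A minimal sketch of this coupling, assuming the unit-variance Gaussians \(Z_{i}^{\nu }\), \(Z_{i}^{S}\) of Eq. (8.2) are available as a plain array; the function and variable names are illustrative:

```cpp
#include <cmath>
#include <vector>

// Derive the unit-variance Gaussians driving the coarse path from the fine
// path: each coarse step spans M fine steps, so its Gaussian is the sum of
// the M fine Gaussians, rescaled by 1/sqrt(M) to keep unit variance
// (Eq. (8.2) multiplies by sqrt(delta t), which is sqrt(M) times larger on
// the coarse grid).
std::vector<double> coarse_gaussians(const std::vector<double>& z_fine, int M) {
    std::vector<double> z_coarse;
    for (std::size_t i = 0; i + M <= z_fine.size(); i += M) {
        double sum = 0.0;
        for (int j = 0; j < M; ++j) sum += z_fine[i + j];
        z_coarse.push_back(sum / std::sqrt(static_cast<double>(M)));
    }
    return z_coarse;
}
```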

8.3 Methodology

The classical MC algorithm only uses one fixed discretization scheme and is very regular. MLMC methods as introduced in the previous section are more complicated and rely on an iterative scheme with high inherent dynamics. For both methods dedicated FPGA architectures have been proposed [15, 16] (also see Sect. 8.2). However, they are static architectures that use exactly one single generic FPGA configuration throughout the entire computation and for all products.

In this work we systematically approach the inherent dynamics of the MLMC algorithm and propose a pricing platform that incorporates them. The dynamics in particular are:
  • The huge variety of financial products and the differences in how their prices are calculated.

  • The special role of the first level, which calculates only one price path, while the higher levels calculate two paths simultaneously.

  • The different number of discretization steps used in the iterative refinement strategy and the impact on the FPGA architecture.

Our goal is to design a pricing system that exploits the characteristics of the underlying hybrid CPU/FPGA execution platform efficiently for each part of the iterative algorithm and for all products traded on the market. A static design can never cover the complete range of those dynamics. Therefore we introduce a platform based design methodology that captures all the important characteristics of the problem and hybrid systems in general, but leaves enough flexibility to price arbitrary products and to target any specific hybrid device, see Fig. 8.3. It comes with three key features that address the dynamics:
  • A modular pricing framework that is easily extensible and consists of reusable building blocks with standardized ports to minimize the effort for adding new products.

  • Extensive use of online reconfiguration of the FPGA to always have the best architecture available at any time, while still keeping the overhead of reconfiguration in mind.

  • Use of static optimization to find the optimal configurations for a given financial product and specific hybrid device. The goal of the optimizer is to exploit all available degrees of freedom, including HW/SW splitting and the flexibility of the modular architecture.

Fig. 8.3

The HyPER platform makes use of a platform based design methodology in which the flexibility in both the application and the architectural space is captured by an automated approach. Once the user specifies the exact financial product and the target platform, the HyPER platform generates an optimal implementation for exactly this setup

With this new methodology it is possible to design a novel pricing system that is aware of the inherent dynamics of the problem. We introduce the resulting framework as the HyPER pricing system in the next section.

8.4 The HyPER Pricing System

HyPER is a high-speed pricing system for option pricing in the Heston model. It uses the advanced Multilevel Monte Carlo (MLMC) method and targets hybrid CPU/FPGA systems. To efficiently price the vast majority of exotic options traded on the market, it is based on reusable building blocks. To adapt the FPGA architecture to the requirements of the multilevel simulation in each part of the algorithm, it exploits online dynamic reconfiguration.

8.4.1 Modular Pricing Architecture

For each level l the main steps of the MLMC algorithm are:
  1. Simulate \(N_{l}\) MC paths \(\hat{S}^{(l)}\) and optionally \(\hat{S}^{(l-1)}\) with \(k = M^{l}\) time steps.

  2. Calculate the coarse and fine payoff \(g\big(\,.\,\big)\) for each path.

  3. Calculate the mean \(\mathrm{\mathbb{E}}[\hat{D}_{l}]\) and variance \(V _{l} =\mathrm{ \mathbb{V}ar}[\hat{D}_{l}]\) of the difference of all coarse and fine payoffs, according to Eq. (8.3).

This is done for \(l = l_{0},\ldots,L\). For practical problems the first level \(l_{0}\) is typically equal to 1, the multilevel constant M equal to 4, and the maximum level L between 5 and 7. The number of MC steps \(N_{l}M^{l}\) is roughly the same on each level and in the order of \(10^{12}\) [8, 11].

Step 1 is the most computationally intensive part of the multilevel algorithm since it requires solving Eq. (8.2). This involves generating Brownian increments (Increment Generator) and calculating the next step of each path, step by step, path by path (Path Generator). In HyPER we therefore implement it on the FPGA part of the hybrid architecture. While for the first level \(l_{0}\) only one type of path is calculated (Single-Level Kernel), higher levels require fine and coarse paths driven by the same Brownian increments. This makes the kernel more complicated and requires more logic resources (Multilevel Kernel). This covers the frontend of the HyPER architecture shown in Fig. 8.4.
Fig. 8.4

The HyPER frontend is a modular pipeline in which each block is fully utilized in each cycle. Payoff features are user defined and can be extended to generate path dependent features as required for the financial product being priced. Since often only a small set of features is required for a specific product, only the necessary blocks are mapped to the system

The Brownian increments are generated with a uniform Random Number Generator (RNG) and transformed to normally distributed random numbers. We choose the Mersenne Twister MT19937 for the uniform RNG and an Inverse Cumulative Distribution Function (ICDF) approach for the transformation. We further use antithetic variates as a variance reduction technique [10].
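A hedged software analogue of this increment generator is sketched below; std::mt19937 stands in for the MT19937 core and std::normal_distribution for the ICDF transformation of [14], so the numerical stream differs from the hardware one:

```cpp
#include <random>
#include <utility>

// Software stand-in for the increment generator: MT19937 feeds a normal
// transform (std::normal_distribution replaces the hardware ICDF), and
// antithetic variates reuse every Gaussian draw z as the pair (z, -z).
class IncrementGenerator {
public:
    explicit IncrementGenerator(unsigned seed) : rng_(seed), n01_(0.0, 1.0) {}

    // Returns an antithetic pair of standard normal increments.
    std::pair<double, double> next_pair() {
        const double z = n01_(rng_);
        return {z, -z};
    }

private:
    std::mt19937 rng_;                      // uniform RNG (Mersenne Twister MT19937)
    std::normal_distribution<double> n01_;  // stand-in for the ICDF transformation
};
```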

8.4.1.1 Payoff Computation

Step 2 involves the payoff computation and is strongly dependent on the option being priced. With the HyPER architecture we cover arbitrary European options, including barrier options, whose payoff depends on whether a barrier is hit or not, and Asian options, for which the payoff depends on the average of the stock price. For such path dependent payoffs every price of the path has to be considered. This leads to the dilemma that on the one hand a high-throughput payoff computation is needed, since the prices are generated on the FPGA fabric with one value per clock cycle. On the other hand the payoff computation may involve complex arithmetic that is not used in each cycle. Considering the payoff procedure carefully in the HW/SW splitting process is therefore crucial.

One of the key insights of the HyPER pricing system is to split the discounted payoff function \(g\big( \hat{S}_{t}\big)\) into two separate parts: a path dependent part \(F_{i}\) and a path independent part h. The idea is to put the path dependent part \(F_{i}\) on the FPGA and the independent part h on the CPU. We express the payoff as:
$$\displaystyle\begin{array}{rcl} g\big( \hat{S}_{t}\big) = h\left (F_{1}\big( \hat{S}_{t}\big),\ldots,F_{n}\big( \hat{S}_{t}\big)\right ).& & {}\\ \end{array}$$
We call the path dependent functions \(F_{i}\) features and choose them such that they contain as few arithmetic operations as possible. h does not directly depend on \(\hat{S}_{t}\). Let us look at an example: Asian call options with strike K. Their payoff is given by:
$$\displaystyle\begin{array}{rcl} g^{\text{Asian}}\big( \hat{S}_{t}\big) = e^{-rT}\max \left (\frac{1} {k}\sum _{i=1}^{k} \hat{S}_{ t_{i}} - K,0\right ).& & {}\\ \end{array}$$
In this case the sum is path dependent and we can identify the result of this sum as feature F:
$$\displaystyle\begin{array}{rcl} & & F\big( \hat{S}_{t}\big) =\sum _{ i=1}^{k} \hat{S}_{ t_{i}},\quad \text{and}\;\;g^{\text{Asian}}\big( \hat{S}_{ t}\big) = h\left (F\big( \hat{S}_{t}\big)\right ), {}\\ & & \Rightarrow h(x) = e^{-rT}\max \left (k^{-1}x - K,0\right ). {}\\ \end{array}$$
For each MC path we now get one feature F instead of all prices from all the time steps. This dramatically reduces the bandwidth requirements for the backend, for example on level 5 from one value per cycle to one value every 1,024 cycles.
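To make the split concrete, a minimal C++ sketch of the Asian call example follows: the feature F accumulates along the path (the FPGA side in HyPER), while h runs only once per path (the CPU side). The function names are illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Path dependent feature F(S): running sum of the discretized prices.
// In HyPER this part sits in the FPGA frontend and sees one price per cycle.
double feature_sum(const std::vector<double>& path) {
    double F = 0.0;
    for (double s : path) F += s;
    return F;
}

// Path independent part h(F): evaluated only once per path, e.g. in software.
double h_asian_call(double F, int k, double K, double r, double T) {
    return std::exp(-r * T) * std::max(F / k - K, 0.0);
}
```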
We have analyzed commonly traded European options1 and extracted five general features with which it is possible to price all of them. They are given in Figs. 8.4 and 8.5. Even highly exotic types like digital Asian barrier options are covered. Should a feature be missing for a very specific option type, it can easily be identified and added to the list.
Fig. 8.5

The HyPER backend processes path features generated by the frontend and calculates statistics like the mean and variance of the payoff of the financial product being priced. Since features are generated only once per path, the backend can process data from multiple frontends. Because of the low throughput demands on the later blocks, the HyPER platform can map them to software

In general, only very few features are necessary to define the payoff g of an option. This shows the general usefulness of this payoff split and suggests considering HW/SW partitions after all features have been generated. For the first part of the architecture, starting from the RNG and continuing with the path simulation, a HW/SW split is normally not advisable due to the high internal bandwidth requirements. We call this part of the architecture the HyPER frontend, as depicted in Fig. 8.4.

8.4.1.2 HyPER Backend

Everything following is called the HyPER backend. The stock prices in the frontend are calculated as \(\log ( \hat{S}_{t})\). While some of the features like min/max can even be applied to the log prices directly, for most of the features we have to go back to normal prices at some point. The backend therefore includes exponential transformations for log-features, the path independent parts of the payoff functions h (Payoff), and a statistics block that calculates Step 3 of the MLMC algorithm (see Fig. 8.5); a software sketch of this computation is given after Fig. 8.6. The rest of the algorithm is handled on the CPU. On higher levels, where fine and coarse paths are calculated, the statistics are evaluated for the differences. The rate of these differences is half the path rate, so we can always use the statistics core with an Initiation Interval (II) of 2, a core that accepts one value every second clock cycle. For the first level \(l_{0}\) we take the core with II = 1. Figure 8.6 shows the complete pricing system, the HyPER architecture.
Fig. 8.6

For a given financial product and hybrid system the HyPER platform generates well-matched HyPER architectures for each level, including an optimal HW/SW partitioning. It is then used by the multilevel control to compute the option price in multiple iterations. In each iteration a different HyPER architecture might be used and is reconfigured as necessary by the system
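The statistics computation referenced above (Step 3) boils down to a streaming mean/variance accumulator over the level differences \(\hat{D}_{l}\). A plain sequential C++ sketch, without the pipelining of the hardware core, looks as follows:

```cpp
#include <cstdint>

// Streaming accumulator for the mean and variance of the payoff differences
// D_l, i.e. Step 3 of the MLMC algorithm. One add() per path (or per path
// pair on the higher levels), matching the low data rate at the backend.
struct RunningStats {
    std::uint64_t n = 0;
    double sum = 0.0, sum_sq = 0.0;

    void add(double d)      { ++n; sum += d; sum_sq += d * d; }
    double mean() const     { return sum / n; }
    double variance() const {                       // population variance estimate
        const double m = mean();
        return sum_sq / n - m * m;
    }
};
```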

8.4.2 Runtime Reconfiguration

The overall performance of the hybrid option pricing system obviously depends on the actual configuration of the platform. It is important to note that for a given payoff function g there are still some degrees of freedom in the architecture, for example:
  • The number of HyPER instances on the FPGA part,

  • For each HyPER instance the number of frontends and where to make the HW/SW split in the backend, or

  • The type of communication core for CPU/FPGA communication.

When running the MLMC algorithm, the backend processes the payoff features \(F_{i}\) from the frontend, one feature set per path. On level one, new features are generated every 4th clock cycle, which suggests no HW/SW split inside or after the backend. On level l = 5, features are generated only every 1,024th clock cycle, which suggests an early HW/SW split right after the frontend.
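For illustration, assume the frontend runs at the 100 MHz clock used in Sect. 8.5 and that one feature value occupies 4 bytes (an assumption; the actual feature width depends on the product). The per-frontend feature rate, and with it the required backend bandwidth, then drops by a factor of \(M^{l}\) with the level:
$$\displaystyle\text{rate}(l) = \frac{f}{M^{l}},\qquad \text{rate}(1) = \frac{100\,\text{MHz}}{4} = 25 \times 10^{6}\,\text{values/s} \approx 100\,\text{MB/s},\qquad \text{rate}(5) = \frac{100\,\text{MHz}}{1024} \approx 0.1 \times 10^{6}\,\text{values/s} \approx 0.4\,\text{MB/s}.$$
These rates, together with the position of the split Ω, determine which communication core Ψ is fast enough.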

To account for these changing requirements for different levels, we propose an algorithmic extension in which we reconfigure the hybrid system for each level, see Algorithm 1.

This leaves the question of how to find the optimal HyPER configuration \(\mathcal{H}_{l}^{{\ast}}\) on each level, especially for the middle levels \(l = 2,\ldots,4\). This issue is addressed in the next sections.

Algorithm 1 Reconfigurable multilevel

Input: target accuracy ɛ, first level \(l_{0}\) and last level L

Output: approximated price of the option \(\hat{P}\)

      load \(\mathcal{H}_{l_{0}+1}^{{\ast}}\), the optimal configuration for level \(l_{0} + 1\).

      for \(l = l_{0},\ldots,L\) do

          Estimate the level variance \(V _{l} =\mathrm{ \mathbb{V}ar}\left [\hat{D}_{l}\right ]\), using an initial \(N_{l} = 10^{4}\) samples.

      end for

      Calculate \(N_{l_{0}},\ldots,N_{L}\) according to Eq. (8.4).

      for all l in \(\left \{l_{0},\ldots,L\right \}\) do

          load \(\mathcal{H}_{l}^{{\ast}}\), the optimal configuration for level l.

          Evaluate extra paths at each level up to \(N_{l}\).

      end for

    Calculate the approximated price of the option \(\hat{P}\) according to Eq. (8.3).
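A hedged host-side sketch of Algorithm 1 is given below; load_configuration, estimate_level_variance, and run_level are hypothetical names standing in for the reconfiguration and accelerator calls, not the actual HyPER API.

```cpp
#include <cmath>
#include <vector>

// Hypothetical stubs for the FPGA interaction; in the real system these
// trigger the partial reconfiguration and stream paths through the accelerator.
void   load_configuration(int /*level*/) {}
double estimate_level_variance(int /*level*/, long long /*n_init*/) { return 1.0; }
void   run_level(int /*level*/, long long /*extra_paths*/) {}

// Host-side flow of Algorithm 1: estimate the level variances, derive the
// path counts from Eq. (8.4), and reconfigure to the optimal H*_l per level.
void reconfigurable_mlmc(double eps, int l0, int L) {
    const long long n_init = 10000;                     // 10^4 initial samples
    std::vector<double> V(static_cast<std::size_t>(L) + 1, 0.0);

    load_configuration(l0 + 1);                         // multilevel configuration
    for (int l = l0; l <= L; ++l)
        V[l] = estimate_level_variance(l, n_init);

    double sum_sqrt_V = 0.0;
    for (int l = l0; l <= L; ++l) sum_sqrt_V += std::sqrt(V[l]);

    for (int l = l0; l <= L; ++l) {
        const long long N_l = static_cast<long long>(
            std::ceil(std::sqrt(V[l]) * sum_sqrt_V / (eps * eps)));   // Eq. (8.4)
        load_configuration(l);                          // switch to H*_l
        if (N_l > n_init) run_level(l, N_l - n_init);   // evaluate the extra paths
    }
    // The price estimate is finally assembled from the level means, Eq. (8.3).
}
```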

8.4.3 Static Optimizer

Based on a given platform \(\mathcal{F}\) and payoff function g the static optimizer finds the set of optimal HyPER configurations used in the reconfigurable MLMC algorithm (Algorithm 1). This set is used to reconfigure the FPGA several times during the execution to boost the overall performance.

The optimizer maximizes the performance of HyPER by exploiting all degrees of freedom in the architecture. These are in particular:
  • The number of HyPER instances N,

  • The communication core Ψ, and

  • For each HyPER instance \(n \in \left \{1,\ldots,N\right \}\):
    • The number of frontends \(k_{n}\),

    • The utilization factor of the frontend \(\beta _{n}\), and

    • The HW/SW split \(\varOmega _{n}\).

We express this freedom as \(\mathcal{H}_{l}\left (\mathcal{F},g;N,k_{1},\ldots,k_{N},\beta _{1},\ldots,\beta _{N},\varOmega _{1},\ldots,\varOmega _{N},\varPsi \right )\) and from now on only write \(\mathcal{H}_{l}\left (N,k_{n},\beta _{n},\varOmega _{n},\varPsi \right )\) for brevity. The best architectures are therefore defined by:
$$\displaystyle\begin{array}{rcl} \mathop{\mathop{\mathrm{maximize}}\nolimits }\limits_{N,k_{n},\beta _{n},\varOmega _{n},\varPsi }& & \qquad \mathrm{\mathbb{P}erformance}\big(\mathcal{H}_{l}\left (N,k_{n},\beta _{n},\varOmega _{n},\varPsi \right )\big), {}\\ \mathop{\mathrm{subject\;to}}\nolimits & & \qquad \mathrm{\mathbb{A}rea}^{\varphi }\big(\mathcal{H}_{l}\left (\ldots \right )\big) \leq \alpha ^{\varphi }\,\mathrm{\mathbb{A}rea}^{\varphi }\big(\mathcal{F}\big)\quad \forall \varphi, {}\\ & & \qquad \mathrm{\mathbb{L}oad}\big(\mathcal{H}_{l}\left (\ldots \right )\big) \leq 1, {}\\ & & \qquad \mathrm{\mathbb{B}andwidth}\big(\mathcal{H}_{l}\left (\ldots \right )\big) \leq \mathrm{ \mathbb{B}andwidth}\big(\varPsi \big), {}\\ \end{array}$$

where

  \(\mathcal{F}\): target reconfigurable hybrid system,

  g: given payoff function,

  \(N \in \mathbb{N}\): number of HyPER instances,

  \(\varPsi \in\) available communication cores of \(\mathcal{F}\): communication core,

  \(k_{n} \in \mathbb{N}\): number of frontends,

  \(\beta _{n} \in [0,1]\): utilization factor of the frontends,

  \(\varOmega _{n} \in \{\text{Ser.},\text{Exp},\text{Payoff},\text{ML-Diff},\text{Stats}\}\): HW/SW split,

  \(\varphi \in \{\text{LUT},\text{FF},\text{BRAM},\text{DSP}\}\): FPGA resource type,

  \(\alpha ^{\varphi }\): synthesis weight,

  for all \(n \in \{1,\ldots,N\}\) (i.e. for each HyPER instance).

 
This concludes the description of the HyPER platform; a brute-force sketch of the configuration search follows Fig. 8.7, and the whole methodology is shown in Fig. 8.7.
Fig. 8.7

The HyPER platform generation methodology requires CPU and FPGA implementations for each of the building blocks of the HyPER architecture (Fig. 8.6). Based on these designs, which might e.g. be written in High-Level Synthesis (HLS) code and C++, architecture models are derived specifying the area, CPU, and bandwidth usage of each of the blocks for a specific target architecture (middle). The ILP optimizer uses those models to determine the HyPER architecture with the highest speed for a specific target option. In this process it generates different HyPER architecture configurations for each of the levels (right), which are synthesized to bitstreams and reconfigured during runtime (bottom right)
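For intuition, the optimizer's task can also be viewed as a search over a finite configuration space subject to the area, load, and bandwidth constraints above. The brute-force sketch below is only a stand-in for the actual ILP formulation; the Config and Platform records are hypothetical placeholders for the block models just described.

```cpp
#include <vector>

// Hypothetical per-configuration model; in HyPER these values come from the
// synthesized building blocks (Table 8.1) and the CPU/bandwidth measurements.
struct Config {
    double lut, ff, bram, dsp;   // FPGA resources required
    double cpu_load;             // fraction of the ARM cores occupied
    double bandwidth;            // required FPGA-to-CPU bandwidth in MB/s
    double performance;          // MC steps per second
};

struct Platform {
    double lut, ff, bram, dsp;           // available resources
    double a_lut, a_ff, a_bram, a_dsp;   // synthesis weights alpha
    double link_bandwidth;               // bandwidth of the chosen core Psi
};

// Return the feasible candidate with the highest performance (nullptr if none),
// i.e. the constraints of Sect. 8.4.3 checked by enumeration instead of an ILP.
const Config* best_configuration(const std::vector<Config>& cands, const Platform& p) {
    const Config* best = nullptr;
    for (const Config& c : cands) {
        const bool feasible =
            c.lut  <= p.a_lut  * p.lut  && c.ff  <= p.a_ff  * p.ff  &&
            c.bram <= p.a_bram * p.bram && c.dsp <= p.a_dsp * p.dsp &&
            c.cpu_load <= 1.0 && c.bandwidth <= p.link_bandwidth;
        if (feasible && (best == nullptr || c.performance > best->performance))
            best = &c;
    }
    return best;
}
```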

8.5 HyPER on Zynq

In this section we thoroughly investigate the HyPER architecture for the Xilinx Zynq 7020 platform. It is an SoC that integrates a dual-core ARM Cortex-A9 processor and an FPGA into a tightly coupled hybrid system. As a practical example, we choose barrier call options as the financial product.

In order to solve the static optimization problem we need to know how big the building blocks of the HyPER architecture from Fig. 8.6 are on our device \(\mathcal{F}\), shown in Fig. 8.9. For that, we have implemented all building blocks for the FPGA with Xilinx Vivado HLS for f = 100 MHz and single precision floating-point arithmetic. To implement the ICDF we followed [14]. We have run a complete place & route synthesis for each core and extracted the resource usage numbers from Xilinx Vivado. As the cores include the full AXI interfaces, these are accurate numbers and they do not change much for composed designs. The obtained numbers are shown in Table 8.1.
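As an illustration of what such a building block looks like at the source level, here is a hedged HLS-style sketch of the barrier feature; the pragma and interface are only indicative and this is not the actual HyPER source.

```cpp
// HLS-style sketch of the barrier feature block: it consumes one price per
// call and, when pipelined with II=1, one price per clock cycle. The real
// HyPER block works on log prices and is part of the streaming frontend.
bool barrier_hit(const float* prices, int k, float barrier) {
    bool hit = false;
    for (int i = 0; i < k; ++i) {
#pragma HLS PIPELINE II=1
        hit |= (prices[i] >= barrier);   // knock-out check; the direction is illustrative
    }
    return hit;
}
```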
Table 8.1

Building blocks of HyPER on the Zynq-7000 series. Based on their area, throughput, and CPU timing numbers the HyPER platform can find the optimal HyPER architecture

Building blocks              LUT       FF        BRAM   DSP   CPU ns/val.
Increment generator:
  Mersenne Twister           301       323       4      0     –
  ICDF                       451       592       4      1     –
  Antithetic core            228       258       0      0     –
Path generators:
  Single-level kernel        4,153     4,241     2      38    –
  Multilevel kernel          5,607     5,326     6      43    –
Payoff features \(F_{i}\):
  Barrier                    180       158       0      0     –
Payoff h:
  Call/put                   440       396       0      2     6
Backend:
  Feature serializer k×1     30k+65    65k+45    0      0     –
  Exponential                900       384       0      7     250
  Multilevel difference      372       355       0      2     5
  Statistics II=1            2,170     1,612     4      9     6
  Statistics II=2            1,454     1,164     2      6     3

Com. interface Ψ (FPGA → CPU)   LUT       FF        BRAM   Bandwidth (MB/s)
  Config-Bus 1×k                30k+50    2k+40     0      <1
  Streaming-Fifo                654       611       4      20
  DMA-Core                      1,864     3,122     4      350

Hybrid chip \(\mathcal{F}\)     LUT       FF        BRAM   DSP   ARM
  Xilinx Zynq 7020              53,200    106,400   280    220   2 cores
  Synthesis weight α            0.8       0.5       1      1     –

Furthermore, we have to know how much CPU load the blocks generate when they are mapped to the ARM processors. We estimated this by implementing the blocks as C++ functions and measuring the time per input value.
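A hedged sketch of such a measurement with std::chrono is shown below, using the exponential transform as the example; the numbers in Table 8.1 were of course obtained on the ARM cores of the Zynq, not on an arbitrary host running this snippet.

```cpp
#include <chrono>
#include <cmath>
#include <cstdio>
#include <vector>

// Measure the average CPU time per input value of a backend block implemented
// in software; the exponential transform serves as the example here.
int main() {
    const std::size_t n = 1u << 22;
    std::vector<float> in(n, 0.5f), out(n);

    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i) out[i] = std::exp(in[i]);
    const auto t1 = std::chrono::steady_clock::now();

    const double ns_per_value =
        std::chrono::duration<double, std::nano>(t1 - t0).count() / n;
    std::printf("exponential: %.1f ns/value\n", ns_per_value);
    return 0;
}
```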

Additionally, we need to determine the speed and area of all available communication cores. We have used simple continuous streaming cores and measured the raw speed on the ARM cores. Finally, we have to specify how big our FPGA is and how many of its resources we want to use, as fully mapped devices cause routing congestion. The numbers of our complete analysis are given in Table 8.1.

We formulated the optimization problem introduced in Sect. 8.4.3 as an Integer Linear Programming (ILP) problem and solved it with an ILP solver. As a result we obtained four unique architectures. The optimal parameters for each architecture \(\mathcal{H}_{l}^{{\ast}}\) are listed in Table 8.2. Their metrics area, load, bandwidth, and performance are given in Table 8.3. Figure 8.8 visualizes the resulting architectures.
Table 8.2

Optimal HyPER architectures for barrier option pricing on the Zynq 7020

Optimal HyPER architectures \(\mathcal{H}_{l}^{{\ast}}\) for \(\mathcal{F} = \text{Xilinx Zynq 7020}\), g = barrier call option:

\(\mathcal{H}_{1}^{{\ast}} = \mathcal{H}_{1}\big(N = 2\), Ψ = DMA; \(k_{1} = 4\), \(\beta _{1} = 1\), \(\varOmega _{1} = \text{Stats}\); \(k_{2} = 1\), \(\beta _{2} = 1\), \(\varOmega _{2} = \text{Exp}\big)\)

\(\mathcal{H}_{2}^{{\ast}} = \mathcal{H}_{2}\big(N = 1\), \(\varPsi = \text{Config-Bus}\); \(k_{1} = 4\), \(\beta _{1} = 1\), \(\varOmega _{1} = \text{Stats}\big)\)

\(\mathcal{H}_{3}^{{\ast}} = \mathcal{H}_{3}\big(N = 1\), Ψ = DMA; \(k_{1} = 5\), \(\beta _{1} = 0.966\), \(\varOmega _{1} = \text{Serializer}\big)\)

\(\mathcal{H}_{l}^{{\ast}} = \mathcal{H}_{l}\big(N = 1\), \(\varPsi = \text{Streaming-Fifo}\); \(k_{1} = 5\), \(\beta _{1} = 1\), \(\varOmega _{1} = \text{Serializer}\big)\qquad \forall \;l \geq 4\)

 
Table 8.3

Area, CPU, and bandwidth requirements as well as performance of each of the optimal HyPER configurations \(\mathcal{H}_{l}^{{\ast}}\) given in Table 8.2

HyPER                         Area in % (LUT / FF / BRAM / DSP)   CPU load   Bandwidth used / available (MB/s)   Performance (MC step/s)
\(\mathcal{H}_{1}^{{\ast}}\)    63 / 32 / 13 / 99                   0.19       95 / 350                            500 M
\(\mathcal{H}_{2}^{{\ast}}\)    58 / 27 / 15 / 87                   0.00       0 / <1                              400 M
\(\mathcal{H}_{3}^{{\ast}}\)    69 / 35 / 19 / 100                  1.00       30 / 350                            483 M
\(\mathcal{H}_{4}^{{\ast}}\)    67 / 33 / 19 / 100                  0.26       7 / 20                              500 M
\(\mathcal{H}_{5}^{{\ast}}\)    67 / 33 / 19 / 100                  0.07       2 / 20                              500 M
Based on the configurations found by the ILP solver we can determine the minimum set of configurations that still reaches the maximum performance. For each level we estimate the performance of each configuration and keep the ones with maximum performance. The result is shown in Table 8.4. We see that configurations \(\mathcal{H}_{1}^{{\ast}}\), \(\mathcal{H}_{2}^{{\ast}}\), and \(\mathcal{H}_{3}^{{\ast}}\) are sufficient to reach maximum performance. \(\mathcal{H}_{l}^{{\ast}}\) for l ≥ 4 looks similar to \(\mathcal{H}_{3}^{{\ast}}\), just with a streaming First-In, First-Out (FIFO) buffer instead of a DMA core as the interface to the CPU. Therefore we save one reconfiguration (Fig. 8.8).
Fig. 8.8

Optimal HyPER architectures for barrier option pricing on the Xilinx Zynq 7020, as given in Table 8.2. They are specific configurations of the architecture in Fig. 8.6 with abbreviations (IG – Increment Generator, SL – Single-level Path Generator, B – Barrier, Ex – Exponential, C – Call, St. – Statistics, ML CTRL – Multilevel Control, ML – Multilevel Path Generator, D – Multilevel Difference). Note that configuration \(\mathcal{H}_{1}^{{\ast}}\) contains two HyPER instances with different HW/SW partitioning

Table 8.4

List of optimal configurations for each level, used to find the minimal set of configurations

 

Level        Configurations providing maximum performance
1            \(\mathcal{H}_{1}^{{\ast}}\)
2            \(\mathcal{H}_{2}^{{\ast}}\)
3            \(\mathcal{H}_{3}^{{\ast}}\)
\(\geq\) 4   \(\mathcal{H}_{3}^{{\ast}}\), \(\mathcal{H}_{4}^{{\ast}}\)

 

In the next section we evaluate these configurations in detail.

8.5.1 Results and Comparison

We have synthesized the optimal HyPER architectures \(\mathcal{H}_{l}^{{\ast}}\) as defined in Table 8.2 and implemented the complete multilevel algorithm. As an example, the floorplan of \(\mathcal{H}_{3}^{{\ast}}\) is shown in Fig. 8.9. On the ARM cores we boot a full Linaro Ubuntu. The Zynq platform supports online dynamic reconfiguration from the OS level in about 50 ms. The running system is visible in the picture in Fig. 8.7, with the ZC706 board on the left, the generated paths in the middle, and the power measurements on the right of the picture.
Fig. 8.9

Floorplan of the optimal HyPER Architecture \(\mathcal{H}_{3}^{{\ast}}\) for level 3, as defined in Table 8.2, highlighting the five frontends and the interconnect Ψ

To quantify the quality of our implementation, we have implemented a sophisticated CPU Heston pricer as a reference model. While Gaussian increment generation is only a small part of the HyPER architecture on FPGAs, it takes significant time on CPUs (about 40 % of the overall runtime). We have compared several advanced libraries and selected the fastest: the Mersenne Twister RNG from the C++11 standard library and the Ziggurat method from the GNU Scientific Library (GSL), which we adapted to use single precision floating-point. We have written the Monte Carlo step generation by hand and tuned its loop structure to support Advanced Vector Extensions (AVX). Additionally, we parallelized the whole program such that it uses all available cores. We have employed the Microsoft Visual C++ (MSVC) 2012 compiler, which has excellent auto-vectorization support, with the compiler flags “/O2 /arch:AVX /fp:fast /GL”. Profile-guided optimization gave an additional 10 % speedup. The result is a high-speed reference implementation that has received as much care as HyPER itself.

As an execution platform, we had several choices between servers, desktops, and laptops. Among all of them, the laptop proved to be the most energy efficient platform. It is a Dell Latitude E6430 with an Intel Core i5-3320M manufactured in 22 nm and supporting the latest AVX instructions. The Zynq 7020 is fabricated with a 28 nm process. Both chips are the most recent generations available today.

For measuring the speed, we have calculated the price of barrier call options for the Heston benchmark parameters [4] in Table 8.5 with a target precision of ɛ = 0.005, start level \(l_{0} = 1\), last level L = 5, and multilevel constant M = 4 (compare Chap.  4). We have validated that both implementations are correct and calculate the same number of MC paths \(N_{l}\) on each level, as given in Table 8.7. We have measured the overall execution time and the power consumption. For the laptop we kept the power consumption to a minimum by turning off the display and Wi-Fi and removing all USB devices. We have run the simulation in a loop and measured the average power at the power plug.
Table 8.5

Benchmark Heston parameters [4]

\(S_{0}\)   κ   θ      σ     r      \(\nu _{0}\)   ϱ       K     T   Barrier   ɛ
100         3   0.16   0.4   0.02   0.1            −0.8    100   1   150       0.005

 
Table 8.6

Execution time and energy consumption

 

                     Time (s)   Power (W)   Energy (J)
MC on CPU(a)         111        31.3        3,460
MC on Zynq(b)        26.1       2.77        72
MLMC on CPU(a)       29.9       30.6        916
  Level 1            2.9        30.6        88
  Level 2            4.2        30.6        130
  Level 3            5.2        30.6        158
  Level 4            7.0        30.6        214
  Level 5            10.7       30.6        327
  Reconf.            –          –           –
HyPER on Zynq(b)     8.83       2.87        25.3
  Level 1            0.58       3.05        1.77
  Level 2            1.39       2.41        3.35
  Level 3            1.52       3.38        5.14
  Level 4            2.07       2.96        6.11
  Level 5            3.03       2.80        8.48
  Reconf.            0.25       1.86        0.47

(a) Intel Core i5-3320M, (b) Zynq 7020

Table 8.7

Chosen MC path counts \(N_{l}\) for each level when pricing our benchmark barrier call option with MLMC for the Heston parameters in Table 8.5

Level l   Time steps \(k = M^{l}\)   MC paths \(N_{l}\) [\(\times 10^{6}\)]   Fine MC steps \(N_{l}M^{l}\) [\(\times 10^{9}\)]   Coarse MC steps \(N_{l}M^{l-1}\) [\(\times 10^{8}\)]

Multilevel Monte Carlo (\(l_{0} = 1\), L = 5, M = 4):
1         4         72.5     0.29     –
2         16        27.8     0.44     1.11
3         64        9.2      0.59     1.47
4         256       3.2      0.83     2.07
5         1,024     1.18     1.21     3.03

Classical Monte Carlo:
–         1,024     15.3     15.62    –

 

To measure the power of the hybrid platform, we have used the Xilinx ZC702 evaluation board. It is possible to measure all power lanes on a 50 ms basis. We have run the simulation in a loop and added up the average power consumption of each power lane, except the 3.3 V lane with about 0.7 W. The measured power includes the Zynq 7020, Dynamic Random-Access Memory (DRAM), and oscillators, but not peripherals like LEDs, USB, or HDMI controllers, which have not been in use at all. To account for a power supply with 90 % efficiency, we have multiplied all measurements by 1.11.

The measured numbers are presented in Table 8.6. The CPU takes 29.9 s and 916 J, while HyPER takes 8.83 s and 25.3 J to price the product. This means the HyPER architecture on the Zynq is 3.4× faster and 36× more power efficient than the reference system. As option pricing is perfectly scalable over multiple instances, HyPER is 36× faster than the CPU for a fixed power budget.
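Both factors follow directly from the totals in Table 8.6:
$$\displaystyle\frac{29.9\,\text{s}}{8.83\,\text{s}} \approx 3.4,\qquad \frac{916\,\text{J}}{25.3\,\text{J}} \approx 36.$$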

Without reconfiguration, the best single architecture for all levels would be \(\mathcal{H}_{2}^{{\ast}}\). Pricing the same benchmark on this static architecture would take 10.5 s, which is 19 % slower than the HyPER architecture with online reconfiguration.

8.5.2 Comparison with Related Work

In this section we compare HyPER on Zynq to related work [15] and [16], introduced in Sect. 8.2. Although the architectures [15, 16] are limited to barrier options while HyPER supports the whole spectrum of traded options, we evaluate them in this specific setting.

Reference [15] is a classical MC implementation on a hybrid system containing a Virtex 5 and a laptop. The HyPER architecture is superior on both the algorithmic and implementation level:
  1. On the algorithmic level, HyPER uses the faster MLMC algorithm. In our setup (Table 8.5) MLMC needs to evaluate 3.8× fewer MC steps than classical MC (see Table 8.7). A more elaborate numerical comparison between both algorithms can be found in [8], where Giles shows speedups from 3× to 100×, mainly depending on the option types considered.

  2. While [15] uses a Virtex 5 with a static configuration and a laptop, we present a runtime reconfigurable architecture on a tightly coupled hybrid device.

     
Based on the numbers given in [15], it would take 110 s and 3,861 J to run the benchmark. That means HyPER is 12.5× faster and 153× more power efficient than [15] due to improvements on algorithmic and implementation level, see Table 8.8 for more details.
Table 8.8

Comparison of HyPER on Zynq with related work for Heston MC barrier option pricing

Architecture                          De Schryver et al. [15]   De Schryver et al. [16]     HyPER on Zynq (this work)
Algorithm                             Classical MC              Multilevel MC               Multilevel MC
Total MC steps (\(\times 10^{9}\))    15.62                     4.13                        4.13
Time (s)                              110                       –                           9
Energy (J)                            3,861                     –                           25
Monte Carlo barrier frontend:
  LUT                                 5,480                     10,300                      6,770
  FF                                  6,950                     11,900                      6,660
  DSP                                 43                        68                          44
  BRAM                                10                        128                         22
  Frequency (MHz)                     102                       120                         100
Setup                                 Virtex 5 + Laptop         Virtex 6, synthesis only    Zynq-7000

The MLMC architecture in [16] is only a partial implementation, and no time or energy numbers are given for a complete pricing system. Specifically, only synthesis results are reported for parts of the architecture, mainly what we call the HyPER frontend. The payoff computation has not been implemented, which is why no complete comparison can be made. Section IV of [16] suggests doing the payoff computations on an embedded CPU. We have shown in Sect. 8.4.2 that such a HW/SW split leads to high CPU and bandwidth requirements for the low levels. The work of [16] would therefore require a powerful CPU. With HyPER we have solved this issue by dynamically changing the HW/SW partitioning during runtime. As a result, we expect our architecture to be far superior in power efficiency compared to [16].

We can compare the synthesis results in [16] with our implementation of the HyPER frontend, including the increment generator, the multilevel path generator, and the barrier checker (see Table 8.8). While the two devices have almost the same FPGA fabric and both implementations use single-precision floating-point as the calculation format, our implementation is significantly ( > 35 %) smaller. This difference might come from the way [16] models what we call the path generator. They have split this part of the architecture into more than 10 pieces, each modeled individually with HLS and connected by Advanced eXtensible Interface (AXI) Stream components. In contrast to this approach, we have modeled everything in one HLS component with no internal buffers, making the design efficient and compact, with just 145 lines of code.

8.5.3 Flexibility Performance Tradeoff

In the last two sections we have clearly shown that HyPER is far superior in energy efficiency compared to CPU solutions and to architectures from related work. In this section we comment on flexibility, see Fig. 8.10.
Fig. 8.10

Runtime and flexibility of the presented architectures. HyPER is a clear winner in both flexibility and efficiency, being 135× faster than the classic MC CPU solution

We define flexibility as how easy it is to add new financial products to the implementation. The MC CPU solution has the highest flexibility. While in general it is as easy to add new payoff functions to an MLMC implementation as it is to a classical MC implementation, some payoffs need special treatment in MLMC algorithms. That is why the MLMC CPU solution is slightly less flexible.

The classic MC architecture [15] is basically equivalent to a HyPER configuration with six SL frontends on the Zynq 7020. Adding new products to this architecture requires the user to design new payoff blocks and change the whole FPGA architecture, which is extremely difficult. That is why this solution has the lowest flexibility.

With HyPER we get both another speedup and most of the flexibility back. The systematic approach of splitting payoffs into features and payoff functions makes it very easy to find a good FPGA implementation for new products. In addition, it is completely clear where to put such new blocks, while reusing the same interfaces and all other blocks. Furthermore, it is possible to create HLS templates where a user just has to add the mathematical formulas in C++ syntax, so in general no FPGA knowledge is necessary. Once the new blocks are written in HLS, new models can be derived automatically from them and fed into the static optimizer, which will then generate the most efficient architectures.

With HyPER, adding new products is both easy and efficient, giving us back most of the flexibility a CPU solution has.

8.6 Block Modeling Extensions

In Sect. 8.4.3 we introduced a simple formalism to model the HyPER architecture. However, the formalism can be further extended to incorporate even more flexibility of the architecture.
  1. While we synthesized all our building blocks for a fixed frequency, it might be possible to use a different frequency for each building block, or at least for each FPGA configuration. For that we would synthesize each block for a set of frequencies. The expected tendency is that faster cores consume more FPGA resources, so we might find a more balanced configuration this way.

  2. For more complex payoff functions it is beneficial to consider all possible pipeline IIs, which leads to smaller designs due to operator reuse.

  3. In the HLS tool it is easy to trade off Digital Signal Processor (DSP) blocks against Lookup Table (LUT)/Flip-Flop (FF) usage. By compiling all the cores for different DSP usage factors it should be possible to find more balanced chip configurations with possibly even higher throughput.

8.7 Conclusions

The HyPER platform is a novel option pricing system for hybrid reconfigurable platforms. It is based on state-of-the-art Multilevel Monte Carlo (MLMC) methods and the Heston market model, and covers a wide range of option types. As a platform, HyPER captures all essential aspects of the problem and implementation space in a systematic way to generate efficient implementations. It provides a formalism to describe options in a way that they can be optimally mapped to a hybrid system. In this formalism, payoff functions are systematically split into two parts, one targeting the FPGA and the other one the CPU. Furthermore, it provides a reconfigurable multilevel algorithm enabling the platform to adapt itself to the changing requirements of different parts of the algorithm. With specific information about the implementation platform, including area, runtime, and bandwidth numbers, HyPER is able to yield an optimal implementation to price a financial product.

We have used the HyPER platform to find an efficient implementation for barrier options on the Xilinx Zynq 7020 All Programmable SoC. The implementation is 3.4× faster and 36× more power-efficient than a highly tuned software reference on an Intel Core i5 CPU.

As far as the authors know, HyPER is the first flexible FPGA based Heston pricing system supporting a wide range of traded options, while clearly outperforming previous specialized Heston Monte Carlo implementations at the same time.

Footnotes

  1. Call and put options of type Vanilla, barrier (upper or lower, knock-in or knock-out, one barrier or multiple, unconditioned or windowed), Asian (geometric or arithmetic), Digital, and Lookback (fixed or floating strike) or any combinations of such types.


Acknowledgements

We gratefully acknowledge the partial financial support from the Center of Mathematical and Computational Modelling (CM)2 of the University of Kaiserslautern, from the German Federal Ministry of Education and Research under grant number 01LY1202D, and from the Deutsche Forschungsgemeinschaft (DFG) within the RTG GRK 1932 “Stochastic Models for Innovations in the Engineering Sciences”, project area P2. The authors alone are responsible for the content of this work.

References

  1. Advances and Innovations – Field Programmable Gate Arrays (FPGAs) (2015). http://careers.jpmorgan.com/experienced/jpmorgan/jobs/businesses/ib/technology/advances#Field_Programmable_Gate_Arrays__FPGAs_. Last access: 09 Feb 2015
  2. Bernemann, A., Schreyer, R., Spanderen, K.: Accelerating exotic option pricing and model calibration using GPUs. Available at SSRN 1753596 (2011)
  3. Brugger, C., de Schryver, C., Wehn, N.: HyPER: a runtime reconfigurable architecture for Monte Carlo option pricing in the Heston model. In: Proceedings of the 24th IEEE International Conference on Field Programmable Logic and Applications (FPL), Munich, pp. 1–8 (2014). doi:10.1109/FPL.2014.6927458
  4. Brugger, C., de Schryver, C., Wehn, N., Omland, S., Hefter, M., Ritter, K., Kostiuk, A., Korn, R.: Mixed precision multilevel Monte Carlo on hybrid computing systems. In: Proceedings of the 2014 IEEE Conference on Computational Intelligence for Financial Engineering & Economics (CIFEr), London, pp. 215–222 (2014). doi:10.1109/CIFEr.2014.6924076
  5. Brugger, C., Weithoffer, S., de Schryver, C., Wasenmüller, U., Wehn, N.: On parallel random number generation for accelerating simulations of communication systems. Adv. Radio Sci. 12, 75–81 (2014). doi:10.5194/ars-12-75-2014
  6. Delivorias, C.: Case studies in acceleration of Heston's stochastic volatility financial engineering model: GPU, cloud and FPGA implementations. Master's thesis, The University of Edinburgh (2012). http://www.hpcfinance.eu/sites/www.hpcfinance.eu/files/Christos_Delivorias_0.pdf
  7.
  8. Giles, M.B.: Multilevel Monte Carlo path simulation. Oper. Res. 56(3), 607–617 (2008)
  9. Heston, S.L.: A closed-form solution for options with stochastic volatility with applications to bond and currency options. Rev. Financ. Stud. 6(2), 327 (1993). doi:10.1093/rfs/6.2.327
  10. Korn, R., Korn, E., Kroisandt, G.: Monte Carlo Methods and Models in Finance and Insurance. CRC, Boca Raton (2010)
  11. Marxen, H.: Aspects of the application of multilevel Monte Carlo methods in the Heston model and in a Lévy process framework. Ph.D. thesis, University of Kaiserslautern (2012)
  12. Schmerken, I.: Deutsche Bank shaves trade latency down to 1.25 microseconds (2011). http://www.advancedtrading.com/infrastructure/229300997. Last access: 09 Feb 2015
  13. de Schryver, C., Schmidt, D., Wehn, N., Korn, E., Marxen, H., Korn, R.: A new hardware efficient inversion based random number generator for non-uniform distributions. In: Proceedings of the 2010 International Conference on Reconfigurable Computing and FPGAs (ReConFig), Cancun, pp. 190–195 (2010). doi:10.1109/ReConFig.2010.20
  14. de Schryver, C., Schmidt, D., Wehn, N., Korn, E., Marxen, H., Kostiuk, A., Korn, R.: A hardware efficient random number generator for nonuniform distributions with arbitrary precision. Int. J. Reconfigurable Comput. (IJRC) 2012, 1–11 (2012). doi:10.1155/2012/675130
  15. de Schryver, C., Shcherbakov, I., Kienle, F., Wehn, N., Marxen, H., Kostiuk, A., Korn, R.: An energy efficient FPGA accelerator for Monte Carlo option pricing with the Heston model. In: Proceedings of the 2011 International Conference on Reconfigurable Computing and FPGAs (ReConFig), Cancun, pp. 468–474 (2011). doi:10.1109/ReConFig.2011.11
  16. de Schryver, C., Torruella, P., Wehn, N.: A multi-level Monte Carlo FPGA accelerator for option pricing in the Heston model. In: Proceedings of the IEEE Conference on Design, Automation and Test in Europe (DATE), Grenoble, pp. 248–253 (2013)
  17. Sridharan, R., Cooke, G., Hill, K., Lam, H., George, A.: FPGA-based reconfigurable computing for pricing multi-asset barrier options. In: Proceedings of the Symposium on Application Accelerators in High-Performance Computing (SAAHPC), Chicago, IL (2012)
  18. Thomas, D.B., Luk, W.: A domain specific language for reconfigurable path-based Monte Carlo simulations. In: International Conference on Field-Programmable Technology (ICFPT 2007), Kitakyushu, pp. 97–104 (2007). doi:10.1109/FPT.2007.4439237
  19. Thomas, D.B., Luk, W., Leong, P.H., Villasenor, J.D.: Gaussian random number generators. ACM Comput. Surv. 39(4), 11 (2007). doi:10.1145/1287620.1287622
  20. Tian, X., Benkrid, K., Gu, X.: High performance Monte-Carlo based option pricing on FPGAs. Eng. Lett. 16(3), 434–442 (2008)
  21. du Toit, J., Ehrlich, I.: Local volatility FX basket option on CPU and GPU. Technical report, The Numerical Algorithms Group Ltd (2013). http://www.nag.co.uk/numeric/gpus/local-volatility-fx-basket-option-on-cpu-and-gpu.pdf. Last access: 09 Feb 2015
  22. Tse, A., Thomas, D., Tsoi, K., Luk, W.: Dynamic scheduling Monte-Carlo framework for multi-accelerator heterogeneous clusters. In: 2010 International Conference on Field-Programmable Technology (FPT), pp. 233–240 (2010). doi:10.1109/FPT.2010.5681495

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Christian Brugger
    • 1
  • Christian De Schryver
    • 1
  • Norbert Wehn
    • 1
  1. 1.Microelectronic Systems Design Research GroupUniversity of KaiserslauternKaiserslauternGermany
