Design Automation for Embedded Systems, Volume 19, Issue 1–2, pp 189–221

Enabling FPGA routing configuration sharing in dynamic partial reconfiguration

  • Brahim Al Farisi
  • Karel Heyse
  • Karel Bruneel
  • João Cardoso
  • Dirk Stroobandt

Abstract

Using dynamic partial reconfiguration (DPR), several circuits can be time-multiplexed on the same FPGA region, saving considerable area compared to an implementation without DPR. However, the long reconfiguration time to switch between circuits remains a significant problem. In this work we show that it is possible to significantly reduce this overhead when the number of circuits is limited. We lower the DPR overhead by reducing the number of configuration bits that need to be reconfigured. This is achieved by keeping a (predetermined) part of the configuration frames of the DPR region constant/static for all circuits and, consequently, sharing this part of the configuration between all the circuits. We show that this can be done while maintaining the possibility to implement completely unrelated circuits in the DPR region. An extension of the Pathfinder algorithm, called StaticRoute, is presented. It is able to route the nets of the different circuits simultaneously in such a way that the routing of the different circuits is the same in the static part and may only differ in the dynamic part. Our approach is evaluated on the architecture of a commercially available SRAM-based FPGA. We explore how the static part in the configuration memory is best chosen and investigate the associated impact on maximum operating clock frequency as the number of circuits increases. Our experiments show that it is possible to make 50 % of the routing configuration static and therefore reduce the routing reconfiguration time by 50 %, without a significant impact on maximum clock frequency of the circuits. This corresponds to a reduction of total reconfiguration time of 34 %.

Keywords

  • FPGA
  • Routing
  • Dynamic partial reconfiguration

1 Introduction

Dynamic partial reconfiguration (DPR) of Field Programmable Gate Arrays (FPGAs) allows designers to time-multiplex different circuits on the same chip area, called the reconfigurable region (RR). When using DPR, only an RR that can contain the biggest circuit, in terms of number of logic elements, is required. All the smaller circuits can then also be implemented in this RR, as there are ample resources available. This way significant area savings can be achieved compared to a static implementation of the circuits separately in different regions of the FPGA. DPR therefore makes it possible to use smaller and thus cheaper FPGAs. Note that we only consider reconfiguring logic elements and not other FPGA components such as DSPs, BRAMs, etc.

The configuration memory of most commercial FPGAs consists of SRAM memory cells that control the contents of the look-up tables and the state of the routing switches. To implement a circuit on the FPGA, a configuration needs to be generated. This configuration contains the binary values that need to be written to the FPGA’s configuration memory cells. In conventional DPR systems, a configuration is generated for every circuit by implementing it independently in the RR. Each memory cell of the RR then corresponds to one binary value for each circuit. When these binary values are the same for a single memory cell in all circuits, they are called a static bit. Otherwise, they are called a dynamic bit. Memory cells containing a static bit do not need to be rewritten when switching between circuits.

However, in current FPGAs, the reconfiguration granularity is a collection of memory cells called a frame. A whole frame needs to be rewritten, even when only one memory cell of the frame contains a dynamic bit. The problem with conventional DPR systems is that the dynamic bits are scattered over the frames of the configuration memory, making it necessary to reconfigure almost the complete reconfigurable region [44]. This leads to long reconfiguration times, making DPR less useful for more dynamic applications [14, 33, 37, 38, 42, 51].
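
The effect of frame granularity can be illustrated with a minimal sketch. It assumes, purely for illustration, that a configuration is a flat list of bits and that a frame is a block of `frame_size` consecutive cells; the function names and the two 12-cell configurations are made up and do not reflect any real bitstream layout.

```python
from typing import List, Set


def dynamic_cells(configs: List[List[int]]) -> Set[int]:
    """Return the indices of memory cells whose value differs between circuits."""
    n_cells = len(configs[0])
    return {i for i in range(n_cells) if len({cfg[i] for cfg in configs}) > 1}


def frames_to_rewrite(configs: List[List[int]], frame_size: int) -> Set[int]:
    """A frame must be rewritten as soon as it contains one dynamic cell."""
    return {cell // frame_size for cell in dynamic_cells(configs)}


# Two made-up 12-cell configurations, viewed as 3 frames of 4 cells each
# (assumption: cells of one frame are contiguous).
circuit_a = [0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0]
circuit_b = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1]

cells = dynamic_cells([circuit_a, circuit_b])
frames = frames_to_rewrite([circuit_a, circuit_b], frame_size=4)
print(sorted(cells), sorted(frames))
```

In this toy case only 2 of the 12 cells hold a dynamic bit, yet 2 of the 3 frames have to be rewritten, which is the scattering problem described above.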

In this paper we present a novel approach to reduce the reconfiguration time by clustering dynamic bits so that fewer frames need reconfiguration. The largest portion of the configuration memory of an FPGA is used to program the routing. We therefore focus on reducing the time to reconfigure the FPGA’s interconnection network.

This new technique consists of two steps. In a first step the configuration memory of the RR’s routing switches is divided into a static and a dynamic part. Care needs to be taken that the memory cells of the static part reside in other frames than those of the dynamic part. Then, in a second step the interconnections of all circuits are routed simultaneously using our novel router, which we named StaticRoute. As shown further in this paper, this new technique is applicable on a typical commercially available SRAM-based FPGA. StaticRoute routes the interconnections in such a way that dynamic bits are avoided in the static switches of the RR. The dynamic bits are thus clustered in the dynamic part of the configuration memory. To the best of our knowledge, we are the first to propose such an approach.

StaticRoute uses a novel concept called switch congestion. In contrast to wire congestion, where a wire is congested if multiple connections try to use it, a switch is said to be congested when it is in a static part, but is controlled by a dynamic bit. In both cases the congestion represents situations which need to be avoided in the final solution. StaticRoute is based on the pathfinder algorithm [10, 36] and uses the negotiated congestion mechanism to resolve both wire and switch congestion.

In this paper we also explore how the static part is best chosen in the FPGA’s configuration memory. We use an architecture provided in the VTR framework, which is based on the commercially available Altera Stratix IV FPGA [41]. Furthermore, we research the impact on the maximum clock frequency of the circuits as the number of circuits increases. Our experiments show that, when the number of circuits is limited, the reconfiguration time can be reduced by 34 %, while the impact on the maximum clock frequency is acceptable.

This paper starts with an introduction to FPGAs in Sect. 2 and an overview of related work in Sect. 3. Then, in Sect. 4, we compare the conventional DPR tool flow and our newly proposed tool flow using StaticRoute. StaticRoute is described in more detail in Sect. 5. In Sect. 6.1 we present the results obtained in our first experiments on a simple 4-LUT architecture. Sections 6 and 7 present the results of more thorough experiments done on a 6-LUT architecture based on the commercially available Stratix IV FPGA. Section 6 explores how the static part is best chosen. In Sect. 7 the impact on performance is discussed as the number of circuits increases. Finally, we present our conclusions in Sect. 8.

2 FPGA configuration

For background, we give an overview of the typical FPGA architecture. We discuss a simplified version of current commercial FPGAs.

2.1 FPGA architecture

Commercial island-style FPGAs consist of a grid of Configurable Logic Blocks (CLB), Input Output Blocks (IOB) and a routing network (Fig. 1). Typically, an FPGA contains tens of thousands of CLBs and hundreds of IOBs.
Fig. 1

Schematic of a stripped down FPGA with 4 CLBs, routing network and IOBs. The small empty squares represent signal sources, the filled squares represent signal sinks. A small circuit using 2 CLBs is implemented on the FPGA

The Input Output Blocks are connected to the external pins of the chip and thus allow communication with the outside world.

The Configurable Logic Blocks contain LookUp Tables (LUT) and flip-flops that can be combined to implement any digital circuit (Fig. 2a). The LUTs implement arbitrary Boolean functions of their inputs using a truth table stored in configuration memory. The truth table contains the desired value for each combination of values of the inputs.
Fig. 2

Schematic of a CLB and detail of the routing network. a Schematic of a CLB with 4 inputs. Configuration memory is shown as small circles. b Schematic of a connection point between wires of the routing network. Configuration memory is shown as small circles

To implement sequential logic, the CLB’s flip-flop can be used to store the output of a LUT. A configuration bit determines whether or not the flip-flop is used. The inputs and outputs of the CLBs can be connected to each other and the IOBs using the FPGA’s routing network. This consists of a large number of wires laid out between the grid of CLBs. Connections between wires and between wires and CLBs or IOBs are made using multiplexers, which are grouped in switch blocks and connection blocks (Fig. 2b). The state of these multiplexers is also stored in configuration memory.

The FPGA has special infrastructure and ports to write (and read) this configuration memory.

3 Related work

In [44] the modular design flow of Xilinx is described. This is the basic flow that is used to implement dynamic partial reconfiguration on Xilinx FPGAs. It is the same as the conventional DPR flow described in this paper. It runs the tool flow separately for each circuit and therefore reconfigures almost the complete reconfigurable region. Unlike our work, it does not try to minimise reconfiguration time. Xilinx also provides a difference-based partial reconfiguration flow [20]. In this flow a designer is able to manually make low-level changes to one design, after which the flow generates a partial bitstream that incorporates only these changes. In contrast to the difference-based flow, our work considers several circuits being time-multiplexed on the same region in an automatic way.

Considerable research has been done on reducing the reconfiguration time of dynamic partial reconfiguration. Several authors consider changing the configuration infrastructure of the FPGA to speed up the configuration process. The authors in [22, 49, 50] propose a time-multiplexed FPGA in which each configuration memory cell is backed by 8 bits of inactive storage in the configuration SRAM. This architecture was specifically designed for rapid switching between a limited number of contexts. In [34] the authors extend a regular configuration memory with a barrel shifter. This architectural enhancement speeds up not only reconfiguration, but also the relocation of portions of the configuration memory. In [17] an extension is proposed to the configuration infrastructure that allows compressed bitstreams to be handled directly. In contrast to these works, our novel tool flow does not require any changes to the configuration infrastructure. The FPGA only needs to support partial reconfiguration.

Several publications, such as [15, 18, 26, 45], focus on reducing the reconfiguration time on a higher abstraction level in the tools, namely the one of modules or task graphs. They addressed the problem of scheduling several circuits on several reconfigurable regions in such a way that execution time is optimized. As main objective they try to overlap the configuration time and calculation time of tasks. The possibility of changing the location of a module or task, a concept called relocation, is also investigated in this context [9, 25].

Another approach is expediting the actual reconfiguration process. In [19, 30] this is done by using configuration pre-fetching. In this case, the configuration bit stream is partly loaded to on-chip memory before the actual reconfiguration process starts. This is of course only possible in applications where the new configuration is known beforehand.

Several authors show that the hardware that interfaces the internal reconfiguration port, the Hardware Internal Configuration Access Port (HWICAP), can be sped up significantly [13, 19, 21]. The techniques considered are Direct-Memory-Access and overclocking. In [13] custom circuitry is added to detect whether reconfiguration was successful.

In [24], it is shown that the inclusion of the configuration access port into the data path of a processor core using a Fast Simplex Link (FSL), instead of the Processor Local Bus (PLB) interface to the HWICAP, results in a speed up of the reconfiguration process.

When only Look-up tables (LUTs) need to be reconfigured and no routing, it is possible to reduce reconfiguration time using Shift-Register-LUTs instead of the internal configuration access port (ICAP) [2, 5, 23]. In [23] an algorithm is proposed that solves a variant of the travelling salesman problem for this purpose.

When the number of circuits that are time-multiplexed on the same FPGA region is limited, joint optimization approaches can be considered to reduce area. In [11], the authors use a high-level tool, called GAUT [16], to jointly optimize different data flow graphs of digital signal processing applications, by maximizing the similarities between control steps. They don’t use dynamic partial reconfiguration, but add multiplexers where necessary. In [43] the authors attempt to increase the correlation between the configurations by placing the LUTs of the different circuits in such a way that the connections between the LUTs overlap, a technique they call edge matching. The overlapping connections can be implemented using the same routing resources and thus the correlation between the different routing configurations is increased. They do not, however, consider the organization of configuration bits in frames. Also, in our previous work we showed that edge matching increases the total wire length of the circuits considerably [3].

Others focus on increasing the correlation between the truth tables of LUTs that occupy the same physical LUT [12, 39, 40]. This is done by changing pins to which the LUT inputs connect and taking advantage of don’t-care values. In [12] the placement is adapted to further increase the correlation of the LUTs. Increasing the correlation between LUTs can only have a limited impact when trying to reduce the reconfiguration time, as most configuration bits control how the FPGA’s interconnection network is configured. In our work, for example, the contribution of CLBs (that contain several LUTs) to the total configuration is found to be around 30 %. Also, a static CLB frame is only generated when all the bits of that frame correspond for the different circuits.

In previous work we chose to focus on increasing the correlations between the configuration of the routing of several circuits [3, 6]. One of the findings there was that constraining the placement of circuits to increase the correlation between configurations tends to increase the total wire length of the circuits significantly. One of the conclusions therefore was that it is better to retain placement and to focus on the routing algorithm. This previous work, and also the one in [12, 39, 40, 43], does not take into consideration the organisation of the configuration memory in frames. Reducing the number of dynamic bits in a frame is not enough. A frame has to be completely static for it not to be reconfigured when switching between circuits. The work in this paper is similar to that in [3, 6, 43] as it also focuses on reducing the reconfiguration time of the FPGA’s interconnection network. However, our novel tool flow for DPR does take the frames of the configuration memory into consideration.

There is also previous work from other authors that focuses on reducing the number of frames that need configuration. In [31] it is proposed to adjust the router for this purpose. This is done to reduce the load time of applications on FPGAs in general; it does not consider dynamic partial reconfiguration. The work in [48] adjusts the placer to reduce the number of frames and also considers dynamic partial reconfiguration. However, they do not consider increasing the correlation between the routing configurations of the different circuits as in this paper. Their approach also reduces the size of the bitstream of each of the circuits separately.

The work done in this paper is complementary to the work on extending the regular configuration infrastructure, high-level task scheduling and expediting the actual reconfiguration process. It fits particularly well with the scheduling work in which a portion of the FPGA is used to time-multiplex different tasks. Our tool could be used to significantly decrease the time needed to switch between tasks in one reconfigurable region.

Our novel tool flow containing StaticRoute was first presented in [4]. The experiments in this previous work only implemented 2 circuits in the reconfigurable region, were done on a simple 4-LUT based architecture and only looked at the total wire length of a circuit as a metric for performance. Also, only marking the switch blocks as static was considered. In contrast to that work, we now explore how the static part is best chosen. In this more thorough exploration also the connection blocks are considered. Furthermore, experiments that implement more than 2 circuits in the reconfigurable region are presented. We also do the experiments on a realistic FPGA architecture with 6-LUTs and more complex configurable logic blocks, which is based on the Altera Stratix IV FPGA. Finally, we assess the actual impact on the maximum attainable clock frequency using the timing analyser available in the VTR framework [41].

4 Novel DPR tool flow

With dynamic partial reconfiguration (DPR) it is possible to implement different circuits, that are not needed at the same time, on the same FPGA area. This area is generally called the reconfigurable region (RR). Whenever one wants to change the implemented circuit, an amount of time is needed to rewrite the configuration memory. This is called the reconfiguration time. The subsystem that performs the reconfiguration is called the reconfiguration manager and is generally implemented in software. In this section we discuss two tool flows that use DPR: the conventional DPR flow and our novel approach using StaticRoute.

4.1 Conventional DPR flow

The conventional DPR tool flow implements every circuit separately in the reconfigurable region by following the typical steps of an FPGA CAD flow (synthesis, technology mapping, placement and routing), as shown in Fig. 3a. The flow generates a configuration for each circuit. To switch between the different circuits the reconfiguration manager overwrites the reconfigurable region with the appropriate configuration.
Fig. 3

The conventional DPR tool flow for implementing two circuits (a), compared to our novel approach which uses StaticRoute (b)

After implementation, there is a configuration available of the RR for each circuit. Each memory cell of the RR then corresponds to one binary value for each circuit. When these binary values are the same, they are called a static bit. Otherwise they are called a dynamic bit. If a memory cell contains a static bit, it means it has the same value for the different implemented circuits.

Static bits do not need to be rewritten when switching between circuits during run-time. However, in current FPGAs, the reconfiguration granularity is a set of memory cells called a frame. A whole frame needs to be rewritten, even when only one memory cell of the frame contains a dynamic bit. The problem with conventional DPR systems is illustrated in Fig. 4a: the dynamic bits are scattered over the frames of the configuration memory, making it necessary to reconfigure the complete reconfigurable region [44]. This may lead to reconfiguration times that are too long for most dynamic applications [14, 38].
Fig. 4

Scattering of the dynamic bits when using the conventional DPR tool flow (a), compared to the clustering in dynamic frames using StaticRoute (b)

4.2 Novel tool flow using StaticRoute

Before we present our novel tool flow, we make two observations regarding the configuration process. First, if we look at the configuration memory, we see that most configuration bits are used for the programmable interconnection network. Figure 5 shows that in our experiments only 30 % of the configuration memory is used for the Configurable Logic Blocks (CLBs). The rest is used for the configuration of the switch blocks (SBs) and connection blocks (CBs) in the routing. This is the main reason we focus on the routing part in the tool flow. Second, the actual problem is the scattering of dynamic bits over the frames. This is why, in our method, the configuration memory of the RR’s routing is split into two parts: a static part and a dynamic part. The goal of our novel router, StaticRoute, is to cluster dynamic bits in the dynamic part, so that only those configuration frames need to be reconfigured. This is illustrated in Fig. 4b.
Fig. 5

The relative contribution of CLBs and routing to the size of the configuration memory
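
As a rough consistency check of the figures quoted in the abstract, assume that the reconfiguration time is proportional to the amount of configuration memory that is rewritten. Writing \(r\) for the share of the routing in the configuration memory and \(s\) for the fraction of the routing configuration that is made static (both symbols are introduced here for illustration only), the reduction in total reconfiguration time is approximately
$$\begin{aligned} \text {reduction} \approx r \cdot s. \end{aligned}$$
With \(s = 0.5\) and \(r \approx 0.7\) (the complement of the roughly 30 % attributed to the CLBs), this gives about 35 %, which is close to the reported 34 %. Read the other way, a 34 % reduction at \(s = 0.5\) would correspond to \(r = 0.68\).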

The proposed tool flow is presented in Fig. 3b. Instead of running the tool flows completely separately for the different circuits, the idea is to have a joint routing of the circuits. In this case, the tool flow is run separately until placement, generating a placed design for each circuit. Then the nets of all the circuits are merged into one set of nets. The tool flow was designed this way because a previous study concluded that a joint placement algorithm tends to increase the wire length of the circuits significantly and it therefore was better to focus solely on the routing algorithm [6].

The set of merged connections is then routed with StaticRoute. StaticRoute thus routes the nets of all circuits simultaneously. This is of course only viable when the number of circuits is limited. StaticRoute detects dynamic bits and clusters them in the dynamic routing frames. Figure 3b shows that the result of our novel tool flow is one static configuration and a dynamic configuration for every circuit. The static configuration contains the binary values for the static part of the RR’s routing. It only needs to be loaded to the FPGA’s configuration memory once at start-up. The dynamic configurations contain the remainder of the configuration of the RR, needed to reconfigure the Configurable Logic Blocks (CLBs) and the dynamic part of the routing. These are used to switch between circuits during run-time. Since the dynamic configurations are much smaller than a configuration of the complete RR, reconfiguration time can be reduced considerably.
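
The output of the flow can be pictured with the following minimal sketch. It assumes, for illustration only, that each configuration is a mapping from frame ID to frame contents and that the set of static frame IDs is known; `split_configurations` and the data are hypothetical and are not part of the authors' tooling.

```python
from typing import Dict, Set, Tuple

Frames = Dict[int, Tuple[int, ...]]  # frame ID -> bit values of that frame


def split_configurations(configs: Dict[str, Frames], static_frames: Set[int]):
    """Split full per-circuit configurations into one shared static part
    and one dynamic part per circuit."""
    reference = next(iter(configs.values()))
    static_cfg = {f: reference[f] for f in sorted(static_frames)}
    # A frame marked static must be identical for every circuit.
    for name, frames in configs.items():
        for f, bits in static_cfg.items():
            if frames[f] != bits:
                raise ValueError(f"frame {f} of {name} is not static")
    dynamic_cfgs = {
        name: {f: bits for f, bits in frames.items() if f not in static_frames}
        for name, frames in configs.items()
    }
    return static_cfg, dynamic_cfgs


# Made-up example: frame 0 is a static routing frame, frames 1 and 2 are dynamic.
configs = {
    "circuit_1": {0: (0, 1, 1, 0), 1: (1, 0, 0, 0), 2: (0, 0, 1, 1)},
    "circuit_2": {0: (0, 1, 1, 0), 1: (0, 0, 1, 0), 2: (1, 0, 0, 1)},
}
static_cfg, dynamic_cfgs = split_configurations(configs, static_frames={0})
print(static_cfg, dynamic_cfgs)
```

In this sketch the static part would be written once at start-up, and only the two dynamic frames of the selected circuit would be written on each switch.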

Section 5 discusses StaticRoute in more detail. As the static part should not be chosen randomly in the configuration memory of the FPGA, Sect. 6 explores how the static part is best chosen. Our approach is specifically focused on the case where the number of circuits to be implemented in the reconfigurable region is limited. Section 7 discusses how our technique impacts the maximum clock frequency as the number of implemented circuits increases.

5 StaticRoute

As can be seen in Fig. 3, our novel tool flow reuses the synthesis, technology mapping and placement tools of the conventional DPR flow. However, at the heart of our flow is a novel router called StaticRoute.

StaticRoute is based on the pathfinder algorithm [10, 36], the most commonly used algorithm for FPGA routing. The first part of this section therefore describes pathfinder.

Before StaticRoute is used, the configuration frames that control the routing are marked as being either static or dynamic. We note that this happens in the CAD software and therefore is possible on any commercially available SRAM-based FPGA. Then StaticRoute routes the connections of all circuits simultaneously in such a way that there are no dynamic bits in memory cells that reside in static frames. To be able to do this, only the representation of the FPGA’s architecture needs to be extended. The FPGA architecture itself does not need any adaptation. This is explained in Sect. 5.2.

Detecting dynamic bits after the configurations are generated is easy. When a memory cell has different values in the different configurations, it contains a dynamic bit. This means it will have to be rewritten during run-time. In Sect. 5.3 we show, however, that it is also possible to detect dynamic bits during routing.

Finally, Sect. 5.4 handles how StaticRoute extends the cost function of pathfinder, so that dynamic bits are avoided in the static part of the configuration memory.

5.1 The Pathfinder algorithm

A conventional router calculates the Boolean values that need to be stored in the memory cells of the configurable interconnection network so that the physical logic blocks are connected as specified by the nets in the mapped circuit. The main algorithm used to solve this problem is pathfinder [10, 36].

pathfinder presents the available routing resources of the FPGA in an easy-to-explore data structure, the routing resource graph (RRG). An example of part of an RRG, representing a routing multiplexer, is presented in Fig. 6. The RRG is a directed graph, where each node represents a routing wire on the FPGA and each directed edge represents a routing switch on the FPGA.1
Fig. 6

A routing multiplexer (a) with its corresponding routing resource graph (b)

In the pathfinder algorithm, the connections that need to be routed are organized in nets. These are sets of connections that share the same source. During the first routing iteration, nets can share resources at no extra cost and thus, each net is routed with a minimum number of wires. In subsequent routing iterations, the algorithm rips up and reroutes all the nets in the input circuit. A wire is said to be congested if it is used by more than one net. Wire congestion is not allowed in the final solution because this results in short-circuits. That is why the routing iterations are repeated until no shared resources exist or, in other words, the wire congestion is resolved. This is achieved by gradually increasing the cost of sharing resources between nets, a technique called negotiated congestion. The cost function of a wire in the RRG is
$$\begin{aligned} cost(n)=b(n)\cdot p(n) \cdot h(n), \end{aligned}$$
(1)
where \(b(n)\) is the base wire cost (equal to 1), \(p(n)\) is the present wire congestion penalty and \(h(n)\) is the historical wire congestion penalty.

The factor \(p(n)\) is used to avoid wire congestion during one routing iteration. The factor \(h(n)\) is used to make heavily used resources in past routing iterations more expensive. In this way a wire congestion map is built, which enables nets to avoid routing through heavily congested wires, if possible.

The present congestion penalty, \(p(n)\), is updated whenever a net is rerouted. The update is done as follows
$$\begin{aligned} p(n) = \left\{ \begin{array}{ll} 1 &{} \text {if } c(n) > o(n)\\ 1+ p_{f}.(o(n)-c(n)+1) &{} \text {otherwise} \end{array}\right. \end{aligned}$$
(2)
where \(c(n)\) represents the capacity of the node and \(o(n)\) is the occupancy of the node. The capacity is the maximum number of nets that can legally use the routing resource. The occupancy of a node is the number of nets that are presently using it. The factor \(p_{f}\) is used to increase the sharing cost as the algorithm progresses. This is explained below.
The historical congestion penalty is updated after every routing iteration. The update is done as follows
$$\begin{aligned} h^i(n) = \left\{ \begin{array}{ll} 1 &{} \text{ if } i=1\\ h^{(i-1)}(n) &{} \text {if } c(n) \ge o(n)\\ h^{(i-1)}(n)+ h_{f}.(o(n)-c(n)) &{} \text {otherwise} \end{array}\right. \end{aligned}$$
(3)
Again, the factor \(h_{f}\) is used to control the impact of the historical congestion penalty on the total resource cost.

The term \(o(n)-c(n)\) represents the overuse of a node. Note that both the present and historical congestion mechanisms associate higher penalties with higher overuse.

The way the factors \(p_{f}\) and \(h_{f}\) change as the algorithm progresses is called the routing schedule. The routing schedule proposed in [10] is used. In this schedule, \(h_{f}\) is held equal to 1 independent of the iteration. On the other hand, \(p_{f}\) is initially set to 0.5 and is doubled in every subsequent iteration. More details on pathfinder can be found in [10].
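
The negotiated congestion loop and Eqs. 1 to 3 can be sketched as follows. This is a simplified illustration and not the implementation used in the paper: every net is reduced to a single two-terminal connection, the graph and all names are made up, and \(p_{f} = 0.5\) is already applied in the first iteration.

```python
import heapq
from collections import defaultdict


def present_penalty(occ, cap, p_f):
    """Eq. 2: penalty for adding one more net to a wire."""
    return 1.0 if cap > occ else 1.0 + p_f * (occ - cap + 1)


def route(graph, src, dst, occ, hist, cap, p_f):
    """Dijkstra search; entering wire n costs b(n) * p(n) * h(n), b(n) = 1 (Eq. 1).
    Assumes dst is reachable from src."""
    dist, prev, heap = {src: 0.0}, {}, [(0.0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            break
        if d > dist[node]:
            continue
        for nxt in graph.get(node, ()):
            nd = d + 1.0 * present_penalty(occ[nxt], cap, p_f) * hist[nxt]
            if nd < dist.get(nxt, float("inf")):
                dist[nxt], prev[nxt] = nd, node
                heapq.heappush(heap, (nd, nxt))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def pathfinder(graph, connections, cap=1, h_f=1.0, max_iters=20):
    occ = defaultdict(int)            # o(n): nets currently using wire n
    hist = defaultdict(lambda: 1.0)   # h(n): starts at 1 (Eq. 3, i = 1)
    routes = {}
    p_f = 0.5                         # initial value of the routing schedule
    for _ in range(max_iters):
        for name, (src, dst) in connections.items():
            for n in routes.get(name, ()):   # rip up the previous route
                occ[n] -= 1
            routes[name] = route(graph, src, dst, occ, hist, cap, p_f)
            for n in routes[name]:
                occ[n] += 1
        overused = [n for n, o in occ.items() if o > cap]
        if not overused:
            return routes             # no wire congestion left
        for n in overused:            # Eq. 3: accumulate historical penalty
            hist[n] += h_f * (occ[n] - cap)
        p_f *= 2                      # schedule: double p_f every iteration
    raise RuntimeError("wire congestion not resolved")


# Toy graph: both connections prefer the shared wire "m".
graph = {
    "a_src": ["m", "a1"], "a1": ["a2"], "a2": ["a_dst"],
    "b_src": ["m", "b1"], "b1": ["b2"], "b2": ["b3"], "b3": ["b_dst"],
    "m": ["a_dst", "b_dst"],
}
connections = {"A": ("a_src", "a_dst"), "B": ("b_src", "b_dst")}
routes = pathfinder(graph, connections)
print(routes)
```

In this toy example both connections first share wire `m`. After one update of the penalties, connection A, which has the cheaper detour, moves away and B keeps `m`.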

5.2 Extended routing resource graph

In a standard RRG the nodes represent wires and the directed edges represent switches. StaticRoute does not make use of a standard RRG, but of an extended RRG. This extended RRG does not only represent the wires as nodes but also the switches. An example of an extended RRG is shown in Fig. 7. The round nodes are wires and the square nodes switches.
Fig. 7

An example of a multiplexer controlled by some dynamic bits (a) and one controlled only by static bits (b)

Such a representation is necessary for two reasons. First, it is possible to mark certain switches as being static. The rest of the switches are considered dynamic. Of course, the static (dynamic) switches are the ones that are controlled by memory cells residing in frames that are marked static (dynamic). This way the RR’s routing, and also its configuration memory, is split up in a static and a dynamic part. Second, in an extended RRG certain information can be associated with the switches during routing. This can be a cost or information on which circuits are using a certain switch.

We note that this is an extension of the representation of the routing architecture. Also marking the switches as being either static or dynamic happens in the CAD software and not in the actual architecture. The FPGA and its routing architecture do not need any adaptations for our algorithm.
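
One possible way to build such an extended RRG from a standard one is sketched below. The representation is an assumption made for illustration: edges are (input wire, output wire) pairs, `frame_of` is a hypothetical map from each switch to the frame holding its memory cell, and the node naming is arbitrary.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (input wire, output wire) of one routing switch


def extend_rrg(edges: List[Edge], frame_of: Dict[Edge, int], static_frames: Set[int]):
    """Turn every switch (edge) of a standard RRG into a node of its own and
    mark it static when its memory cell lies in a frame marked static."""
    graph: Dict[str, List[str]] = {}
    is_static: Dict[str, bool] = {}
    for w_in, w_out in edges:
        switch = f"sw:{w_in}->{w_out}"
        graph.setdefault(w_in, []).append(switch)   # wire -> switch
        graph.setdefault(switch, []).append(w_out)  # switch -> wire
        graph.setdefault(w_out, [])
        is_static[switch] = frame_of[(w_in, w_out)] in static_frames
    return graph, is_static


# Made-up 3-input routing multiplexer; two of its switches sit in static frame 0.
edges = [("in0", "out"), ("in1", "out"), ("in2", "out")]
frame_of = {("in0", "out"): 0, ("in1", "out"): 0, ("in2", "out"): 1}
graph, is_static = extend_rrg(edges, frame_of, static_frames={0})
print(graph)
print(is_static)
```

Because the switches are now nodes, per-switch data such as a cost or the set of circuits using the switch can be stored next to them, which is the second reason given above.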

5.3 Detecting dynamic bits in the extended RRG

As can be seen in Fig. 3b, the input to StaticRoute is the list of the nets of all circuits that need to be implemented in the reconfigurable region. These nets are all annotated with the ID of the circuit they belong to. When a net uses a node during routing, that node is also annotated with the circuit ID. Each wire thus stores the up-to-date set of circuits that use it. Our starting point in this section is therefore an extended RRG annotated with this information.
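
A minimal sketch of this bookkeeping is given below. The net names, node names and paths are invented for illustration, and the paths are simply given rather than computed by a router.

```python
from collections import defaultdict


def merge_nets(nets_per_circuit):
    """Flatten per-circuit net lists into one list of (circuit ID, net) pairs."""
    return [(cid, net) for cid in sorted(nets_per_circuit) for net in nets_per_circuit[cid]]


def annotate_usage(routed):
    """routed: (circuit ID, path) pairs, a path being a list of extended-RRG nodes.
    Returns, for every node, the set of circuits that use it."""
    users = defaultdict(set)
    for cid, path in routed:
        for node in path:
            users[node].add(cid)
    return users


# Hypothetical nets of three circuits and the paths assumed to route them.
nets_per_circuit = {1: ["n_a"], 2: ["n_b"], 3: ["n_c"]}
paths = {
    "n_a": ["w_in", "S", "w_out"],
    "n_b": ["w_in", "S", "w_out"],
    "n_c": ["w_mid", "S_mid", "w_out"],
}

merged = merge_nets(nets_per_circuit)
users = annotate_usage([(cid, paths[net]) for cid, net in merged])
print(dict(users))
```

Here switch `S` and wire `w_in` end up annotated with circuits 1 and 2, whereas `w_out` is annotated with circuits 1, 2 and 3.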

As will be explained later, a switch marked as static that also contains a dynamic bit will be associated with an extra cost. This results in a final solution in which this does not occur any more. That is why we need to detect dynamic bits, to be able to calculate that extra cost. In this subsection we only discuss the detection of whether a switch contains a static or dynamic bit, regardless of being in a switch that is marked static or dynamic. The novel cost function that combines the detection of dynamic bits with the information about the switches is discussed in Sect. 5.4.

Let us assume 4 circuits, numbered 1 to 4, to be implemented in the RR. In Fig. 7a we see a routing multiplexer of the RR, represented as an extended RRG. It connects the top wire to its output for circuits 1 and 2. The middle wire is connected to the output for circuit 3. Circuit 4 does not use the routing multiplexer and therefore is not shown in the figure. Let us focus on the top switch. It follows that this switch needs to be closed for circuits 1 and 2. It needs to be open for circuit 3, so as not to add the extra capacitance of the wires of the other circuits. It has a don’t-care value for circuit 4, because this circuit is not using this multiplexer. In this case, the switch clearly is controlled by a dynamic bit, since it has different values for different circuits.

Let us look at a second example in Fig. 7b. In this case the top switch has value 1 for circuits 1, 2 and 3. And it has a don’t-care value for circuit 4. It is clear that when a switch and its connected wires are used by the same circuits, it does not have to be changed during run-time. The switch is closed for the circuits that use it and has a don’t-care value set to 1 for the other circuits. The remaining switches are static, because they are not used by any circuit. They always have the static value zero.

In general, in the extended RRG, a switch node \(S\) connects two wire nodes \(W_{in}\) and \(W_{out}\). Let us assume that \(S\) is used by a set of circuits \(C_S\). \(W_{in}\) and \(W_{out}\) are used by \(C_{in}\) and \(C_{out}\) respectively. We state that \(S\) is controlled by a dynamic bit if:
$$\begin{aligned} ( (C_S \ne C_{in}) \vee (C_S \ne C_{out}) ) \wedge C_S \ne \phi . \end{aligned}$$
(4)
Fig. 8

An example of a switch S controlled by a dynamic bit

Figure 8 shows an example with all the terminology indicated on the figure. In this figure we see that both switch \(S\) and wire node \(W_{in}\) are used by circuits 1 and 2, whereas the wire node \(W_{out}\) is used by circuits 1, 2 and 3. Switch \(S\) therefore contains a dynamic bit as \(C_S\) is different from \(C_{out}\).

There is, however, a property that allows this condition to be simplified formally: if a circuit uses a switch, then this circuit will also use both connected wires. The condition to detect a dynamic bit in switch \(S\) is therefore reduced to:
$$\begin{aligned} ( (\vert C_S \vert < \vert C_{in} \vert ) \vee (\vert C_S \vert < \vert C_{out} \vert ) ) \wedge \vert C_S \vert \ne 0. \end{aligned}$$
(5)
This equation is a formal simplification because only the sizes of the sets are compared and not their elements. The conditions \(C_S \ne \phi \) and \(\vert C_S \vert \ne 0\) are necessary to exclude unused switches, which are always static. We note that Eqs. 4 and 5 are equivalent, so the choice between them does not affect the end result. Both detect the same dynamic bits and therefore yield exactly the same results in terms of reduction in reconfiguration time and maximum clock frequency.
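The two detection conditions can be illustrated with a minimal Python sketch. The function names and the use of `frozenset` to hold circuit IDs are assumptions made for illustration, not the data structures of StaticRoute. The size-based test of Eq. 5 is only valid under the property stated above, namely that \(C_S\) is a subset of both \(C_{in}\) and \(C_{out}\).

```python
from typing import FrozenSet

Circuits = FrozenSet[int]  # hypothetical representation of a set of circuit IDs


def is_dynamic_eq4(c_s: Circuits, c_in: Circuits, c_out: Circuits) -> bool:
    """Eq. 4: compare the circuit sets themselves."""
    return (c_s != c_in or c_s != c_out) and len(c_s) != 0


def is_dynamic_eq5(c_s: Circuits, c_in: Circuits, c_out: Circuits) -> bool:
    """Eq. 5: compare only set sizes (assumes c_s is a subset of c_in and c_out)."""
    return (len(c_s) < len(c_in) or len(c_s) < len(c_out)) and len(c_s) != 0


# Fig. 8: S and W_in are used by circuits {1, 2}, W_out by {1, 2, 3}.
fig8 = (frozenset({1, 2}), frozenset({1, 2}), frozenset({1, 2, 3}))
# Fig. 7b, top switch: switch and both wires are used by circuits {1, 2, 3}.
fig7b_top = (frozenset({1, 2, 3}), frozenset({1, 2, 3}), frozenset({1, 2, 3}))
# An unused switch between two used wires is always static.
unused = (frozenset(), frozenset({3}), frozenset({1, 2, 3}))

for name, sets in (("fig8", fig8), ("fig7b_top", fig7b_top), ("unused", unused)):
    print(name, is_dynamic_eq4(*sets), is_dynamic_eq5(*sets))
```

Both functions agree on these three cases: only the switch of Fig. 8 is reported as being controlled by a dynamic bit.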

5.4 Novel cost function

In the previous sections we explained that in the extended RRG some switches are marked as being static. We also presented a way to detect dynamic bits in the extended RRG. In this section we introduce the term switch congestion. A switch is said to be congested when it is marked as static, but is controlled by a dynamic bit.

In the pathfinder algorithm the cost of using a wire only takes into account wire congestion. The nets are ripped up and rerouted until there are no wires that are congested. In this section we describe how we extended this algorithm to also take switch congestion into consideration. StaticRoute rips up and reroutes the nets of all circuits until all wire and switch congestion is resolved.

In the pathfinder algorithm a connection of a net is routed by searching the path of wire nodes with lowest cost in the RRG. In our algorithm the same happens in the extended RRG, except that an extra cost per wire is added to take switch congestion into consideration. In our novel cost function, the cost of a node in the extended RRG is
$$\begin{aligned} cost(n,c)= \left\{ \begin{array}{ll} cost_w(n,c) + cost_s(n) &{} \text {if n is a wire}\\ 0 &{} \text {if n is a switch} \end{array} \right. \end{aligned}$$
(6)
where \(cost_w(n,c)\) is the wire congestion cost and \(cost_s(n)\) the switch congestion cost associated with wire node \(n\) and circuit \(c\) (we discuss the influence of \(c\) later).

From Eq. 5 it follows that when a wire is used, the congestion of all the static switches that are connected to it in the extended RRG is affected. That is why the cost of switch nodes in a path through the RRG is zero and only the wires contribute to the cost of a net. Switch nodes are used to hold the information needed to determine the switch congestion penalty \(cost_s(n)\).

The term \(cost_w(n,c)\) takes wire congestion into consideration and is very similar to Eq. 1. Remember that StaticRoute routes the nets of all circuits simultaneously. However, when a net of the circuit \(c\) is routed, only the other nets of \(c\) are taken into consideration for the wire congestion. This is because nets of different circuits do not cause wire congestion. They are never present on the FPGA at the same time and therefore can share wires. The equation for \(cost_w(n,c)\) is
$$\begin{aligned} cost_w(n,c)= p(n,c) \cdot h(n,c), \end{aligned}$$
(7)
where \(p(n,c)\) and \(h(n,c)\) are the present and history wire congestion penalties for circuit \(c\). These are calculated as in Eqs. 2 and 3; the only difference is that \(o(n)\) is replaced with \(o(n,c)\).
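The per-circuit occupancy \(o(n,c)\) can be illustrated with a minimal Python sketch. The class name `WireOccupancy`, the wire identifiers and the capacity of one net per wire are assumptions made for illustration; the penalty formulas of Eqs. 2 and 3 are deliberately left out.

```python
class WireOccupancy:
    """Tracks how many nets of each circuit use each wire (hypothetical helper)."""

    def __init__(self, capacity: int = 1):
        self.capacity = capacity  # assumed: one net per wire
        self._occ = {}            # (wire, circuit) -> number of nets

    def add_net(self, wire: str, circuit: int) -> None:
        key = (wire, circuit)
        self._occ[key] = self._occ.get(key, 0) + 1

    def o(self, wire: str, circuit: int) -> int:
        """Occupancy o(n, c): only nets of circuit c are counted."""
        return self._occ.get((wire, circuit), 0)

    def overuse(self, wire: str, circuit: int) -> int:
        """Amount by which circuit c alone exceeds the wire capacity."""
        return max(0, self.o(wire, circuit) - self.capacity)


occ = WireOccupancy()
occ.add_net("w0", circuit=1)
occ.add_net("w0", circuit=2)  # different circuit: shares the wire without congestion
occ.add_net("w0", circuit=1)  # second net of circuit 1: congestion for circuit 1 only
print(occ.overuse("w0", 1), occ.overuse("w0", 2))
```

In this sketch the wire is overused for circuit 1 but not for circuit 2, although three nets are mapped onto it in total.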

The term \(cost_s(n)\) takes switch congestion into consideration. As mentioned in the previous section, to determine whether a switch is congested the information associated with both the wires it connects is needed. Therefore the router calculates the cost for switch congestion using the union of the fan-in switch nodes of the current wire node \(n\) and the fan-out switch nodes of the previous wire node in the routing path currently being evaluated. This set of switch nodes is called \(S(n)\). Figure 9 shows an example where the switches of \(S(n)\), associated with a wire node \(n\), are identified in black. In this set \(S(n)\), we use Eq. 5 to identify the subset of congested switches, which we call \(C(n)\).

Given a wire node \(n\), with its associated set of congested switches \(C(n)\), we propose the following equation for the switch congestion penalty
$$\begin{aligned} cost_s(n)= p_s(n) \cdot h_s(n), \end{aligned}$$
(8)
where \(p_s(n)\) and \(h_s(n)\) are the present and history switch congestion penalty. The factor \(p_s(n)\) resolves switch congestion during one iteration and is given by:
$$\begin{aligned} p_s(n)= 1 + \vert C(n) \vert \cdot p_{f}. \end{aligned}$$
(9)
Note that if the use of a wire results in more congested switches this wire is more penalized. This is similar to the pathfinder algorithm, in which wires with more overuse result in a higher penalty.
Fig. 9

Example where the switches of the set S(n) for a wire node n are indicated in black in the extended RRG

The factor \(h_s(n)\) takes into consideration the switch congestion that occurred in the previous iterations. It uses the congestion map that is built in the switch nodes. It is given by:
$$\begin{aligned} h_s(n)= \underset{m \in C(n)}{\sum } h_s(m), \end{aligned}$$
(10)
where \(h_s(m)\) is the history switch congestion penalty of one switch node \(m\). This is updated every routing iteration \(i\) as follows:
$$\begin{aligned} h_s^{i}(m) = \left\{ \begin{array}{ll} 0 &{} \text {if } i=1\\ h_s^{(i-1)}(m) &{} \text {if m is not congested}\\ h_s^{(i-1)}(m)+ h_{f} &{} \text {otherwise} \end{array}\right. \end{aligned}$$
(11)
The way the factors \(p_{f}\) and \(h_{f}\) change as the algorithm progresses is called the routing schedule. We use the same routing schedule as the pathfinder algorithm.
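Equations 8 to 11 can be sketched in a few lines of Python. The `SwitchNode` class, its field names and the numeric values of \(p_f\) and \(h_f\) below are assumptions made for illustration. The sketch takes the set \(S(n)\) as given and does not show how it is collected from the extended RRG.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class SwitchNode:
    marked_static: bool      # switch lies in the static part of the configuration
    has_dynamic_bit: bool    # outcome of Eq. 5 for the current routing
    history: float = 0.0     # h_s(m); starts at 0 as in the first case of Eq. 11

    @property
    def congested(self) -> bool:
        # A switch is congested when it is marked static but needs a dynamic bit.
        return self.marked_static and self.has_dynamic_bit


def switch_congestion_cost(s_n: Iterable[SwitchNode], p_f: float) -> float:
    """cost_s(n) = p_s(n) * h_s(n) for the switch set S(n) of a wire node n."""
    c_n = [m for m in s_n if m.congested]   # C(n)
    p_s = 1.0 + len(c_n) * p_f              # Eq. 9
    h_s = sum(m.history for m in c_n)       # Eq. 10
    return p_s * h_s                        # Eq. 8


def update_history(switches: Iterable[SwitchNode], h_f: float) -> None:
    """Eq. 11 for i > 1: only congested switches accumulate history."""
    for m in switches:
        if m.congested:
            m.history += h_f


a = SwitchNode(marked_static=True, has_dynamic_bit=True, history=0.5)
b = SwitchNode(marked_static=True, has_dynamic_bit=False)
c = SwitchNode(marked_static=False, has_dynamic_bit=True)

cost = switch_congestion_cost([a, b, c], p_f=0.5)  # only a is congested
update_history([a, b, c], h_f=0.25)
print(cost, a.history, b.history, c.history)
```

Note that, with this formulation, a switch that has never been congested before contributes nothing to \(cost_s(n)\) until its history term becomes non-zero.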

6 Selection of the static part

6.1 First experiments on a simple 4-LUT architecture

To explore whether our proposed tool flow could have benefits, we first conducted experiments on a simple architecture with 4-LUTs. Since there was no timing analyser present in our initial implementation, we looked at the total wire length of each circuit to get an idea of the impact of our experiments on the performance. These experiments were first presented in [4] and are discussed in more detail in this section.

6.1.1 Benchmarks

In this section experiments were conducted using 3 different applications. In the first 2 experiments typical multi-mode applications were used: a regular expression matching application (RegExp) and an adaptive filtering application (FIR). In the last 2 experiments general MCNC benchmarks were used.

In [47] a tool was developed that can generate a hardware engine, written in VHDL, that matches a certain regular expression. In the first experiment, we chose 5 middle-sized regular expressions out of the Bleeding Edge rules set [1] and with this tool generated the corresponding circuits. In the second experiment we generated 5 fixed coefficient finite impulse response (FIR) filters. The FIR filters are fully pipelined, have 16 taps and the width of the input and the coefficients is 8 bit. The values for the coefficients were chosen randomly, after which all the constants were propagated. Such FIR filters are 3 times smaller than the generic version.

In the third experiment, we chose 5 circuits out of the general MCNC benchmark suite [53] that were of similar size to the circuits in the previous experiments (MCNC). In the fourth experiment we chose 5 of the smaller circuits out of the MCNC20 benchmark suite [53]. The names of the MCNC and MCNC20 circuits used in these experiments can be found in Table 1.
Table 1

Name of the MCNC and MCNC20 circuits used in the experiments

| Benchmark set | Circuits |
| --- | --- |
| MCNC | e64, rd73, s400, s1238, s1494 |
| MCNC20 | apex4, alu4, tseng, ex5p, misex3 |

For every set of circuits the minimum, average and maximum number of LUTs are reported in Table 2. In each set all 10 possible combinations of 2 circuits out of 5 were chosen. Each of these combinations of 2 circuits was implemented using both the conventional DPR flow and our novel approach using StaticRoute.
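As a small illustration of this experimental set-up, the 10 pairs per benchmark set can be enumerated as follows. This is only a sketch: the circuit names are those of the MCNC set in Table 1, and the variable names are chosen for illustration.

```python
from itertools import combinations

# Circuit names of the MCNC set, as listed in Table 1.
mcnc = ["e64", "rd73", "s400", "s1238", "s1494"]

# Every unordered pair of distinct circuits is one experiment.
pairs = list(combinations(mcnc, 2))

for first, second in pairs:
    print(first, second)
print(len(pairs), "combinations")
```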
Table 2

Size of the LUT circuits used in the experiments

| Benchmark set | Minimum | Average | Maximum |
| --- | --- | --- | --- |
| RegExp | 500 | 516 | 543 |
| FIR | 235 | 302 | 371 |
| MCNC | 264 | 310 | 404 |
| MCNC20 | 1,135 | 1,323 | 1,544 |

6.1.2 FPGA architecture

In these first experiments StaticRoute was implemented on top of our Java version of the VPR (Versatile Place and Route) wire-length driven router [10]. VPR is the most commonly used academic tool for place and route algorithms in FPGAs [10]. The FPGA architecture used is described in 4lut_sanitized.arch. This is an FPGA architecture file included in the distribution of VPR (version 4.30). It is a simple architecture that has logic blocks containing one 4-LUT and one flip-flop, and the wire segments in the interconnection network only span one logic block. The minimum square area of the FPGA was chosen such that it fits both circuits. Since there is no other functionality implemented on the FPGA, the reconfigurable region comprises the complete FPGA. The channel width in these experiments was chosen 50 % bigger than the minimum needed. Two modifications were made to the routing of this architecture to better resemble commercially available FPGAs: Wilton switch blocks and unidirectional wires are used instead of disjoint switch blocks and bidirectional wires [28].

6.1.3 Results

Two metrics were used to evaluate the quality of the implementation: reconfiguration time and wire length. The reconfiguration time gives an indication on how fast the system can adapt when necessary. Wire length is an important metric for the quality of a circuit, since it correlates with power usage and performance (maximum clock frequency) of a circuit [10]. We focus on the effect the relative size of the static portion of the configuration memory has. We average the results over the implemented circuits and use error bars to indicate minimum and maximum values.

Reconfiguration time The wires in an island style FPGA are organized as channels in between the logic blocks (LBs). The switches that connect the wires and the logic block pins are aggregated as connection blocks (CBs) and switch blocks (SBs). The connection blocks connect the logic block pins to the wires in their neighbouring channel while the switch blocks connect the wires from one channel to wires from an adjacent channel.

Marking the static bits in the routing infrastructure was done based on the switch blocks. For each set of benchmarks, we compare the cases where 50 and 75 % of the switch blocks were marked static. This is done in such a way that the selected switch blocks are spread uniformly over the FPGA’s area. The connection blocks in the routing were all kept dynamic.

In this experiment we look at the total reconfiguration time. For conventional DPR this is the sum of the LUT bits and the bits that control the switches. For our novel approach, that uses StaticRoute, we use the same, but don’t count the routing bits that reside in the static portion of the configuration memory. As explained earlier, the bits in the static portion are shared by all circuits and thus don’t need to be rewritten during run-time.

In Fig. 10 the relative decrease in reconfiguration time is shown compared to the conventional DPR flow. Static switch blocks do not need to be reconfigured during run-time. The decrease in reconfiguration time is therefore directly proportional to the relative portion of static switch blocks. It is approximately the same for all benchmarks considered. On average, it ranges from 40 % when half of the switch blocks are static to 50 % when 75 % of the switch blocks are static. As the CLBs were all kept dynamic, this total reduction of reconfiguration time comes entirely from the reduction in routing reconfiguration time (in this section only the switch blocks contribute, since the connection blocks were kept dynamic). The reduction in routing reconfiguration time was, on average, 44 and 56 % for the cases of 50 and 75 % static switch blocks, respectively.
Fig. 10

Decrease in reconfiguration time compared to conventional DPR

Wire length In our proposed tool flow the different circuits are not implemented separately, as is the case in the conventional DPR flow. Instead, the circuits are routed together using StaticRoute. In this section we assess the impact this has on the wire length. Each circuit uses a set of wires when it is active. We compare the size of this set in the case of implementation with the conventional DPR flow and StaticRoute. This is then averaged over all circuits.

The results are shown in Fig. 11. Again, the relative increase in wire length depends on the relative size of the static portion. When 50 % of the switch blocks are marked static, the wire length increases on average by a few percent and the maxima are around 5 %. For some benchmarks the wire length even decreases a little when using StaticRoute. This is because both pathfinder and StaticRoute are heuristics. For 75 % static switch blocks the average increases somewhat, especially for the regular expression applications, and the maxima increase to around 10 %. We can conclude that the wire length increase caused by this technique, when implementing 2 circuits, is limited.
Fig. 11

Wire length increase of StaticRoute compared to conventional DPR

The experiments above were done in our Java-based version of VPR, which does not have a timing analyser. This is why in these initial experiments we looked at the total wire length of the circuits to get an indication of the impact on performance. Because of the promising initial results, we decided to implement StaticRoute in the VTR framework, which enables more thorough experiments. In this framework an architecture file based on a commercial Altera Stratix IV FPGA is available, as well as a timing analyser that can extract the maximum clock frequency of the circuit. These new, more thorough experiments are described in the next two sections. The first section explores how the static part is best chosen in the configuration memory of the routing. How the impact on reconfiguration time and performance scales as the number of circuits increases is discussed in Sect. 7.

6.2 Thorough exploration on a realistic 6-LUT architecture

In the previous section we only considered the cases where 50 % and 75 % of the switch blocks were marked static in a simple 4-LUT based architecture. Marking connection blocks as static was not considered at all. In this section we explore more thoroughly how the static part in the configuration memory of the routing is best selected. We consider not only the switch blocks, but also the connection blocks, in a more complex 6-LUT based architecture. We do this exploration by looking at two important quality metrics of the implementation: reconfiguration time and maximum attainable clock frequency of the circuits. The reconfiguration time gives an indication of how fast one can switch between the different implemented circuits. The maximum clock frequency is important for the performance during the operation of a circuit.

We point out that both our tool flow and the conventional DPR flow have the same gains in area compared to a static implementation of the circuits on the FPGA. Instead of implementing the different circuits next to each other on the FPGA, only an FPGA region that can contain the biggest circuit is needed.

6.2.1 Experimental set-up

Integration in the VTR framework The novel tool flow, depicted in Fig. 3b, was implemented in the latest version of the VTR (Verilog-To-Routing) framework [41], which incorporates the latest 6.0 version of VPR (Versatile Place and Route). Within this framework, StaticRoute was based on the wire-length driven router of VPR 6.0 [10, 32].

In the section above we described a new routing methodology for DPR of FPGAs, which could be integrated in any FPGA router. We chose to build our novel tool flow into the VTR framework, because this has several advantages. First, the built-in verification algorithms can be re-used for our tool flow. VTR for example checks that a routing describes a properly connected tree for each net and that this tree connects all the pins used by that net. It also checks that no routing resources are overused (the occupancy of everything is recomputed from scratch) [41]. Second, VTR has a built-in timing analyzer and the provided architecture file contains timing information. Therefore it is possible to get representative timing information. For each implementation of a circuit it is possible to get the maximum attainable clock frequency. Finally, because the FPGA architecture is represented in a standard XML format, it is possible to try out different architectures more easily.

We chose to do the experiments in VTR primarily because a built-in timing analyser is available that makes it possible to extract the maximum attainable clock frequency of an implemented circuit. An additional benefit of the VTR framework is that an architecture file is available, which is strongly based on a commercially available FPGA, namely the Altera Stratix IV FPGA [8]. We use this FPGA architecture to conduct our experiments.

Another alternative, the Rapidsmith framework, makes it possible to implement the circuits on a commercial (Xilinx) FPGA, but in this framework timing information is not provided [27]. It would therefore be impossible to investigate the impact of our technique on the maximum clock frequency of the circuits.

FPGA architecture The architecture based on the Altera Stratix IV is described in sample_arch.xml and included in the distribution of VTR. This is the architecture we use in our experiments. It has configurable logic blocks (CLBs) containing 10 6-LUTs and the wire segments in the interconnection network span four logic blocks. It uses Wilton switch blocks [35] and unidirectional wires [28]. Altera’s Chip Planner Tool was used to determine the channel width of the Altera Stratix IV [7], which is 248. This way the experiments are carried out as close as possible to the actual FPGA chip.

We stress that the techniques and tools we use in this paper are independent of the FPGA architecture used. The number of inputs of the LUTs is simply an input parameter of the tool flow. Different routing architectures can also be used, since StaticRoute uses a straightforward extension of a standard representation of the routing infrastructure called the routing resource graph. Because of this it would also be possible to target Xilinx FPGAs. In Sect. 6.1 results are presented for a simpler 4-LUT based architecture, which contains one LUT per CLB and wires that span only one logic block.

Since there is no other functionality implemented on the FPGA, the reconfigurable region comprises the complete FPGA in our experiments. The minimum square area of the FPGA was chosen to fit all circuits under consideration.

Choice of static configuration frames Unfortunately, detailed information on how the frames in the configuration memory are built up is not provided by the FPGA manufacturers. This is considered proprietary information and is not disclosed. How the configuration memory is organized in frames also differs between FPGAs. Given the limitations of physical design of the FPGA, it is however likely that internal structures which are close together in the FPGA architecture will be controlled by memory cells which are close together in the configuration memory. For the sake of the experiments, we therefore assume that the frames coincide with the connection blocks (CBs) and the switch blocks (SBs). An SB frame in our architecture contains around 1300 bits, which is approximately the same as the frame size of a commercially available FPGA, for example the Virtex V FPGA from Xilinx, which has a frame size of 1312 bits [52]. A CB frame in our architecture is of the same order of magnitude, but considerably smaller, namely around 600 bits. An important reason to choose the frames as such is that this way we are able to research the interaction between these fundamental routing structures and our novel method. Among other things, we can investigate which of these structures is most suitable to make static and which is better kept dynamic.

To investigate the impact of marking routing blocks static, we compare the cases where the fraction of both the static SBs and static CBs varies between 0 and 75 % in steps of 25 %. This gives us 16 cases organized in a two-dimensional table as can be seen in Table 4. In doing so, we can examine how these structures are best marked static. Figure 12 shows an example where 50 % of the switch blocks are marked static. As can be seen, this is done in such a way that the selected blocks are spread uniformly over the FPGA’s area. The cases where either 100 % of the connection blocks or switch blocks are marked static are omitted from the table because StaticRoute was not able to find a solution for these cases. This already indicates that some dynamic flexibility is needed in both the switch and connection blocks.
Fig. 12

An example of a 3\(\times \)3 island style FPGA where 50 % of the switch blocks are marked static (in grey)

Benchmarks Each of the cases above represents a relative portion of the CBs and SBs marked static. For each case we conducted 20 experiments. In each of these experiments 2 circuits were randomly chosen out of the 20 largest circuits of the MCNC benchmark suite [53]. The circuits were chosen randomly to demonstrate that our novel tool flow maintains the possibility to implement completely unrelated circuits in the DPR region. These benchmarks can also be found in the ’benchmarks/blif’ folder of the VTR framework. Their size varies from around 1000 6-LUTs to around 5000 6-LUTs. Because we are considering dynamic partial reconfiguration, we are only looking at the reconfiguration of part of the FPGA. We have therefore purposely chosen moderately sized circuits. The chosen circuits for each experiment are presented in Tables 5, 9 and 10.

6.2.2 Results

Impact on reconfiguration time For both conventional DPR and our new technique the bit streams necessary for configuration are computed off-line. This is done by running the tool flows shown in Fig. 3. During run-time the configuration bit streams only need to be downloaded to the FPGA’s configuration memory when a switch between circuits is required. As the configuration memory of the FPGA is written at constant bandwidth, we assume the reconfiguration time is proportional to the number of configuration bits in the bit streams. Configuration bit streams also include instructions, but these only take up a negligible part. To obtain the reduction of reconfiguration time in our experiments we therefore calculate the size of the configuration bit stream of the reconfigurable region, for both the conventional DPR case and our technique.

The configuration bit stream in the case of conventional DPR reconfigures all the routing and configurable logic blocks (CLBs) in the reconfigurable region (RR). Its size is therefore given by:
$$\begin{aligned} B_{conv} = {B^{CLB}_{conv}} + {B^{R}_{conv}} \end{aligned}$$
(12)
where \({B^{CLB}_{conv}}\) and \( {B^{R}_{conv}}\) are the number of configuration bits in the reconfigurable region that control the CLBs and the routing, respectively.
$$\begin{aligned} {B^{CLB}_{conv}} = {B^{CLB}_{total}} = L_{CLB} *64+\underset{m \in M_{CLB}}{\sum } C_m \end{aligned}$$
(13)
where \(L_{CLB}\) is the total number of LUTs contained in the CLBs of the RR, \(M_{CLB}\) is the set of multiplexers found in the cross bar switches of the CLBs and \(C_m\) is the number of configuration bits needed to control one multiplexer. The number of LUTs is multiplied by 64 as our architecture contains LUTs with 6 inputs. We note that in our technique it does not matter how the CLB bits are divided into frames, as these are always completely overwritten during run-time. \(C_m\) is given by
$$\begin{aligned} C_m = 2* \lceil \sqrt{I_m} \rceil \end{aligned}$$
(14)
where \(I_m\) is the number of inputs of the routing multiplexer. The ceiling function takes input counts that are not perfect squares into consideration. This formula is used because all multiplexers are considered to be two-level multiplexers, as is the case in current commercially available FPGAs [29, 46]. An example of a two-level 16:1 multiplexer is shown in Fig. 13a. This is an efficient implementation, as the multiplexers of the first level share control bits and hence fewer SRAM cells are required. For the implementation of a multiplexer in one level we also assume the use of pass transistors and one hot encoding [46]. An example is shown in Fig. 13b for a 4:1 multiplexer.
Fig. 13

a An example of a two-level 16:1 multiplexer controlled by 8 SRAM bits. b Implementation of a 4:1 multiplexer using pass-transistors and one hot encoding
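Equations 13 and 14 can be evaluated with a short Python sketch. The function names are chosen for illustration, and the CLB in the usage example (10 LUTs and two 16-input crossbar multiplexers) is a made-up configuration, not the CLB of the architecture used in the experiments.

```python
import math
from typing import Iterable


def mux_config_bits(num_inputs: int) -> int:
    """Eq. 14: C_m = 2 * ceil(sqrt(I_m)) for a two-level multiplexer."""
    root = math.isqrt(num_inputs)     # floor of the square root, exact for integers
    if root * root < num_inputs:      # not a perfect square: round up
        root += 1
    return 2 * root


def clb_config_bits(num_luts: int, mux_inputs: Iterable[int], lut_inputs: int = 6) -> int:
    """Eq. 13: 2**k bits per k-input LUT plus the bits of the crossbar multiplexers."""
    return num_luts * 2 ** lut_inputs + sum(mux_config_bits(i) for i in mux_inputs)


print(mux_config_bits(16))            # the 16:1 multiplexer of Fig. 13a
print(mux_config_bits(10))            # a non-square input count is rounded up
print(clb_config_bits(10, [16, 16]))  # hypothetical CLB, for illustration only
```

For the 16:1 multiplexer the sketch returns 8 bits, which matches the 8 SRAM bits stated in the caption of Fig. 13a.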

The number of routing configuration bits that needs to be reconfigured using conventional DPR is given by
$$\begin{aligned} {B^{R}_{conv}} = \underset{f \in F}{\sum } b_f \end{aligned}$$
(15)
where \(F\) is the set of routing configuration frames of the RR that contain at least one dynamic bit, \(b_f\) is the number of bits per routing frame given by Eq. 16. We stress that \(F\) only includes routing frames that contain at least one dynamic bit. If a frame contains only static bits it is not counted in the configuration size.
$$\begin{aligned} b_f = \underset{m \in M_f}{\sum } C_m \end{aligned}$$
(16)
where \(M_f\) is the set of multiplexers that is controlled by the bits in frame \(f\).
In our novel technique we can also divide the configuration into a part for the CLBs and a part for the routing:
$$\begin{aligned}&\displaystyle B_{new} = {B^{CLB}_{new}} + {B^{R}_{new}}\nonumber \\&\displaystyle {B^{CLB}_{new}} = {B^{CLB}_{total}} \end{aligned}$$
(17)
As mentioned above in both the case of conventional DPR and our novel technique all CLBs in the reconfigurable region are rewritten. That is why \({B^{CLB}_{new}}\) also equals \({B^{CLB}_{total}}\) and is thus also given by Eq. 13.

However, in contrast with conventional DPR, we divide the routing frames of the RR into a set of static frames \(F_S\) and a set of dynamic frames \(F_D\), with \(F_S \bigcup F_D = F\). A reduction of reconfiguration time is achieved in our case because only the dynamic part of the configuration memory needs reconfiguration.

The content of the static frames is the same for all implemented circuits and therefore never needs to be rewritten in the configuration memory during run-time. Again, only routing frames that contain at least one dynamic bit are included in \(F_D\). The size of the bit stream in our technique hence is
$$\begin{aligned} {B^{R}_{new}} = \underset{f \in F_D}{\sum } b_f \end{aligned}$$
(18)
where \(b_f\) is the number of bits per routing frame and is calculated the same as above (i.e., using Eq. 16). Finally, the reduction of reconfiguration time (RRT) is calculated as
$$\begin{aligned} RRT = -\frac{(B_{conv}-B_{new})}{ B_{conv}} = \frac{B_{new}}{B_{conv}}-1 \end{aligned}$$
(19)
This way a reduction of reconfiguration time results in a negative value.
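The frame selection behind Eqs. 15 and 18 can be sketched in Python. The frame names, the bit strings and the two toy routings below are invented for illustration. The sketch treats a frame as dynamic when its contents differ between circuits and ignores don't-care values.

```python
from typing import Dict, FrozenSet, Tuple

Frames = Dict[str, Tuple[str, ...]]  # frame id -> bit string of the frame per circuit


def frame_is_dynamic(contents_per_circuit: Tuple[str, ...]) -> bool:
    """A frame holds at least one dynamic bit if its contents differ between circuits."""
    return len(set(contents_per_circuit)) > 1


def routing_bits_to_reconfigure(frames: Frames, static_frames: FrozenSet[str] = frozenset()) -> int:
    """Eq. 15 when static_frames is empty, Eq. 18 otherwise."""
    total = 0
    for frame_id, contents in frames.items():
        if frame_id in static_frames:
            continue                     # shared by all circuits, never rewritten
        if frame_is_dynamic(contents):
            total += len(contents[0])    # b_f: number of bits in this frame
    return total


# Toy routing of two circuits with the conventional flow (each routed separately).
conventional = {
    "sb0": ("1010", "0110"),
    "sb1": ("1100", "0011"),
    "cb0": ("0000", "0000"),  # happens to be identical, so it is not counted
    "cb1": ("1111", "0101"),
}
# Toy routing with static frames sb0 and cb0: their contents are forced to be equal.
with_static = {
    "sb0": ("1010", "1010"),
    "sb1": ("1100", "0111"),
    "cb0": ("0000", "0000"),
    "cb1": ("1111", "0101"),
}

b_r_conv = routing_bits_to_reconfigure(conventional)
b_r_new = routing_bits_to_reconfigure(with_static, frozenset({"sb0", "cb0"}))
print(b_r_conv, b_r_new)
```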
In Table 3 the absolute values obtained in the experiments with 2 circuits (case with 50 % static SBs and 50 % static CBs) are presented for \(B^{CLB}_{total}\), \(B^{R}_{total}\), \(B^{R}_{conv}\), \(B_{conv}\), \(B^{R}_{new}\) and \(B_{new}\). A column containing the fraction \(\frac{B^{R}_{conv}}{B^{R}_{total}}\) was added. \({B^{R}_{total}}\) is the total number of configuration bits present in the configuration memory of the routing of the reconfigurable region, regardless of whether the bits are static or dynamic. This column clearly shows that, in the case of conventional DPR, the dynamic bits are scattered over, on average, 94 % of the frames of the reconfigurable region. Only 6 % of the frames happen to have the same bit values for the different circuits. These are mostly frames that are used by neither circuit.
Table 3

Absolute values of the number of configuration bits in the reconfigurable region for the experiments with 2 circuits (case with 50 % static SBs and 50 % static CBs)

| Experiment | \(B^{CLB}_{total}\) (bits) | \(B^{R}_{total}\) (bits) | \(B^{R}_{conv}\) (bits) | \(\frac{B^{R}_{conv}}{B^{R}_{total}}\) (%) | \(B_{conv}\) (bits) | \(B^{R}_{new}\) (bits) | \(B_{new}\) (bits) | \(RRT\) (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 135520 | 308800 | 289404 | 94 | 424924 | 146198 | 281718 | \(-\)34 |
| 1 | 542080 | 1229004 | 1192694 | 97 | 1734774 | 596126 | 1138206 | \(-\)34 |
| 2 | 542080 | 1229004 | 1183194 | 96 | 1725274 | 591250 | 1133330 | \(-\)34 |
| 3 | 362880 | 834968 | 813972 | 97 | 1176852 | 407024 | 769904 | \(-\)35 |
| 4 | 219520 | 505660 | 462724 | 92 | 682244 | 231890 | 451410 | \(-\)34 |
| 5 | 323680 | 745342 | 729990 | 98 | 1053670 | 366768 | 690448 | \(-\)34 |
| 6 | 592480 | 1339474 | 1253246 | 94 | 1845726 | 630004 | 1222484 | \(-\)34 |
| 7 | 161280 | 377924 | 354840 | 94 | 516120 | 177130 | 338410 | \(-\)34 |
| 8 | 592480 | 1339474 | 1255156 | 94 | 1847636 | 630736 | 1223216 | \(-\)34 |
| 9 | 448000 | 1023724 | 975330 | 95 | 1423330 | 484592 | 932592 | \(-\)34 |
| 10 | 592480 | 1339474 | 1294606 | 97 | 1887086 | 649386 | 1241866 | \(-\)34 |
| 11 | 219520 | 505660 | 463256 | 92 | 682776 | 233322 | 452842 | \(-\)34 |
| 12 | 189280 | 439182 | 392408 | 89 | 581688 | 196582 | 385862 | \(-\)34 |
| 13 | 362880 | 834968 | 790432 | 95 | 1153312 | 392156 | 755036 | \(-\)35 |
| 14 | 542080 | 1229004 | 1186302 | 97 | 1728382 | 593142 | 1135222 | \(-\)34 |
| 15 | 493920 | 1123754 | 1043826 | 93 | 1537746 | 524450 | 1018370 | \(-\)34 |
| 16 | 286720 | 663204 | 567530 | 86 | 854250 | 288734 | 575454 | \(-\)33 |
| 17 | 189280 | 439182 | 389976 | 89 | 579256 | 197074 | 386354 | \(-\)33 |
| 18 | 362880 | 834968 | 797634 | 96 | 1160514 | 397902 | 760782 | \(-\)34 |
| 19 | 493920 | 1123754 | 1042322 | 93 | 1536242 | 523820 | 1017740 | \(-\)34 |
| Avg |  |  |  | 94 |  |  |  | \(-\)34 |

Table 3 corresponds to the case where 50 % of the connection blocks and 50 % of the switch blocks are marked static. The dynamic routing bits are clustered in the remaining dynamic part of the reconfigurable region, resulting in a reduction of routing reconfiguration time of 50 %. In Table 3 we see that this corresponds to a reduction of total reconfiguration time (RRT) of 34 %.
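As a worked check of Eqs. 12, 17 and 19, the following Python sketch recomputes the RRT of experiment 0 from the bit counts listed in Table 3. The function and variable names are chosen for illustration.

```python
def rrt(b_clb: int, b_r_conv: int, b_r_new: int) -> float:
    """Reduction of reconfiguration time; negative values mean a reduction."""
    b_conv = b_clb + b_r_conv      # Eq. 12
    b_new = b_clb + b_r_new        # Eq. 17, with the same CLB bits as Eq. 13
    return b_new / b_conv - 1      # Eq. 19


# Experiment 0 of Table 3: CLB bits, routing bits (conventional), routing bits (new).
total_rrt = rrt(135520, 289404, 146198)
# Routing only: the same formula with the CLB contribution left out.
routing_rrt = rrt(0, 289404, 146198)

print(round(100 * total_rrt))      # -34, as listed in the RRT column
print(round(100 * routing_rrt, 1))
```

The routing-only value comes out close to one half, in line with the 50 % of the routing frames that are marked static in this case.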

The RRT columns in Table 4 present the reduction of reconfiguration time for all 16 cases considered in the experiments. Each value is the average over the 20 experiments conducted for that case.
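Table 4

Reduction of reconfiguration time (RRT) and average decrease in maximum clock frequency (Clk), in % relative to conventional DPR

| % static SB | 0 % static CB: RRT | 0 % static CB: Clk | 25 % static CB: RRT | 25 % static CB: Clk | 50 % static CB: RRT | 50 % static CB: Clk | 75 % static CB: RRT | 75 % static CB: Clk |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 % | 0 | 1 | \(-\)7 | \(-\)2 | \(-\)14 | \(-\)3 | \(-\)22 | \(-\)8 |
| 25 % | \(-\)10 | \(-\)1 | \(-\)18 | \(-\)3 | \(-\)24 | \(-\)5 | \(-\)32 | \(-\)9 |
| 50 % | \(-\)20 | \(-\)4 | \(-\)27 | \(-\)5 | \(-\)34 | \(-\)6 | \(-\)41 | \(-\)15 |
| 75 % | \(-\)29 | \(-\)9 | \(-\)37 | \(-\)11 | \(-\)44 | \(-\)18 | \(-\)51 | \(-\)21 |

All values are in %.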

In the formulas above we see that, as the size of the static part increases, the size of the dynamic part decreases and this results in a larger reduction of reconfiguration time.

Note that in our technique all CLBs are rewritten completely at run-time. The actual reduction therefore depends mostly on the relative size of the configuration memory dedicated to the different components of the reconfigurable fabric, namely the CLBs, the connection blocks and the switch blocks. For the FPGA architecture we use, 30 % of the total number of configuration bits is used for CLBs, 38 % for switch blocks and 32 % for connection blocks. If, for example, none of the switch blocks and all connection blocks are marked static, this results in a reduction of the reconfiguration time of about 32 %. Marking 0 % of the switch blocks and only 50 % of the connection blocks static gives an expected reduction of around 16 %; the reduction actually measured is 14 %, as can be seen in Table 4.
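The estimate in the paragraph above can be written as a weighted sum of the configuration-memory shares. The following Python sketch is a first-order model only: it assumes that the conventional flow rewrites every bit of the region, which is why it slightly overestimates the measured values in Table 4 (for instance 16 % estimated against 14 % measured). The constant and function names are hypothetical.

```python
# Share of the configuration bits per component, as stated in the text.
SHARE_CLB = 0.30  # always rewritten, so it never contributes to the reduction
SHARE_SB = 0.38   # switch blocks
SHARE_CB = 0.32   # connection blocks


def estimated_rrt(static_sb: float, static_cb: float) -> float:
    """First-order estimate of the reduction in reconfiguration time, in percent.

    static_sb and static_cb are the fractions (0.0 to 1.0) of switch blocks and
    connection blocks marked static.
    """
    return 100.0 * (static_sb * SHARE_SB + static_cb * SHARE_CB)


for sb, cb in [(0.0, 1.0), (0.0, 0.5), (0.5, 0.5)]:
    print(f"static SB {sb:.0%}, static CB {cb:.0%}: about {estimated_rrt(sb, cb):.0f} %")
```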

Compared to the results in Sect. 6.1, the reduction in reconfiguration time in the new, more complex 6-LUT architecture is somewhat smaller than in the simpler 4-LUT architecture. This is logical since the 6-LUT architecture, based on the commercially available Altera Stratix IV FPGA, contains more complex CLBs and has a higher CLB to routing ratio. In the 4-LUT architecture the reduction in reconfiguration time (RRT) for the MCNC20 benchmarks is around 30 % when 50 % of the switch blocks are marked static and 40 % when 75 % are marked static. For the 6-LUT architecture this is 20 and 29 %, respectively, as can be seen in Table 4. Note that in both cases 0 % of the connection blocks are marked static.

Impact on maximum clock frequency

In our proposed tool flow the different circuits are not implemented separately, as is the case in the conventional DPR flow. Instead, the circuits are routed simultaneously using StaticRoute. In this section we assess the impact this has on the maximum attainable clock frequency of the circuits. For each circuit we compare the implementation with the conventional DPR flow to the one resulting from using StaticRoute.

The results depend on how the connection and switch blocks are marked static. The decreases in maximum clock frequency for the 16 cases considered in the experiments are shown in the \(Clk\) columns of Table 4. Each value is an average over the 20 experiments that were carried out. The actual clock frequencies of the experiments for the case of 50 % static SBs and 50 % static CBs are presented in Table 5. The \(F_{max}\) columns show the clock frequencies for the implementations that use the conventional DPR tool flow. Clock frequencies obtained in our new tool flow using StaticRoute are denoted \(F_{max}^S\).
Table 5

Absolute values of the maximum attainable clock frequencies for the experiments with 2 circuits (case with 50 % static SBs and 50 % static CBs)

| Experiment | Circuit 0 | Circuit 1 | \(F_{max}\) 0 (MHz) | \(F_{max}\) 1 (MHz) | \(F_{max}^S\) 0 (MHz) | \(F_{max}^S\) 1 (MHz) | Diff. 0 (%) | Diff. 1 (%) |
|---|---|---|---|---|---|---|---|---|
| 0 | s298 | ex5p | 168 | 254 | 156 | 231 | \(-\)7 | \(-\)9 |
| 1 | s38584.1 | ex1010 | 243 | 169 | 224 | 158 | \(-\)8 | \(-\)7 |
| 2 | misex3 | s38584.1 | 219 | 243 | 211 | 231 | \(-\)4 | \(-\)5 |
| 3 | frisc | dsip | 119 | 320 | 116 | 285 | \(-\)3 | \(-\)11 |
| 4 | diffeq | bigkey | 209 | 288 | 195 | 271 | \(-\)7 | \(-\)6 |
| 5 | elliptic | ex5p | 140 | 258 | 136 | 230 | \(-\)3 | \(-\)11 |
| 6 | s38417 | diffeq | 191 | 215 | 181 | 195 | \(-\)5 | \(-\)9 |
| 7 | alu4 | ex5p | 220 | 259 | 210 | 241 | \(-\)5 | \(-\)7 |
| 8 | clma | tseng | 139 | 215 | 131 | 196 | \(-\)6 | \(-\)9 |
| 9 | apex4 | ex1010 | 230 | 167 | 203 | 155 | \(-\)12 | \(-\)7 |
| 10 | clma | des | 139 | 278 | 134 | 254 | \(-\)4 | \(-\)9 |
| 11 | bigkey | alu4 | 288 | 237 | 272 | 220 | \(-\)6 | \(-\)7 |
| 12 | alu4 | seq | 230 | 222 | 211 | 198 | \(-\)8 | \(-\)11 |
| 13 | spla | tseng | 166 | 219 | 154 | 211 | \(-\)7 | \(-\)4 |
| 14 | diffeq | s38584.1 | 209 | 243 | 203 | 233 | \(-\)3 | \(-\)4 |
| 15 | pdc | apex2 | 151 | 198 | 145 | 174 | \(-\)4 | \(-\)12 |
| 16 | des | apex2 | 314 | 185 | 284 | 178 | \(-\)10 | \(-\)4 |
| 17 | s298 | seq | 151 | 225 | 147 | 211 | \(-\)3 | \(-\)6 |
| 18 | dsip | spla | 320 | 166 | 292 | 150 | \(-\)9 | \(-\)10 |
| 19 | pdc | misex3 | 161 | 222 | 156 | 213 | \(-\)3 | \(-\)4 |

Avg: \(F_{max}\) 215 MHz, \(F_{max}^S\) 200 MHz, Diff. \(-\)6 %
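The Diff. columns of Table 5 are consistent with the relative change of \(F_{max}^S\) with respect to \(F_{max}\). The Python sketch below recomputes the two entries of experiment 0 under that assumption. The function name is illustrative, and because the tabulated frequencies are already rounded to whole MHz, a recomputed value for some other rows could differ by one percentage point from the printed one.

```python
def diff_percent(f_conv_mhz: float, f_static_mhz: float) -> float:
    """Relative change in maximum clock frequency, in percent.

    f_conv_mhz is F_max (conventional DPR flow) and f_static_mhz is F_max^S
    (StaticRoute). A negative result means StaticRoute is slower.
    """
    return 100.0 * (f_static_mhz - f_conv_mhz) / f_conv_mhz


# Experiment 0 of Table 5: s298 (168 -> 156 MHz) and ex5p (254 -> 231 MHz)
for name, f_conv, f_static in [("s298", 168, 156), ("ex5p", 254, 231)]:
    print(name, round(diff_percent(f_conv, f_static)))
```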

Figure 14 shows the values of Table 4 in a scatter plot. The Pareto-optimal points are marked in black. Again we see that the relative decrease in maximum clock frequency depends on the relative size of the static part: as the static part grows, the reconfiguration time decreases and the maximum clock frequency decreases as well. Up to a decrease in reconfiguration time of around 34 %, the impact on maximum clock frequency remains small. Beyond this value, the maximum clock frequency decreases faster.
Fig. 14

Trade-off between reconfiguration time and maximum clock frequency

We can also see in Table 4 that it is better to spread the static part over the connection blocks and switch blocks. When, for example, 75 % of the connection blocks are marked static then the decrease in maximum clock frequency is on average 8 %. If, however, 50 % of the connection blocks and 50 % of the switch blocks are marked static, then the average decrease is smaller, namely 6 %.

For this last case the decrease in reconfiguration time is even higher, as shown in Table 4. In Fig. 14 we can also see that an implementation is more likely to be Pareto-optimal if the static part is spread more evenly over the SBs and CBs.
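To make the notion of Pareto-optimality concrete, the Python sketch below filters the 16 (RRT, Clk) pairs of Table 4. It assumes the usual definition: a case is dominated when another case reduces reconfiguration time at least as much while losing no more clock frequency. Whether this selects exactly the points marked black in Fig. 14 cannot be confirmed from the text, so the result should be read as an illustration.

```python
# (static SB %, static CB %) -> (RRT %, Clk %), copied from Table 4.
TABLE4 = {
    (0, 0): (0, 1), (0, 25): (-7, -2), (0, 50): (-14, -3), (0, 75): (-22, -8),
    (25, 0): (-10, -1), (25, 25): (-18, -3), (25, 50): (-24, -5), (25, 75): (-32, -9),
    (50, 0): (-20, -4), (50, 25): (-27, -5), (50, 50): (-34, -6), (50, 75): (-41, -15),
    (75, 0): (-29, -9), (75, 25): (-37, -11), (75, 50): (-44, -18), (75, 75): (-51, -21),
}


def dominates(a, b):
    """True if a is at least as good as b on both axes and not identical to b.

    A more negative RRT is better (larger time reduction); a higher Clk is
    better (smaller frequency loss).
    """
    return a[0] <= b[0] and a[1] >= b[1] and a != b


def pareto_front(points):
    """Return the keys whose (RRT, Clk) pair is not dominated by any other pair."""
    values = list(points.values())
    return sorted(k for k, v in points.items()
                  if not any(dominates(w, v) for w in values))


for sb, cb in pareto_front(TABLE4):
    rrt, clk = TABLE4[(sb, cb)]
    print(f"SB {sb:2d} %, CB {cb:2d} %: RRT {rrt} %, Clk {clk} %")
```

Under this definition the 50 % SB / 50 % CB case is on the front, while the 75 % CB-only case is dominated by it, in line with the observation that spreading the static part is preferable.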

The trends discussed above are even more pronounced for the maximum decreases reported in Table 6. The minimum decreases are also reported in Table 6. When the static part is small or zero, some small positive values appear. This means that in some cases StaticRoute achieved a slightly better maximum clock frequency than the Pathfinder algorithm. This is possible because both algorithms are heuristics.
Table 6

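Maximum and minimum decrease in maximum clock frequency, in % relative to conventional DPR. Rows give the percentage of static SBs, column groups the percentage of static CBs.

| % static SB | Max (0 % CB) | Min (0 % CB) | Max (25 % CB) | Min (25 % CB) | Max (50 % CB) | Min (50 % CB) | Max (75 % CB) | Min (75 % CB) |
|---|---|---|---|---|---|---|---|---|
| 0 % | \(-\)2 | 3 | \(-\)3 | 1 | \(-\)7 | \(-\)1 | \(-\)13 | \(-\)5 |
| 25 % | \(-\)4 | 2 | \(-\)5 | \(-\)1 | \(-\)9 | \(-\)4 | \(-\)18 | \(-\)6 |
| 50 % | \(-\)9 | 0 | \(-\)10 | \(-\)2 | \(-\)12 | \(-\)3 | \(-\)24 | \(-\)8 |
| 75 % | \(-\)14 | \(-\)4 | \(-\)19 | \(-\)6 | \(-\)31 | \(-\)9 | \(-\)37 | \(-\)8 |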

If the static part in the CBs or SBs is not chosen higher than 50 %, then the decrease in maximum clock frequency is not higher than 6 % on average and at most 12 %. If only the SBs or only the CBs have a static part of 75 %, the decrease is 10 % on average and at most 15 %. If both CBs and SBs have a static part of 75 %, the decrease is 20 % on average and at most 37 %. As mentioned earlier, StaticRoute was not able to find a solution when all the CBs and/or SBs were completely marked static.

We can conclude that, for 2 circuits, the impact on the maximum clock frequency is limited if the static part in the SBs or CBs is not higher than 50 %. It is also better to spread the static part evenly over the CBs and SBs. This way the same reduction of reconfiguration time is achieved (or more), while the impact on the maximum clock frequency is less significant.

The results obtained here differ from, but are similar to, those for the simpler 4-LUT architecture in Sect. 6.1. Remember that in Sect. 6.1 we looked at the increase in total wire length of the circuit, which is only a proxy for performance, whereas in this section we discussed the actual decrease in maximum clock frequency, as can be seen in Tables 4 and 6. Also, for the 4-LUT architecture only 10 experiments were done using 5 of the 20 MCNC20 benchmarks, as opposed to 20 experiments using all 20 MCNC20 benchmarks for the 6-LUT architecture. The increase in wire length in the 4-LUT architecture is on average 2 % for both the cases of 50 and 75 % static switch blocks (see Fig. 11). For the more complex 6-LUT architecture the decrease in maximum clock frequency is somewhat higher, at 4 and 9 %, respectively. The maximum increase in wire length and the maximum decrease in clock frequency are more similar: for the 4-LUT architecture they are 5 and 14 %, and for the 6-LUT architecture 9 and 14 %, respectively.

It is difficult to say what the actual impact of this decrease in maximum clock frequency is on the application, as this depends on the type of application. Some applications do not run at their maximum performance because system requirements are not that stringent. Also, since FPGAs are often used for parallel applications, these sometimes rely more on massive parallelism than on high clock frequencies for performance.

7 Implementing more than 2 circuits

The previous section dealt with how the static part is best selected in the configuration memory. Only implementations that time-multiplexed 2 circuits on the same reconfigurable region were considered. In this section we look at the overhead when our new flow using StaticRoute implements more than 2 circuits in the same reconfigurable region.

7.1 Experimental set-up

The FPGA architecture used in this section is the same as the one described in Sect. 6.2.1.

Again, we randomly choose N circuits out of the 20 largest MCNC benchmarks [53]. The number of circuits N is varied between 2 and 4. This is repeated 20 times for each value of N. The randomly chosen circuits for N = 2 were already presented in Table 5 in Sect. 6.2.2. The circuits for N = 3 and N = 4 can be found in Tables 9 and 10, respectively. In the experiments StaticRoute was not able to find a DPR solution with static parts when N was greater than 4. Based on the results from the previous section, only the case where 50 % of the SBs and 50 % of the CBs are marked static is considered in this section.
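The selection procedure described above can be sketched in a few lines of Python. The benchmark names are those that appear in Tables 5, 9 and 10. The seed, the function name and the choice to draw without replacement inside one experiment are assumptions made for illustration (every experiment in the tables lists distinct circuits); the sketch does not reproduce the actual draws used in the experiments.

```python
import random

# The 20 MCNC benchmark circuits named in Tables 5, 9 and 10.
MCNC20 = [
    "alu4", "apex2", "apex4", "bigkey", "clma", "des", "diffeq", "dsip",
    "elliptic", "ex1010", "ex5p", "frisc", "misex3", "pdc", "s298",
    "s38417", "s38584.1", "seq", "spla", "tseng",
]


def draw_experiments(n_circuits: int, n_experiments: int = 20, seed: int = 0):
    """Draw n_experiments random groups of n_circuits distinct benchmarks.

    The seed is a hypothetical parameter that only makes the sketch reproducible.
    """
    rng = random.Random(seed)
    return [rng.sample(MCNC20, n_circuits) for _ in range(n_experiments)]


for n in (2, 3, 4):
    experiments = draw_experiments(n)
    print(f"N = {n}: first group {experiments[0]}")
```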

7.2 Results

7.2.1 Impact on reconfiguration time

The reduction of reconfiguration time is calculated as explained in Sect. 6.2.2. The absolute values obtained in the experiments for N = 2 were already presented in Table 3 in Sect. 6.2.2. The results for N = 3 and N = 4 are presented in Tables 7 and 8, respectively. Column RRT shows the reduction of total reconfiguration time for each experiment. The average value is given at the bottom of this column.
Table 7

Absolute values of the number of configuration bits in the reconfigurable region for the experiments with 3 circuits (case with 50 % static SBs and 50 % static CBs)

| Experiment | \(B^{CLB}_{total}\) (bits) | \(B^{R}_{total}\) (bits) | \(B^{R}_{conv}\) (bits) | \(\frac{B^{R}_{conv}}{B^{R}_{total}}\) (%) | \(B_{conv}\) (bits) | \(B^{R}_{new}\) (bits) | \(B_{new}\) (bits) | \(RRT\) (%) |
|---|---|---|---|---|---|---|---|---|
| 0 | 362880 | 834968 | 814276 | 98 | 1177156 | 407936 | 770816 | \(-\)35 |
| 1 | 542080 | 1229004 | 1183194 | 96 | 1725274 | 590566 | 1132646 | \(-\)34 |
| 2 | 362880 | 834968 | 784344 | 94 | 1147224 | 388518 | 751398 | \(-\)35 |
| 3 | 448000 | 1023724 | 981674 | 96 | 1429674 | 491538 | 939538 | \(-\)34 |
| 4 | 219520 | 505660 | 472676 | 93 | 692196 | 237882 | 457402 | \(-\)34 |
| 5 | 592480 | 1339474 | 1250760 | 93 | 1843240 | 627214 | 1219694 | \(-\)34 |
| 6 | 362880 | 834968 | 810054 | 97 | 1172934 | 406298 | 769178 | \(-\)34 |
| 7 | 362880 | 834968 | 792518 | 95 | 1155398 | 393646 | 756526 | \(-\)35 |
| 8 | 592480 | 1339474 | 1295450 | 97 | 1887930 | 649926 | 1242406 | \(-\)34 |
| 9 | 592480 | 1339474 | 1277194 | 95 | 1869674 | 641528 | 1234008 | \(-\)34 |
| 10 | 592480 | 1339474 | 1290064 | 96 | 1882544 | 644290 | 1236770 | \(-\)34 |
| 11 | 542080 | 1229004 | 1186834 | 97 | 1728914 | 595260 | 1137340 | \(-\)34 |
| 12 | 592480 | 1339474 | 1242988 | 93 | 1835468 | 625080 | 1217560 | \(-\)34 |
| 13 | 592480 | 1339474 | 1252764 | 94 | 1845244 | 631612 | 1224092 | \(-\)34 |
| 14 | 592480 | 1339474 | 1260918 | 94 | 1853398 | 635036 | 1227516 | \(-\)34 |
| 15 | 542080 | 1229004 | 1193506 | 97 | 1735586 | 598026 | 1140106 | \(-\)34 |
| 16 | 323680 | 745342 | 732460 | 98 | 1056140 | 367414 | 691094 | \(-\)35 |
| 17 | 362880 | 834968 | 817792 | 98 | 1180672 | 410326 | 773206 | \(-\)35 |
| 18 | 362880 | 834968 | 807504 | 97 | 1170384 | 405158 | 768038 | \(-\)34 |
| 19 | 219520 | 505660 | 470966 | 93 | 690486 | 235070 | 454590 | \(-\)34 |
| | | | | Avg: 96 | | | | Avg: \(-\)34 |

Table 8

Absolute values of the number of configuration bits in the reconfigurable region for the experiments with 4 circuits (case with 50 % static SBs and 50 % static CBs)

| Experiment | \(B^{CLB}_{total}\) (bits) | \(B^{R}_{total}\) (bits) | \(B^{R}_{conv}\) (bits) | \(\frac{B^{R}_{conv}}{B^{R}_{total}}\) (%) | \(B_{conv}\) (bits) | \(B^{R}_{new}\) (bits) | \(B_{new}\) (bits) | \(RRT\) (%) |
|---|---|---|---|---|---|---|---|---|
| 0 | 323680 | 745342 | 730902 | 98 | 1054582 | 366002 | 689682 | \(-\)35 |
| 1 | 592480 | 1339474 | 1267310 | 95 | 1859790 | 638668 | 1231148 | \(-\)34 |
| 2 | 448000 | 1023724 | 992204 | 97 | 1440204 | 496648 | 944648 | \(-\)34 |
| 3 | 362880 | 834968 | 812182 | 97 | 1175062 | 406682 | 769562 | \(-\)35 |
| 4 | 592480 | 1339474 | 1296204 | 97 | 1888684 | 651440 | 1243920 | \(-\)34 |
| 5 | 323680 | 745342 | 732460 | 98 | 1056140 | 367110 | 690790 | \(-\)35 |
| 6 | 362880 | 834968 | 810054 | 97 | 1172934 | 406336 | 769216 | \(-\)34 |
| 7 | 592480 | 1339474 | 1296026 | 97 | 1888506 | 652396 | 1244876 | \(-\)34 |
| 8 | 592480 | 1339474 | 1280810 | 96 | 1873290 | 649328 | 1241808 | \(-\)34 |
| 9 | 286720 | 663204 | 581308 | 88 | 868028 | 293556 | 580276 | \(-\)33 |
| 10 | 542080 | 1229004 | 1184598 | 96 | 1726678 | 591908 | 1133988 | \(-\)34 |
| 11 | 219520 | 505660 | 478984 | 95 | 698504 | 239326 | 458846 | \(-\)34 |
| 12 | 592480 | 1339474 | 1297576 | 97 | 1890056 | 653650 | 1246130 | \(-\)34 |
| 13 | 286720 | 663204 | 584550 | 88 | 871270 | 295046 | 581766 | \(-\)33 |
| 14 | 592480 | 1339474 | 1267918 | 95 | 1860398 | 638662 | 1231142 | \(-\)34 |
| 15 | 592480 | 1339474 | 1263064 | 94 | 1855544 | 634352 | 1226832 | \(-\)34 |
| 16 | 592480 | 1339474 | 1293554 | 97 | 1886034 | 653042 | 1245522 | \(-\)34 |
| 17 | 592480 | 1339474 | 1267744 | 95 | 1860224 | 637892 | 1230372 | \(-\)34 |
| 18 | 493920 | 1123754 | 1045530 | 93 | 1539450 | 529390 | 1023310 | \(-\)34 |
| 19 | 362880 | 834968 | 798992 | 96 | 1161872 | 399040 | 761920 | \(-\)34 |
| | | | | Avg: 95 | | | | Avg: \(-\)34 |

In these tables we see that the impact on reconfiguration time hardly depends on the number of implemented circuits, N. As mentioned before, the reduction of reconfiguration time depends mostly on the relative size of the configuration memory dedicated to the different components of the reconfigurable fabric, namely the CLBs, the connection blocks and the switch blocks. We again added a column that shows the fraction \(\frac{B^{R}_{conv}}{B^{R}_{total}}\). This column clearly shows that, in the case of conventional DPR, the dynamic bits are scattered over around 95 % of the frames of the reconfigurable region.

In the case of StaticRoute the static part does not need to be rewritten at run-time. In the experiments in this section the fractions of static SBs and static CBs were both chosen to be 50 %, which results in a reduction of routing reconfiguration time of 50 %. In Tables 3, 7 and 8 we see that this corresponds to a significant decrease in total reconfiguration time of 34 %.

We also note that the memory needed to store the configuration bit streams will decrease as the configuration data of the static frames only needs to be stored once. This results in a reduction of the size of the configuration data of 17 % for N = 2, 23 % for N = 3 and 25 % for N = 4.
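The storage figures above follow a simple pattern that the Python sketch below makes explicit. It is a normalised model under stated assumptions: every circuit's configuration has the same size, the conventional flow stores one full configuration per circuit, and the shared static frames make up about 34 % of a configuration (the value is taken from the RRT result and is an assumption here, as is the function name). The model gives about 17.0 %, 22.7 % and 25.5 % for N = 2, 3 and 4, which matches the reported 17, 23 and 25 % to within rounding.

```python
def storage_reduction(static_fraction: float, n_circuits: int) -> float:
    """Reduction in bitstream storage, in percent, when static frames are stored once.

    static_fraction is the share of one configuration held in static frames.
    Sizes are normalised so that one full configuration has size 1.
    """
    conventional = n_circuits * 1.0
    shared = static_fraction + n_circuits * (1.0 - static_fraction)
    return 100.0 * (conventional - shared) / conventional


for n in (2, 3, 4):
    print(f"N = {n}: about {storage_reduction(0.34, n):.1f} % less storage")
```

The expression simplifies to static_fraction × (N − 1) / N, so the saving grows with N but is bounded by the static fraction itself.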

7.2.2 Impact on maximum clock frequency

In this section we take a look at how the maximum clock frequency is affected as the number of circuits N increases. As mentioned earlier, StaticRoute was not able to find a DPR solution with static parts when N is greater than 4.

The results obtained in all the experiments for this section can be found in Table 5 for N = 2 and in Tables 9 and 10 for N = 3 and N = 4, respectively. In these tables the maximum clock frequencies obtained using the conventional DPR flow are denoted \(F_{max}\), and those obtained using StaticRoute are denoted \(F_{max}^S\). The last columns of these tables show the difference between the corresponding maximum clock frequencies.

Figure 15 shows the average values of these tables, namely the average decrease in maximum clock frequency compared to conventional DPR. The standard deviation over the experiments is shown in this figure using error bars. As discussed in Sect. 6.2.2, the decrease is 6 % on average for 2 circuits. The results show that the decrease grows as the number of circuits N increases: it is 9 % on average for 3 circuits and 15 % for 4 circuits.
Fig. 15

Average and standard deviation of the decrease in maximum clock frequency as a function of the number of circuits

Table 9

Absolute values of the maximum attainable clock frequencies for the experiments with 3 circuits (case with 50 % static SBs and 50 % static CBs)

| Exp. | Circ. 0 | Circ. 1 | Circ. 2 | \(F_{max}\) 0 (MHz) | \(F_{max}\) 1 (MHz) | \(F_{max}\) 2 (MHz) | \(F_{max}^S\) 0 (MHz) | \(F_{max}^S\) 1 (MHz) | \(F_{max}^S\) 2 (MHz) | Diff. 0 (%) | Diff. 1 (%) | Diff. 2 (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tseng | frisc | dsip | 219 | 119 | 320 | 198 | 112 | 313 | \(-\)10 | \(-\)6 | \(-\)2 |
| 1 | apex4 | s38584.1 | seq | 207 | 243 | 201 | 186 | 224 | 187 | \(-\)10 | \(-\)8 | \(-\)7 |
| 2 | spla | s298 | apex2 | 166 | 176 | 183 | 151 | 148 | 170 | \(-\)9 | \(-\)16 | \(-\)7 |
| 3 | ex5p | seq | ex1010 | 258 | 207 | 167 | 220 | 175 | 141 | \(-\)15 | \(-\)15 | \(-\)16 |
| 4 | apex2 | dsip | misex3 | 179 | 320 | 219 | 168 | 294 | 188 | \(-\)6 | \(-\)8 | \(-\)14 |
| 5 | s38417 | apex2 | apex4 | 191 | 181 | 226 | 180 | 167 | 196 | \(-\)6 | \(-\)8 | \(-\)13 |
| 6 | ex5p | s298 | frisc | 250 | 176 | 119 | 213 | 156 | 109 | \(-\)15 | \(-\)11 | \(-\)8 |
| 7 | seq | spla | tseng | 213 | 166 | 219 | 178 | 156 | 198 | \(-\)16 | \(-\)6 | \(-\)10 |
| 8 | des | s38417 | bigkey | 278 | 191 | 313 | 267 | 176 | 282 | \(-\)4 | \(-\)8 | \(-\)10 |
| 9 | bigkey | elliptic | clma | 313 | 135 | 139 | 271 | 130 | 133 | \(-\)13 | \(-\)4 | \(-\)4 |
| 10 | s38417 | s38584.1 | elliptic | 191 | 200 | 135 | 179 | 188 | 124 | \(-\)6 | \(-\)6 | \(-\)8 |
| 11 | frisc | s38584.1 | dsip | 116 | 243 | 294 | 109 | 235 | 272 | \(-\)6 | \(-\)3 | \(-\)7 |
| 12 | ex5p | s38417 | s298 | 255 | 191 | 172 | 220 | 179 | 156 | \(-\)14 | \(-\)6 | \(-\)9 |
| 13 | frisc | pdc | s38417 | 118 | 156 | 191 | 107 | 140 | 179 | \(-\)9 | \(-\)10 | \(-\)6 |
| 14 | clma | tseng | misex3 | 139 | 215 | 213 | 132 | 196 | 196 | \(-\)5 | \(-\)9 | \(-\)8 |
| 15 | s38584.1 | pdc | dsip | 243 | 142 | 294 | 228 | 137 | 275 | \(-\)6 | \(-\)4 | \(-\)6 |
| 16 | misex3 | elliptic | bigkey | 226 | 140 | 306 | 213 | 131 | 267 | \(-\)6 | \(-\)6 | \(-\)13 |
| 17 | apex4 | frisc | des | 223 | 119 | 278 | 194 | 112 | 263 | \(-\)13 | \(-\)6 | \(-\)5 |
| 18 | ex5p | des | spla | 250 | 278 | 166 | 216 | 238 | 157 | \(-\)14 | \(-\)14 | \(-\)5 |
| 19 | bigkey | s298 | dsip | 288 | 174 | 320 | 272 | 151 | 272 | \(-\)6 | \(-\)13 | \(-\)15 |

Avg: \(F_{max}\) 210 MHz, \(F_{max}^S\) 190 MHz, Diff. \(-\)9 %

Table 10

Absolute values of the maximum attainable clock frequencies for the experiments with 4 circuits (case with 50 % static SBs and 50 % static CBs)

| Exp. | Circ. 0 | Circ. 1 | Circ. 2 | Circ. 3 | \(F_{max}\) 0 (MHz) | \(F_{max}\) 1 (MHz) | \(F_{max}\) 2 (MHz) | \(F_{max}\) 3 (MHz) | \(F_{max}^S\) 0 (MHz) | \(F_{max}^S\) 1 (MHz) | \(F_{max}^S\) 2 (MHz) | \(F_{max}^S\) 3 (MHz) | Diff. 0 (%) | Diff. 1 (%) | Diff. 2 (%) | Diff. 3 (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | elliptic | apex2 | diffeq | misex3 | 140 | 190 | 195 | 226 | 130 | 154 | 150 | 204 | \(-\)7 | \(-\)19 | \(-\)23 | \(-\)10 |
| 1 | misex3 | clma | alu4 | dsip | 213 | 139 | 226 | 320 | 174 | 115 | 202 | 288 | \(-\)18 | \(-\)17 | \(-\)11 | \(-\)10 |
| 2 | diffeq | ex1010 | bigkey | elliptic | 190 | 167 | 313 | 148 | 166 | 153 | 243 | 131 | \(-\)13 | \(-\)8 | \(-\)22 | \(-\)11 |
| 3 | bigkey | seq | frisc | apex4 | 335 | 213 | 119 | 223 | 269 | 166 | 107 | 191 | \(-\)20 | \(-\)22 | \(-\)10 | \(-\)14 |
| 4 | des | frisc | clma | elliptic | 278 | 118 | 139 | 135 | 251 | 101 | 125 | 118 | \(-\)10 | \(-\)14 | \(-\)10 | \(-\)13 |
| 5 | pdc | s38584.1 | s38417 | des | 156 | 200 | 191 | 278 | 127 | 169 | 150 | 226 | \(-\)19 | \(-\)16 | \(-\)21 | \(-\)19 |
| 6 | apex4 | seq | dsip | des | 210 | 226 | 300 | 314 | 186 | 191 | 251 | 245 | \(-\)11 | \(-\)15 | \(-\)16 | \(-\)22 |
| 7 | s38417 | tseng | des | s298 | 191 | 215 | 278 | 172 | 177 | 178 | 250 | 138 | \(-\)7 | \(-\)17 | \(-\)10 | \(-\)20 |
| 8 | s38417 | s38584.1 | ex5p | dsip | 191 | 200 | 255 | 320 | 156 | 166 | 220 | 247 | \(-\)18 | \(-\)17 | \(-\)14 | \(-\)23 |
| 9 | apex4 | des | alu4 | tseng | 210 | 314 | 237 | 225 | 192 | 254 | 214 | 201 | \(-\)9 | \(-\)19 | \(-\)10 | \(-\)11 |
| 10 | dsip | apex2 | s38584.1 | seq | 294 | 193 | 243 | 201 | 245 | 167 | 222 | 181 | \(-\)17 | \(-\)13 | \(-\)9 | \(-\)10 |
| 11 | bigkey | alu4 | dsip | tseng | 288 | 237 | 320 | 226 | 267 | 195 | 258 | 201 | \(-\)7 | \(-\)18 | \(-\)19 | \(-\)11 |
| 12 | ex5p | dsip | misex3 | clma | 255 | 320 | 213 | 139 | 219 | 272 | 176 | 122 | \(-\)14 | \(-\)15 | \(-\)17 | \(-\)12 |
| 13 | diffeq | tseng | s298 | elliptic | 195 | 215 | 174 | 140 | 179 | 191 | 156 | 130 | \(-\)8 | \(-\)11 | \(-\)10 | \(-\)7 |
| 14 | s298 | frisc | apex4 | ex5p | 176 | 119 | 223 | 250 | 139 | 104 | 200 | 217 | \(-\)21 | \(-\)13 | \(-\)10 | \(-\)13 |
| 15 | clma | alu4 | apex2 | s298 | 139 | 226 | 181 | 172 | 118 | 176 | 152 | 141 | \(-\)15 | \(-\)22 | \(-\)16 | \(-\)18 |
| 16 | alu4 | s38417 | ex5p | des | 226 | 191 | 255 | 278 | 192 | 160 | 220 | 250 | \(-\)15 | \(-\)16 | \(-\)14 | \(-\)10 |
| 17 | clma | diffeq | ex1010 | ex5p | 139 | 215 | 161 | 255 | 115 | 183 | 138 | 216 | \(-\)17 | \(-\)15 | \(-\)14 | \(-\)15 |
| 18 | seq | alu4 | pdc | tseng | 204 | 241 | 151 | 226 | 173 | 210 | 126 | 198 | \(-\)15 | \(-\)13 | \(-\)17 | \(-\)12 |
| 19 | diffeq | s298 | elliptic | spla | 200 | 176 | 149 | 166 | 161 | 147 | 130 | 145 | \(-\)20 | \(-\)16 | \(-\)13 | \(-\)13 |

Avg: \(F_{max}\) 214 MHz, \(F_{max}^S\) 182 MHz, Diff. \(-\)15 %

These values can also be seen in Table 11, together with the values of the minima, maxima and standard deviations. We see that the maximum values follow the same trend. They are 12 % for 2 circuits and increase to 16 % for 3 circuits and 23 % for 4 circuits. The minimum values in Table 11 indicate that there are also some circuits that are barely influenced by our technique.
Table 11

Results for decrease in maximum clock frequency of the implementations with several circuits, in % rel. to conventional DPR

| No. of circuits | Average (%) | Maximum (%) | Minimum (%) | Std. dev. (%) |
|---|---|---|---|---|
| 2 | \(-\)6 | \(-\)12 | \(-\)3 | 2.78 |
| 3 | \(-\)9 | \(-\)16 | \(-\)2 | 3.81 |
| 4 | \(-\)15 | \(-\)23 | \(-\)7 | 4.23 |

8 Conclusion

Dynamic partial reconfiguration (DPR) allows different circuits to be time-multiplexed on the same FPGA region, resulting in considerable area savings. However, the conventional DPR flow tends to result in long reconfiguration times when switching between circuits. To overcome this problem we introduced a novel router, called StaticRoute. It reduces the part of the FPGA that needs to be reconfigured by sharing a part of the routing configuration between circuits, resulting in static bits that do not need reconfiguration.

StaticRoute uses the notion of switch congestion, which occurs when a dynamic bit resides in the part of the configuration memory that is marked static. We showed that it is possible to detect dynamic bits during routing in the extended routing resource graph. StaticRoute is an extended version of the Pathfinder algorithm and is used to route all circuits simultaneously. It is able to resolve both wire and switch congestion. Therefore, using StaticRoute, the dynamic bits are no longer scattered over the complete configuration memory of the routing. Instead they are clustered in the dynamic part. To the best of our knowledge we are the first to propose such a method, which can be used in a frame-based reconfiguration approach.

In this paper we have shown that the static fraction of the configuration memory is best spread out evenly over connection blocks and switch blocks and should not be chosen higher than 50 % for either one. Our results indicate that our new method can be used for typical commercially available SRAM-based FPGAs. In our experiments we have shown that, when the number of circuits is limited, a reduction of 50 % in routing reconfiguration time can be achieved while the impact on the maximum clock frequency is acceptable. This results in a significant reduction of 34 % of total reconfiguration time.
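As a reminder of what switch congestion means, the Python sketch below checks a set of finished routings for it. It is a minimal illustration and not StaticRoute itself: it assumes that each circuit's routing is given as the set of switches it turns on, that a bit is dynamic when its value differs between at least two circuits, and that the static region is given as a set of switch identifiers. All names and identifiers are hypothetical.

```python
def dynamic_bits(on_switches_per_circuit):
    """Switches whose configuration bit is not the same in every circuit.

    on_switches_per_circuit is a list with one set per circuit, containing the
    identifiers of the switches that circuit turns on.
    """
    circuits = [set(c) for c in on_switches_per_circuit]
    used_by_some = set().union(*circuits)
    used_by_all = set.intersection(*circuits)
    return used_by_some - used_by_all


def switch_congestion(on_switches_per_circuit, static_switches):
    """Dynamic bits that fall in the region marked static (these must be resolved)."""
    return dynamic_bits(on_switches_per_circuit) & set(static_switches)


# Two toy circuits sharing switch "sb0.a"; "sb0.b" and "cb1.c" differ between them.
circuit0 = {"sb0.a", "sb0.b"}
circuit1 = {"sb0.a", "cb1.c"}
static_region = {"sb0.a", "sb0.b"}

print("dynamic bits:", sorted(dynamic_bits([circuit0, circuit1])))
print("switch congestion:", sorted(switch_congestion([circuit0, circuit1], static_region)))
```

A routing is acceptable for the static part only when the second set is empty, that is, when every differing bit lies in the dynamic region.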

Footnotes

  1. 1.

    This is a simplification. The nodes can also represent logical pins or sources or sinks. These are treated in the same way [10].

Notes

Acknowledgments

Brahim Al Farisi is sponsored by IWT, Agency for Innovation through Science and Technology in Flanders. Karel Heyse is supported by a Ph.D. grant of the Flemish Fund for Scientific Research (FWO – Vlaanderen).

References

  1. 1.
    Bleeding edge threats website. http://www.bleedingthreats.net
  2. 2.
    Abouelella F, Davidson T, Meeus W, Bruneel K, Stroobandt D (2013) How to efficiently implement dynamic circuit specialization systems. ACM Trans Des Autom Electron Syst 38
  3. 3.
    Al Farisi B, Bruneel K, Cardoso JMP, Stroobandt D (2013) An automatic tool flow for the combined implementation of multi-mode circuits. In: Proceedings of the design, automation, and test in Europe conference and exhibition, Grenoble, France, pp 821–826
  4. 4.
    Al Farisi B, Bruneel K, Stroobandt D (2013) StaticRoute: a novel router for the dynamic partial reconfiguration of FPGAs. In: 23rd IEEE international conference on field programmable logic and applications (FPL), IEEE, pp 1–7
  5. 5.
    Al Farisi B, Heyse K, Bruneel K, Stroobandt D (2011) Memory-efficient and fast run-time reconfiguration of regularly structured designs. In: 21st International conference on field programmable logic and applications, Chania, Crete, Greece, pp 171–176
  6. 6.
    Al Farisi B, Vansteenkiste E, Bruneel K, Stroobandt D (2013) A novel tool flow for increased routing configuration similarity in multi-mode circuits. In: Proceedings of IEEE computer society annual symposium on VLSI 2013 (ISVLSI13), Natal, Brazil, pp 96–101
  7. 7.
    Altera (2012) Engineering change management with the chip planner
  8. 8.
    Altera (2014) Stratix IV device handbook. http://www.altera.com/literature/hb/stratix-iv/stx4_5v4.pdf
  9. 9.
    Becker T, Koester M, Luk W (2010) Automated placement of reconfigurable regions for relocatable modules. In: Proceedings of 2010 IEEE international symposium on circuits and systems (ISCAS), IEEE, pp 3341–3344
  10. 10.
    Betz V, Rose J, Marquardt A (eds) (1999) Architecture and CAD for deep-submicron FPGAs. Kluwer Academic Publishers, Norwell
  11. 11.
    Chavet C, Andriamisaina C, Coussy P, Casseau E, Juin E, Urard P, Martin E (2007) A design flow dedicated to multi-mode architectures for DSP applications. In: IEEE/ACM international conference on computer-aided design, 2007 (ICCAD 2007), IEEE, pp 604–611
  12. 12.
    Chen W, Wang Y, Wang X, Peng C (2008) A new placement approach to minimizing FPGA reconfiguration data. In: International conference on embedded software and systems ICESS’08, IEEE, pp 169–174
  13. 13.
    Claus C, Ahmed R, Altenried F, Stechele W (2010) Towards rapid dynamic partial reconfiguration in video-based driver assistance systems. In: Reconfigurable computing: architectures, tools and applications, Springer, pp 55–67
  14. 14.
    Compton K, Hauck S (2002) Reconfigurable computing: a survey of systems and software. ACM Comput Surv (CSUR) 34(2):171–210
  15. 15.
    Cordone R, Redaelli F, Redaelli MA, Santambrogio MD, Sciuto D (2009) Partitioning and scheduling of task graphs on partially dynamically reconfigurable FPGAs. IEEE Trans Comput Aided Des Integr Circuits Syst 28(5):662–675
  16. 16.
    Coussy P, Lhairech-Lebreton G, Heller D, Martin E (2010) Gaut-a free and open source high-level synthesis tool. In: IEEE design automation and test in Europe-university booth
  17. 17.
    Della Torre M, Malik U, Diessel O (2005) A configuration system architecture supporting bit-stream compression for FPGAs. In: Advances in computer systems architecture, Springer, pp 415–428
  18. 18.
    Diessel O, ElGindy H, Middendorf M, Schmeck H, Schmidt B (2000) Dynamic scheduling of tasks on partially reconfigurable FPGAs. In: IEE Proceedings computers and digital techniques, vol. 147, IET, pp 181–188
  19. 19.
    Duhem F, Muller F, Lorenzini P (2011) Farm: fast reconfiguration manager for reducing reconfiguration time overhead on FPGA. In: Reconfigurable computing: architectures, tools and applications, Springer, pp 253–260
  20. 20.
    Eto E (2003) Difference-based partial reconfiguration
  21. 21.
    Hansen SG, Koch D, Torresen J High speed partial run-time reconfiguration using enhanced ICAP hard macro. In: IEEE international symposium on parallel and distributed processing workshops and Phd forum (IPDPSW), IEEE, pp 174–180
  22. 22.
    Hariyama M, Muthumala WH, Kameyama M (2006) Dynamically reconfigurable gate array based on fine-grained switch elements and its CAD environment. In: IEEE Asian solid-state circuits conference, 2006 (ASSCC 2006), IEEE, pp 155–158
  23. 23.
    Heyse K, Al Farisi B, Bruneel K, Stroobandt D (2012) Automating reconfiguration chain generation for SRL-based run-time reconfiguration. In: Lecture notes in computer science, vol. 7199, Springer, Berlin, Germany, pp 1–12
  24. 24.
    Hubner M, Gohringer D, Noguera J, Becker J (2010) Fast dynamic and partial reconfiguration data path with low hardware overhead on Xilinx FPGAs. In: IEEE international symposium on parallel & distributed processing, Workshops and Phd forum (IPDPSW), IEEE, pp 1–8
  25. 25.
    Kalte H, Lee G, Porrmann M, Ruckert U (2005) Replica: a bitstream manipulation filter for module relocation in partial reconfigurable systems. In: Proceedings of the 19th IEEE international symposium on parallel and distributed processing, IEEE, p 151b
  26. 26.
    Koch D, Beckhoff C, Teich J (2009) Minimizing internal fragmentation by fine-grained two-dimensional module placement for runtime reconfigurable systems. In: 17th IEEE symposium on field programmable custom computing machines, FCCM’09, IEEE, pp 251–254
  27. 27.
    Lavin C, Padilla M, Lamprecht J, Lundrigan P, Nelson B, Hutchings B (2011) Rapidsmith: do-it-yourself CAD tools for Xilinx FPGAs. In: International conference on field programmable logic and applications (FPL), IEEE, pp 349–355
  28. 28.
    Lemieux G, Lee E, Tom M, Yu A (2004) Directional and single-driver wires in FPGA interconnect. In: IEEE international conference on field-programmable technology, IEEE, pp 41–48
  29. 29.
    Lewis D, Ahmed E, Baeckler G, Betz V, Bourgeault M, Cashman D, Galloway D, Hutton M, Lane C, Lee A et al (2005) The Stratix II logic and routing architecture. In: Proceedings of the 2005 ACM/SIGDA 13th international symposium on field-programmable gate arrays, ACM, pp 14–20
  30. 30.
    Li Z, Hauck S (2002) Configuration prefetching techniques for partial reconfigurable coprocessor with relocation and defragmentation. In: Proceedings of the 2002 ACM/SIGDA tenth international symposium on field-programmable gate arrays, ACM, pp 187–195
  31. 31.
    Lindholm JV, McEwen IL, Young JT (2006) Routing with frame awareness to minimize device programming time and test cost. US Patent 7,149,997
  32. 32.
    Luu J, Rose J (2012) VPR 6.0 user manual. vtr-verilog-to-routing.googlecode.com/files/VPR_User_Manual_6.0.pdf
  33. 33.
    Manet P, Maufroid D, Tosi L, Gailliard G, Mulertt O, Di Ciano M, Legat JD, Aulagnier D, Gamrat C, Liberati R et al (2008) An evaluation of dynamic partial reconfiguration for signal and image processing in professional electronics applications. EURASIP J Embedded Syst 2008:1
  34. 34.
    Marconi T, Hur JY, Bertels K, Gaydadjiev G (2010) A novel configuration circuit architecture to speedup reconfiguration and relocation for partially reconfigurable devices. In: IEEE 8th symposium on application specific processors (SASP), 2010, IEEE, pp 87–92
  35. 35.
    Masud M, Wilton S (1999) A new switch block for segmented FPGAs. In: Lysaght P, Irvine J, Hartenstein R (eds) Field programmable logic and applications, vol 1673. Lecture notes in computer science, Springer, Berlin, pp 274–281
  36. 36.
    McMurchie L, Ebeling C (1995) Pathfinder: a negotiation-based performance-driven router for FPGAs. In: Proceedings of the 1995 ACM third international symposium on field-programmable gate arrays, ACM, pp 111–117
  37. 37.
    Nava F, Sciuto D, Santambrogio MD, Herbrechtsmeier S, Porrmann M, Witkowski U, Rueckert U (2011) Applying dynamic reconfiguration in the mobile robotics domain: a case study on computer vision algorithms. ACM Trans Reconfigurable Technol Syst 4(3):29:1–29:22
  38. 38.
    Papadimitriou K, Dollas A, Hauck S (2011) Performance of partial reconfiguration in FPGA systems: a survey and a cost model. ACM Trans Reconfigurable Technol Syst 4(4):36:1–36:24
  39. 39.
    Prasad Raghuraman K, Wang H, Tragoudas S (2005) A novel approach to minimizing reconfiguration cost for LUT-based FPGAs. In: 18th international conference on VLSI design, IEEE, pp 673–676
  40. 40.
    Raghuraman K, Wang H, Tragoudas S (2006) Minimizing FPGA reconfiguration data at logic level. In: Proceedings of the 7th international symposium on quality electronic design, IEEE Computer Society, pp 219–224
  41. 41.
    Rose J, Luu J, Yu CW, Densmore O, Goeders J, Somerville A, Kent KB, Jamieson P, Anderson J (2012) The VTR project: architecture and CAD for FPGAs from verilog to routing. In: Proceedings of FPGA, ACM, pp 77–86
  42. 42.
    Rousseau B, Manet P, Delavallée T, Loiselle I, Legat JD (2012) Dynamically reconfigurable architectures for software-defined radio in professional electronic applications. In: Design technology for heterogeneous embedded systems, Springer, pp 437–455
  43. 43.
    Rullmann M, Merker R (2006) Maximum edge matching for reconfigurable computing. In: 20th International parallel and distributed processing symposium, 2006 (IPDPS 2006), IEEE
  44. 44.
    Sedcole P, Blodget B, Becker T, Anderson J, Lysaght P (2006) Modular dynamic reconfiguration in Virtex FPGAs. Comput Digital Tech 153(3):157–164
  45. 45.
    Shang L, Jha NK (2002) Hardware–software co-synthesis of low power real-time distributed embedded systems with dynamically reconfigurable FPGAs. In: Proceedings of the 2002 Asia and South Pacific design automation conference, IEEE Computer Society, p 345
  46. 46.
    Smith AM, Constantinides GA, Cheung PY (2009) Area estimation and optimisation of FPGA routing fabrics. In: International conference on field programmable logic and applications FPL 2009, IEEE, pp 256–261Google Scholar
  47. 47.
    Sourdis I, Bispo J, Cardoso J, Vassiliadis S (2008) Regular expression matching in reconfigurable hardware. J Signal Process Syst 51:99–121CrossRefGoogle Scholar
  48. 48.
    Tan H, DeMara RF (2006) A physical resource management approach to minimizing FPGA partial reconfiguration overhead. In: IEEE international conference on reconfigurable computing and FPGA’s, 2006 (ReConFig 2006), IEEE, pp 1–5Google Scholar
  49. 49.
    Trimberger S (1998) Scheduling designs into a time-multiplexed fpga. In: Proceedings of the 1998 ACM/SIGDA sixth international symposium on Field programmable gate arrays, ACM, pp 153–160Google Scholar
  50. 50.
    Trimberger S, Carberry D, Johnson A, Wong J (1997) A time-multiplexed FPGA. In: Proceedings of the 5th annual IEEE symposium on field-programmable custom computing machines, 1997, IEEE, pp 22–28Google Scholar
  51. 51.
    Vipin K, Fahmy SA (2012) A high speed open source controller for FPGA partial reconfiguration. In: International conference on field-programmable technology (FPT), IEEE, pp 61–66Google Scholar
  52. 52.
    Xilinx (2012) UG191(v3.11): Virtex-5 FPGA user guide. XilinxGoogle Scholar
  53. 53.
    Yang S (1991) Logic synthesis and optimization benchmarks user guide: version 3.0. CiteseerGoogle Scholar

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Brahim Al Farisi (1)
  • Karel Heyse (1)
  • Karel Bruneel (1)
  • João Cardoso (2)
  • Dirk Stroobandt (1)
  1. Computing Systems Lab, Ghent University, Ghent, Belgium
  2. University of Porto, Porto, Portugal
