Solving the Discrete Logarithm of a 113Bit Koblitz Curve with an FPGA Cluster
Abstract
Using FPGAs to compute the discrete logarithms of elliptic curves is a wellknown method. However, until to date only CPU clusters succeeded in computing new elliptic curve discrete logarithm records. This work presents a highspeed FPGA implementation that was used to compute the discrete logarithm of a 113bit Koblitz curve. The core of the design is a fully unrolled, highly pipelined, selfsufficient Pollard’s rho iteration function. An 18core Virtex6 FPGA cluster computed the discrete logarithm of a 113bit Koblitz curve in extrapolated 24 days. Until to date, no attack on such a large Koblitz curve succeeded using as little resources or in such a short time frame.
Keywords
Elliptic curve cryptography Discrete logarithm problem Koblitz curve Hardware design FPGA Discrete logarithm record1 Introduction
It is possible to repeatedly fold a standard lettersized sheet of paper at the midway point about six to seven times. In 2012, some MIT students [28] were able to fold an 1.2 Km long toilet paper 13 times. And every time the paper was folded, the number of layers on top of each other doubled. Therefore, the MIT students ended up with \(2^{13}=8192\) layers of paper on top of each other. And poor Eve’s job was to manually count all layers one by one.
Similar principles apply in cryptography, although bigger numbers are involved. In Elliptic Curve Cryptography (ECC), where \(\lceil \log _2{n} \rceil \)bit private keys are used, Eve does not have to iterate through all possible \(n\) keys. Instead, Eve would use the more efficient parallelizable Pollard’s rho algorithm that finishes in approximately \(\sqrt{n}\) steps. The omnipresent question is how big \(n\) has to be such that even the most powerful adversaries are not able to reconstruct a private key. Especially in embedded, costsensitive applications, it is important to use keys that are only as large as necessary.
Discrete logarithms over elliptic curves were computed in the past [9, 18] and several experimental baselines were established. Since then, committees [3, 6] steadily increased the minimal \(n\) by simply applying Moore’s law. However, it is necessary to practically compute discrete logarithms to check to which margin the current standards hold.
The task of computing a discrete logarithm can be split into the work done by researchers and the work done by machines. This paper presents both a novel hardware architecture and the discrete logarithm of a 113bit Koblitz curve. The highly pipelined, highspeed, and practically extensively tested ECC Breaker FPGA design was used to solve the discrete logarithm of a 113bit Koblitz curve in extrapolated^{1} 24 days using mere 18 FPGAs. Therefore, ECC breaker is the first FPGA design to be used to solve a large discrete logarithm. Further, based on ECC Breaker it is even possible to compute discrete logarithms of even larger binaryfield elliptic curves. Substantiated by practical experimentation, this paper will make a notable contribution to the community’s activity of breaking larger elliptic curves and carving new standards.
This paper is structured as follows: Sect. 2 gives an overview on related work. Section 3 revisits some mathematical foundations and Sect. 4 summarizes the experiments with different iteration functions. The most suitable iteration function was implemented in hardware, which is described in Sect. 5. As the design is flexible enough to attack larger elliptic curves, Sect. 6 gives runtime and cost approximations. Section 7 summarizes the learned lessons and Sect. 8 concludes the paper. Appendix gives an overview of the targeted curve parameters and pseudorandomly chosen target points.
2 Related Work
Certicom [10] introduced ECC challenges in 1997 to increase industry acceptance of cryptosystems based on the elliptic curve discrete logarithm problem (ECDLP). They published system parameters and point challenges for different security levels. Since then, the hardest solved Certicom challenges are ECCp109 for primefield based elliptic curves, done by Monico et al. using a cluster of about 10,000 computers (mostly PCs) for 549 days, and ECC2109 for binaryfield based elliptic curves, also done by Monico et al., computing on a cluster of around 2,600 computers (mostly PCs) for around 510 days. Harley et al. [18] solved an ECDLP over a 109bit Koblitzcurve Certicom challenge (ECC2K108) with public participation using up to 9,500 PCs in 126 days.
As the Certicom challenges lie far apart (ECCp131 is 2,048 times more complex than ECCp109), also selfgenerated challenges have been broken. A discrete logarithm defined over a 112bit primefield elliptic curve was solved by Bos et al. [9], utilizing 200 Playstation 3 s for 6 months. A single playstation reached a throughput of \(42 \cdot 10^6\) iterations per second (IPS). This work presents the discrete logarithm of a 113bit binaryfield Koblitz curve that used 18 FPGAs for extrapolated 24 days and reached a throughput of \(165 \cdot 10^6\) iterations per FPGA per second.
Further, several attempts and approximations were done in order to attack larger elliptic curves using FPGAs. Dormale et al. [24] targeted ECC2113, ECC2131, and ECC2163 using Xilinx Spartan 3 FPGAs performing up to \(20 \cdot 10^6\) IPS. Most promising is the work of Bailey et al. [5] who attempt to break ECC2K130 using Nvidia GTX 295 graphics cards, Intel Core 2 Extreme CPUs, Sony PlayStation 3s, and Xilinx Spartan 3 FPGAs. Their FPGA implementation has a throughput of \(33.7 \cdot 10^6\) IPS and was later improved by Fan et al. [12] to process \(111 \cdot 10^6\) IPS. Other FPGA architectures were proposed by Güneysu et al. [16], Judge and Schaumont [20], and Mane et al. [21]. Güneysu et al.’s Spartan 3 architecture performs about \(173 \cdot 10^3\) IPS, Judge and Schaumont’s Virtex 5 architecture executes \(2.87 \cdot 10^6\) IPS, and Mane et al.’s Virtex 5 architecture does \(660 \cdot 10^3\) IPS.
So far, none of their FPGA implementations have been successful in solving ECDLPs. This work on the other hand presents a practically tested architecture which can be used to attack both Koblitz curves and binaryfield Weierstrass curves.
3 Mathematical Foundations
To ensure a common vocabulary, it is important to revisit some of the basics. Hankerson et al. [17] and Cohen et al. [11] shall be consulted for further details.
3.1 Elliptic Curve Cryptography
This paper focuses on Weierstrass curves that are defined over binary extension fields \(K=\mathbb {F}_{2^m}\). The curves are defined as \(E/K: y^2 + xy = x^3 + ax^2 + b\), where \(a\) and \(b\) are system parameters and a tuple of \(x\) and \(y\) which fulfills the equation is called a point \(P=(x,y)\). Using multiple points and the chordandtangent rule, it is possible to derive an additive group of order \(n\), suitable for cryptography. The number of points on an elliptic curve is denoted as \(\#E(K)=h \cdot n\), where \(n\) is a large prime and the cofactor \(h\) is typically in the range of 2 to 8. The core of all ECCbased cryptographic algorithms is a scalar multiplication \(Q=kP\), in which the scalar \(k \in [0,n1]\) is multiplied with a point \(P\) to derive \(Q\), where both points are of order \(n\).
As computing \(Q=kP\) can be costly, a lot of research was done on the efficient and secure computation of \(Q=kP\). A subset of binary Weierstrass curves, known as Koblitz curves (also known as anomalous binary curves), have some properties which make them especially interesting for fast implementations. They may make use of a map \(\sigma (x,y) = (x^2, y^2),~\sigma (\infty ) = (\infty )\), an automorphism of order \(m\) known as a Frobenius automorphism. This means that there exists an integer \(\lambda ^j\) such that \(\sigma ^j(P) = \lambda ^j P\). Another automorphism, which is not only applicable to Koblitz curves, is the negation map. The negative of a point \(P=(x, y)\) is \(P=(n1)P=(x, x+y)\).
3.2 Elliptic Curve Discrete Logarithm Problem (ECDLP)
The security of ECC lies in the intractability of the ECDLP: Given the two points \(Q\) and \(P\), connected by \(Q=kP\), it should be practically infeasible to compute the scalar \(0 \le k < n\). As standardized elliptic curves are designed such that neither the PohligHellman attack [25], nor the Weil and Tate pairing attacks [13, 23], nor the Weil descent attack [15] apply, the standard algorithm to compute a discrete logarithm is Pollard’s rho algorithm [26].
The main idea behind Pollard’s rho algorithm is to define an iteration function \(f\) that defines a random cyclic walk over a graph. Using Floyd’s cyclefinding algorithm, it is possible to find a pair of colliding triples, each consisting of a point \(X_i\) and two scalars \(c_i\) and \(d_i\). As \(X_1 = X_2 = c_1P + d_1Q = c_2P + d_2Q\) and \((d_2  d_1)Q = (d_2  d_1) k P = (c_1  c_2) P\), it is possible to compute \(k = (c_1  c_2) (d_2  d_1)^{1}\mod n\). Pollard’s rho algorithm expects to encounter such a collision after \(\sqrt{\frac{\pi n}{2}}\) steps.
In order to parallelize the attack efficiently, Van Oorschot and Wiener [30] introduced the concept of distinguished points. Distinguished points are a subset of points, which satisfy a particular condition. Such a condition can be a specific number of leading zero digits of a point’s xcoordinate or a particular range of Hamming weights in normal basis. Those distinguished points are stored in a central database, but can be computed in parallel. The number of instances running in parallel is linearly proportional to the achievable speedup. Note that each instance starts with a random starting triple and uses one of the iteration functions \(f\) which are discussed in the following section.
4 Selecting the Iteration Function
As the iteration function will be massively parallelized and synthesized in hardware, it is crucial to evaluate different iteration functions and select the most suitable one. In this work, the iteration functions by Teske [29], Wiener and Zuccherato [31], Gallant et al. [14], and Bailey et al. [4] were checked for their practical requirements and achievable computation rates. Table 1 summarizes the experiments done in software on a 41bit Koblitz curve.
What all iteration functions have in common is that they update a state, henceforth referred to as triple, consisting of two scalars \(c_i,d_i \in [0..n1]\) and a point \(X_i = c_iP + d_iQ\). An iteration function \(f\) deterministically computes \(X_{i+1} = f(X_i)\) and updates \(c_{i+1}\) and \(d_{i+1}\) accordingly, such that \(X_{i+1} = c_{i+1}P + d_{i+1}Q\) holds. A requirement on \(f\) is that it should be easily computable and to have the characteristics of a random function.
Teske’s \(r\) adding walk [29] is a nearly optimal choice for an iteration function. It partitions the elliptic curve group into \(r\) distinct subsets \(\left\{ S_1, S_2, ..., S_r\right\} \) of roughly equal size. If a point \(X_i\) is assigned to \(S_j\), the iteration function computes \(f(X_i) = X_i + R[j]\), with \(R[]\) being an \(r\)sized table consisting of linear combinations of \(P\) and \(Q\). After approximately \(\sqrt{\frac{\pi n}{2}}\) steps, Teske’s \(r\)adding walk finds two colliding points for all types of elliptic curves.
The Frobenius automorphism of Koblitz curves can not only be used to speedup the scalar multiplication, but also to improve the expected runtime of a parallelized Pollard’s rho by a factor of \(\sqrt{m}\). Wiener and Zuccherato [31], Gallant et al. [14], and Bailey et al. [4] proposed iteration functions which should achieve this \(\sqrt{m}\)speedup.
Wiener and Zuccherato [31] proposed to calculate \(f(X_i) = \sigma ^l(X_i + R[j])\; \forall \; l \in [0, m1]\) and choose the point, which has the smallest \(x\)coordinate when interpreted as an integer. Gallant et al. [14] introduced an iteration function based on a labeling function \(\mathcal {L}\), which maps the equivalence classes defined by the Frobenius automorphism to some set of representatives. The iteration function is then defined as \(f(X_i) = X_i + \sigma ^l(X_i)\), where \(l = \mathrm hash _m(\mathcal {L}(X_i))\). Bailey et al. [4] suggested to compute \(f(X_i) = X_i + \sigma ^{(l \mod 8) + 3}(X_i)\) to reduce the complexity of the iteration function.
Implementation results of all tested iteration functions.

In order to make sure that the potential iteration functions work as promised, a 41bit Koblitz curve was used to evaluate the iteration functions with a C implementation on a PC (cf. Table 1). As labeling function \(\mathcal {L}\), the Hamming weight of the \(x\)coordinate in normal basis was used. The hash function was disregarded. Table 1 summarizes the average number of iterations (computing 100 ECDLPs) of all tested iteration functions using four parallel threads. The experiments showed that the average number of iterations of Gallant’s and Bailey’s iteration functions are 1324 % higher compared to the iteration function by Wiener and Zuccherato. Additionally, with a probability of 1420 % some of the parallel threads produced identical sequences of distinguished points. Restarting the threads regularly or ondemand would counter this problem. Not countering the problem of fruitless threads would increase the average runtime of Gallant’s iteration function by another 29 %.
As Wiener and Zuccherato’s iteration function achieved the best speed and does not have the problem of fruitless threads, it was chosen to be implemented in hardware. Additionally, by leaving out the automorphism, the hardware can be used to attack general binaryfield Weierstrass curves as well.
5 ECC Breaker Hardware
Before the actual hardware architecture and its most critical components are discussed, it is important to review the basic design assumptions and core ideas of the ECC Breaker design.
5.1 Basic Assumptions and Decisions
ASIC Vs FPGA Design. In literature it is possible to find a lot of FPGA and ASIC designs optimized for some crucial characteristic. Some authors even dare to compare FPGA and ASIC results. However, several of the largest components in ASIC designs, e.g., registers, RAM, or integer multipliers, are for free in an FPGA design. For instance, every slice comes with several registers. Therefore, adding pipeline stages in a logicheavy FPGA design is basically for free. For this paper, Xilinx Virtex6 ML605 evaluation boards were chosen because of availability. Note that all following design decisions were made to maximize the performance of ECC Breaker on this particular board.
Design Goals. As Pollard’s rho algorithm is perfectly parallelizeable, the design goal clearly is to maximize the throughput per (given) area. Note that the speed (iterations per seconds) of an attack is linearly proportional to the throughput and inversely proportional to the chip area (more instances per FPGA also increase the speed). Therefore, the most basic design decision was whether to go for many small or a single large FPGA design.
Core Idea. In earlier designs, many areaefficient architectures were considered, each coming with a single \(\mathbb {F}_{2^m}\) multiplier, \(\mathbb {F}_{2^m}\) squarer, and \(\mathbb {F}_{2^m}\) adder per instance. The main problems of these designs were the costly multiplexers and the low utilization of the hardware. Therefore the design principle of ECC Breaker is a single, fully unrolled, fully pipelined iteration function. In order to keep all pipeline stages busy, the number of pipeline stages equals the amount of triples processed within ECC Breaker.
ECC Breaker Versus Related Work. (i) In the current setup, the interface between ECC Breaker and desktop is a simple, slow, serial interface. This might be a challenge for related implementations but not for ECC Breaker. The implemented iteration function does not require mechanisms to detect fruitless cycles or threads and the onchip distinguished points (triple) storage assures that only distinguished triples have to be read. (ii) The proposal by Fan et al. [12] to perform simultaneous inversion and therefore save some finitefield multipliers was not picked because their proposal introduces many multiplexers and a complex control logic, and might not fully utilize the available hardware. (iii) Further, ECC Breaker stands on its own by coming with prime field \(\mathbb {F}_n\) arithmetic which has only a minor impact on the size of the hardware (\(<\) 3 %). Additionally, it proved indispensable during development that the generated distinguished triples could be easily verified.
Generalization of ECC Breaker. Although the current version of ECC Breaker is carefully optimized for a 113bit binaryfield Koblitz curve, the underlying architecture and design approach is also suitable for larger elliptic curves, e.g., a 131bit Koblitz curve. In Sect. 6, approximations of the expected runtimes and potential costs to attack such a larger curve are given.
5.2 The Architecture
The iteration function itself consists of several components: a point addition module, a point automorphism module, two \(\mathbb {F}_{n}\) adders, two \(\mathbb {F}_{n}\) multipliers, two blockRAM based tables and several blockRAM based FIFOs to care for datadependencies.
5.3 ECC Breaker Components
Figure 2 shows the implemented point addition module which directly maps the formulas from above. Two \(\mathbb {F}_{2^m}\) multiplier, one \(\mathbb {F}_{2^m}\) inverter, and five FIFOs are necessary to compute a point addition in 184 cycles. Note that it is not possible to get rid of the costly inversion as the result of the point addition must be available in affine coordinates (cf. [24]).
Point Automorphism Module. In order to speedup Pollard’s rho algorithm for Koblitz curves, it is necessary to uniquely map \(m\) points from the same equivalence class to a single point. As ECC Breaker follows Wiener and Zuccherato’s [31] approach of interpreting the field elements as integers and comparing them, it was necessary to design a module that does \(m\) squarings and \(m\) comparisons as efficiently as possible. This module relies on normal basis representation and is depicted in Fig. 3. It converts \(x\) and \(y\) into normal basis, finds the smallest \(x\) within the normal basis, rotates \(y\) appropriately, and transforms \(x\) and \(y\) back into a canonical polynomial representation. As the \(m\) exponents of \(x\) (\(x^{(2^{i})}\)) are computed by simple rewiring (\(x\) rotated by \(i\) steps), and the smallest \(x\) is found using a binary comparison tree, no canonical \(\mathbb {F}_{2^m}\) squarer is needed.
As optimization, only the \(t=70\) most significants bits of \(x\) are compared. This means that if two numbers with \(t\) equivalent most significant bits are compared, no unique minimum is found. However, the probability for that is only \(2^{t}\). For \(i=\sqrt{\frac{\pi n}{2m}}\) iterations and \(m \cdot i\) comparisons, the probability for not selecting the smaller value is only \(1(12^{t})^{m \cdot i} = 0.00081\) for \(m=113\).
In respect to the overall design, the point automorphism module requires 14 % of all slices and is about 5.6 times smaller than the point addition module (in terms of slices). The majority of the point automorphism module is the comparator tree. The basis transformations are fairly cheap and make up only 20 % of the point automorphism module.
\(\mathbb {F}_{2^m}\) inverse. An Euclideanbased inversion algorithm is not deterministic and therefore hard to compute with a pipelined hardware module. Therefore, ECC Breaker computes the inverse using Fermat’s little theorem; an inversion by exponentiation. Fortunately, an exponentiation with \(2^{m2}\) can be computed very efficiently using Itoh and Tsujii’s [19] exponentiation trick, needing \(112\) squarers and \(8\) multipliers for \(m=113\): \(a=a^{2^11} \rightarrow a^{2^21} \rightarrow a^{2^31} \rightarrow a^{2^61} \rightarrow a^{2^71} \rightarrow a^{2^{14}1} \rightarrow a^{2^{28}1} \rightarrow a^{2^{56}1} \rightarrow a^{2^{112}1} \rightarrow a^{2^{113}2}=a^{1}\).
\(\mathbb {F}_{2^m}\) normal basis. The advantage of a normal basis is that a squaring is a simple rotation operation. The disadvantage of a normal basis is that a \(\mathbb {F}_{2^m}\) multiplication is fairly complex to compute. ECC breaker uses per default a normal, canonical polynomial representation.
Only within the point automorphism module the normal basis rendered advantageous. The necessary matrix multiplication for a basis transformation can be implemented very efficiently. As the matrix is constant, on average \(m/2\) of the input signals are xored per output signal. Based on the results from Table 2, 666 LUTs are needed per basis transformation.
Experiments show that the normal basis could also reduce the area of the consecutive squaring units within the \(\mathbb {F}_{2^m}\) inverse. The 14, 28, and 56 squarers currently need 1 582, 3 164, and 6 328 LUTs, respectively. Doing two basis transformations and a rotation within normal basis would actually save area. Also, accumulating the two transformation matrices into a single matrix would further reduce the area. However, as all squarers together only need 11 % of all slices and 10 % of all LUTs, the potential area improvement is rather limited. Therefore, contrary to [5, 12], ECC Breaker only uses a normal basis number representation within the point automorphism module.
\(\mathbb {F}_{2^m}\) multiplier. As in total ten \(\mathbb {F}_{2^m}\) multipliers are needed for the point addition module and the \(\mathbb {F}_{2^m}\) inversion module, the \(\mathbb {F}_{2^m}\) multipliers have the largest effect on the area footprint of the ECC Breaker design. For ECC Breaker, the following multiplier designs on a Virtex6 FPGA were evaluated (postsynthesis): (i) A simple 113bit parallel polynomial multiplier needs 5,497 LUTs. (ii) A Mastrovito multiplier [22] interprets the \(\mathbb {F}_{2^m}\) multiplication as matrix multiplication and performs both a polynomial multiplication and the reduction step simultaneously. Unfortunately, it needs 7,104 LUTs. A polynomial multiplication and reduction with the used pentanomial can be implemented much more efficiently. (iii) Bernstein [7] combines some refined Karatsuba and Toom recursions for his batch binary Edwards multiplier. The code from [8] for a 113bit polynomial multiplier needs 4,409 LUTs. (iv) Finally, the best results were achieved with a slightly modified binary Karatsuba multiplier, described by RodrıguezHenrıquez and Koç [27]. Their recursive algorithm was applied down to a \(16 \times 16\)bit multiplier level, which is synthesized as standard polynomial multiplier. The formulas for the resulting multiplier structure are given in Appendix B. The design only requires 3,757 LUTs. At last the design was equipped with several pipeline stages such that it can be clocked with high frequencies.
\(\mathbb {F}_{2^m}\) multiplier. Computing primefield multiplications in hardware can be a troublesome and very resourceintensive task. In the case of a Virtex6, dedicated DSP slices were used for integer multiplications. As a result, the two \(\mathbb {F}_{n}\) multipliers are very resource efficient, requirering only \(2\times 145\) DSP slices and 2 % of all slices.
6 Results and Transferability of Results
Hierarchical representation of final hardware design (post placeandroute).

ECC Breaker requires (post placeandroute) 47 % of all available slices (17,782/37,680), 41 % of all LUTs (62,657/150,720), 19 % of all registers (58,788/301,440), 37 % of all DSP macros (290/768), and less than 10 % of all block RAMs. The biggest components are the point addition module and the \(\mathbb {F}_{2^m}\) inverse module. Although already extensively optimized, the 10 \(\mathbb {F}_{2^m}\) multipliers require about 64 % of all slices. As the placeandroute tool performs optimizations across module borders, the slice counts of all components are just approximations.
6.1 ECC Breaker on Different FPGAs
As the VHDL code is portable, the suitability of ECC Breaker was also evaluated for other Xilinx FPGAs. Although size was a secondary optimization goal, ECC Breaker has been designed for a particular Virtex6 FPGA. So it does not come with surprise that ECC Breaker does only fit into certain FPGAs that come with certain features. For instance, the used ML605 development board incorporates a Virtex6 FPGA that comes with 768 DSPs (of which 290 are used).
An overview of synthesis results on different FPGAs is given in Table 3. Fortunately, the ECC Breaker design is perfectly suitable for all kind of the latest Xilinx Virtex7 (targets high performance designs), Kintex7 (targets best performance per cost), and Artix7 (low cost) FPGA devices. The Virtex7 and Kintex7 development boards can even fit multiple ECC Breaker instances. However, the Kintex7 KC705 development board fits the most instances per cost. The prices, taken from www.avnet.com [1], do not contain taxes and do not contain potential bulk discounts.
ECC Breaker on different FPGAs (post synthesis).

6.2 Expected Runtimes
Approximations of bestcase runtimes and costs for different targets.

Computing a discrete logarithm of a 113bit Koblitz curve (denoted as ECC2K112) can be done in about 3 days, when a cluster of 256 Spartan6 FPGAs would be available. With the same cluster, it would be possible to attack a 113bit binaryfield curve (denoted as ECC2112) in 28 days.
The Certicom challenge ECC2K130 targeted by Bailey et al. [5] and Fan et al. [12] can be computed in 127 days, assuming a budget of one million USD. The Certicom challenge ECC2131 can be computed in 145 days, assuming a budget of ten million USD. Targeting the smallest standardized NIST curve B163 would take 520 years, assuming a budget of one billion USD, which would be a reasonable budget of certain agencies. However, if there were one billion USD to be spent, it would be more reasonable to go for a dedicated ASIC design.
7 Lessons Learned
After spending months of research time on continuously improving the ECC Breaker design, there are some important insights that need to be discussed.
Maximum Achievable Frequency. As ECC Breaker is a fairly complex and large design, the hardware synthesizer reached its limit when it comes to maximum frequency approximations. In most cases, it was only possible to reach a fraction of the theoretically given frequency after mapping and routing.
Limited by Power Consumption. However, it was not even possible to run the ML605 development boards at maximum postmapandroute frequency. An average power consumption of about 12 A at the internal power supply resulted in a sporadic emergency switchoff with which the power controller protected itself. This is rather strange, considering that the internal power supply is designed to support 20 Ampere. Therefore it was necessary to reduce the clock frequency further in order to achieve a stable operation. Finally, ECC Breaker was running at 165 MHz even though the synthesizer approximated a maximum clock frequency of 275 MHz.
Multiple Instances per FPGA. It was previously mentioned that some FPGAs fit multiple ECC Breaker instances. However, it has yet to be tested whether two ECC Breaker instances per FPGA at lower clock frequency outperform a single instance, clocked with a higher frequency. Especially under consideration of the previously discussed routing and power problems, multiple ECC Breaker instances per FPGA might not be feasible.
Fruitless Cycles. An initial implementation made use of a negation map. However, the possibility of a fruitless cycle in which \(X_{i+1} = f(X_i)= X_i + R[j]  R[j] = X_i\) rendered the hardware implementation with negation map useless. The probability that a fruitless cycle occurs is \(p = \frac{1}{2 \cdot m \cdot r}\), \(r\) being the number of branches and \(m\) the size of the automorphism. The probability to encounter a fruitless cycle after \(i\) iterations is \(1(1p)^i\). Given 1,024 branches, an automorphism of size 113, and a clock rate of 165 MHz, the iteration function was trapped in a cycle with a probability of 99 % after less than one second. It is subject to future research how to efficiently get rid of the fruitlesscycle problem in a fully pipelined hardware design.
8 Conclusion
This work presents a circular, selfsufficient, highly pipelined, fully autonomous hardware design that was used to practically compute the discrete logarithm of a 113bit Koblitz curve within extrapolated 24 days on mere 18 Virtex6 FPGAs. However, because of the scalability and adaptability of ECC Breaker, even more complex results can be expected. This work will bring the community one step closer to solving the ECC2K130 challenge.
Footnotes
 1.
The attack actually run for 47 days but not all FPGAs were active at all times.
Notes
Acknowledgments
The authors are really grateful to Wolfgang Kastl and Jürgen Fuß from the University of Applied Sciences Upper Austria who provided sixteen ML605 boards and continuously supported us and the authors would like to thank the reviewers for their helpful comments.
This work has been supported in part by the Austrian Government through the research program FITIT under the project number 835917 (project NewP@ss), by the European Commission through the FP7 program under project number 610436 (project MATTHEW), and the Secure Information Technology CenterAustria (ASIT).
Supplementary material
References
 1.Avnet Inc, Feb 2014. http://www.avnet.com/
 2.SciEngines GmbH, Feb 2014. http://www.sciengines.com/
 3.Babbage, S., Catalano, D., Cid, C., de Weger, B., Dunkelman, O., Gehrmann, C., Granboulan, L., Güneysu, T., Hermans, J., Lange, T., Lenstra, A., Mitchell, C., Näslund, M., Nguyen, P., Paar, C., Paterson, K., Pelzl, J., Pornin, T., Preneel, B., Rechberger, C., Rijmen, V., Robshaw, M., Rupp, A., Schläffer, M., Vaudenay, S., Vercauteren, F., Ward. M.: ECRYPT II yearly report on algorithms and keysizes (2011–2012). http://www.ecrypt.eu.org/ Sep 2012
 4.Bailey, D.V., Baldwin, B., Batina, L., Bernstein, D.J., Birkner, P., Bos, J.W., van Damme, G., de Meulenaer, G., Fan, J., Güneysu, T., Gurkaynak, F., Kleinjung, T., Lange, T., Mentens, N., Paar, C., Regazzoni, F., Schwabe, P., Uhsadel, L.: The certicom challenges ECC2X. IACR cryptology ePrint archive, Report 2009/466 (2009)Google Scholar
 5.Bailey, D.V., Batina, L., Bernstein, D.J., Birkner, P., Bos, J.W., Chen, H.C., Cheng, C.M., van Damme, G., de Meulenaer, G., Perez, L.J.D., Fan, J., Güneysu, T., Gurkaynak, F., Kleinjung, T., Lange, T., Mentens, N., Niederhagen, R., Paar, C., Regazzoni, F., Schwabe, P., Uhsadel, L., Herrewege, A.V., Yang. B.Y.: Breaking ECC2K130. IACR cryptology ePrint archive, Report 2009/541 (2009)Google Scholar
 6.Barker, E., Roginsky, A.: Recommendation for cryptographic key generation. NIST Spec. Publ. 800, 133 (2012)Google Scholar
 7.Bernstein, D.J.: Batch binary edwards. In: Halevi, S. (ed.) CRYPTO 2009. LNCS, vol. 5677, pp. 317–336. Springer, Heidelberg (2009)CrossRefGoogle Scholar
 8.Bernstein, D.J.: Binary Batch Edwards 113bit Multiplier (May 2009). http://binary.cr.yp.to/bbe251/113.gz
 9.Bos, J.W., Kaihara, M.E., Kleinjung, T., Lenstra, A.K., Montgomery, P.L.: Solving a 112bit prime elliptic curve discrete logarithm problem on game consoles using sloppy reduction. Int. J. Appl. Crypt. 2(3), 212 (2012)MathSciNetCrossRefzbMATHGoogle Scholar
 10.Certicom Research: The Certicom ECC Challenge (Nov 1997). https://www.certicom.com/index.php/thecerticomeccchallenge
 11.Cohen, H., Frey, G., Avanzi, R., Doche, C., Lange, T., Nguyen, K., Vercauteren, F. (eds.): Handbook of Elliptic and Hyperelliptic Curve Cryptography: Discrete Mathematics and its Applications. Handbook of Elliptic and Hyperelliptic Curve Cryptography. Chapman and Hall/CRC, Boca Raton (2006)zbMATHGoogle Scholar
 12.Fan, J., Bailey, D.V., Batina, L., Güneysu, T., Paar, C., Verbauwhede, I.: Breaking elliptic curve cryptosystems using reconfigurable hardware. In: Field Programmable Logic and Applications (FPL), pp. 133–138. IEEE (2010)Google Scholar
 13.Frey, G., Rück, H.G.: A remark concerning mdivisibility and the discrete logarithm in the divisor class group of curves. Math. Comput. 62(206), 865–874 (1994)MathSciNetzbMATHGoogle Scholar
 14.Gallant, R., Lambert, R., Vanstone, S.: Improving the parallelized Pollard lambda search on anomalous binary curves. Math. Comput. Am. Math. Soc. 69(232), 1699–1705 (2000)MathSciNetCrossRefzbMATHGoogle Scholar
 15.Gaudry, P., Hess, F., Smart, N.P.: Constructive and destructive facets of weil descent on elliptic curves. J. Cryptol. 15(1), 19–46 (2002)MathSciNetCrossRefzbMATHGoogle Scholar
 16.Güneysu, T., Paar, C., Pelzl, J.: Attacking elliptic curve cryptosystems with specialpurpose hardware. In: FPGA, pp. 207. ACM Press (2007)Google Scholar
 17.Hankerson, D., Vanstone, S., Menezes, A.J.: Guide to Elliptic Curve Cryptography. Springer, New York (2004)zbMATHGoogle Scholar
 18.Harley, R.: Elliptic curve discrete logarithms: ECC2K108 (2000). http://cristal.inria.fr/harley/ecdl7/readMe.html
 19.Itoh, T., Tsujii, S.: A fast algorithm for computing multiplicative inverses in GF(\(2^m\)) using normal bases. Inf. Comput. 78(3), 171–177 (1988)MathSciNetCrossRefzbMATHGoogle Scholar
 20.Judge, L., Schaumont, P.: A flexible hardware ECDLP engine in bluespec. In: SpecialPurpose Hardware for Attacking Cryptographic Systems (SHARCS) (2012)Google Scholar
 21.Mane, S., Judge, L., Schaumont, P.: An integrated primefield ECDLP hardware accelerator with highperformance modular arithmetic units. In: Reconfigurable Computing and FPGAs, pp. 198–203. IEEE, Nov. 2011Google Scholar
 22.Mastrovito, E.D.: VLSI designs for multiplication over finite fields GF(\(2^m\)), pp. 297–309. In: Applied Algebra, Algebraic Algorithms and ErrorCorrecting Codes (1988)Google Scholar
 23.Menezes, A.J., Okamoto, T., Vanstone, S.A.: Reducing elliptic curve logarithms to logarithms in a finite field. Trans. Inf. Theory 39(5), 1639–1646 (1993)MathSciNetCrossRefzbMATHGoogle Scholar
 24.Meurice de Dormale, G., Bulens, P., Quisquater, J.J.: Collision search for elliptic curve discrete logarithm over GF(2\(^{m}\)) with FPGA. In: Paillier, P., Verbauwhede, I. (eds.) CHES 2007. LNCS, vol. 4727, pp. 378–393. Springer, Heidelberg (2007)CrossRefGoogle Scholar
 25.Pohlig, S.C., Hellman, M.E.: An improved algorithm for computing logarithms over \(GF(p)\) and its cryptographic significance. Trans. Inf. Theory 24(1), 106–110 (1978)MathSciNetCrossRefzbMATHGoogle Scholar
 26.Pollard, J.M.: A monte carlo method for factorization. BIT Numer. Math. 15(3), 331–334 (1975)MathSciNetCrossRefzbMATHGoogle Scholar
 27.RodrıguezHenrıquez, F., Koç, Ç.: On fully parallel karatsuba multipliers for GF(\(2^m\)), pp. 405–410. In Computer Science and Technology (2003)Google Scholar
 28.Stier, C.: Students break record by folding toilet paper 13 times. http://www.newscientist.com/blogs/nstv/2012/01/paperfoldinglimitspushed.html. Jan 2012
 29.Teske, E.: Speeding up pollard’s rho method for computing discrete logarithms. In: Buhler, J.P. (ed.) ANTS 1998. LNCS, vol. 1423, pp. 541–554. Springer, Heidelberg (1998)CrossRefGoogle Scholar
 30.van Oorschot, P.C., Wiener, M.J.: Parallel collision search with cryptanalytic applications. J. Cryptol. 12(1), 1–28 (1999)MathSciNetCrossRefzbMATHGoogle Scholar
 31.Wiener, M., Zuccherato, R.J.: Faster attacks on elliptic curve cryptosystems. In: Tavares, S., Meijer, H. (eds.) SAC 1998. LNCS, vol. 1556, pp. 190–200. Springer, Heidelberg (1999)CrossRefGoogle Scholar