Abstract
While FPGAs have been used extensively as hardware accelerators in industrial computation [20], no theoretical model of computation has been devised for the study of FPGA-based accelerators. In this paper, we present a theoretical model of computation on a system with a conventional CPU and an FPGA, based on the word-RAM. We show several algorithms in this model that are asymptotically faster than their word-RAM counterparts. Specifically, we give an algorithm for sorting, an algorithm for evaluating an associative operation, and general techniques for speeding up some recursive algorithms and some dynamic programs. We also derive lower bounds on the running times needed to solve some problems.
This work was carried out while the authors were participants in the DIMATIA-DIMACS REU exchange program at Rutgers University.
The work was supported by the grant SVV–2017–260452.
References
Ajtai, M., Komlós, J., Szemerédi, E.: An \(O(n \log n)\) sorting network. In: Proceedings of the Fifteenth Annual ACM Symposium on Theory of Computing, STOC 1983, pp. 1–9. ACM, New York (1983). http://doi.acm.org/10.1145/800061.808726
Alam, N.: Implementation of genetic algorithms in FPGA-based reconfigurable computing systems. Master’s thesis, Clemson University (2009). https://tigerprints.clemson.edu/all_theses/618/
Batcher, K.E.: Sorting networks and their applications. In: Proceedings of the Spring Joint Computer Conference, 30 April–2 May 1968, AFIPS 1968 (Spring), pp. 307–314. ACM, New York (1968). http://doi.acm.org/10.1145/1468075.1468121
Che, S., Li, J., Sheaffer, J.W., Skadron, K., Lach, J.: Accelerating compute-intensive applications with GPUs and FPGAs. In: 2008 Symposium on Application Specific Processors, pp. 101–107, June 2008
Chodowiec, P., Gaj, K.: Very compact FPGA implementation of the AES algorithm. In: Walter, C.D., Koç, Ç.K., Paar, C. (eds.) CHES 2003. LNCS, vol. 2779, pp. 319–333. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-45238-6_26
Chrysos, G., et al.: Opportunities from the use of FPGAs as platforms for bioinformatics algorithms. In: 2012 IEEE 12th International Conference on Bioinformatics Bioengineering (BIBE), pp. 559–565, November 2012
Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 3rd edn. MIT Press, Cambridge (2009)
Demaine, E.: Cache-oblivious algorithms and data structures. EEF Summer Sch. Massive Data Sets 8(4), 1–249 (2002)
Grozea, C., Bankovic, Z., Laskov, P.: FPGA vs. multi-core CPUs vs. GPUs: hands-on experience with a sorting application. In: Keller, R., Kramer, D., Weiss, J.-P. (eds.) Facing the Multicore-Challenge. LNCS, vol. 6310, pp. 105–117. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-16233-6_12
Guo, Z., Najjar, W., Vahid, F., Vissers, K.: A quantitative analysis of the speedup factors of FPGAs over processors. In: Proceedings of the 2004 ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays, FPGA 2004, pp. 162–170. ACM, New York (2004). http://doi.acm.org/10.1145/968280.968304
Hagerup, T.: Sorting and searching on the word RAM. In: Morvan, M., Meinel, C., Krob, D. (eds.) STACS 1998. LNCS, vol. 1373, pp. 366–398. Springer, Heidelberg (1998). https://doi.org/10.1007/BFb0028575
Harper, L.H.: An \(n \log n\) lower bound on synchronous combinational complexity. Proc. Am. Math. Soc. 64(2), 300–306 (1977). http://www.jstor.org/stable/2041447
Huffstetler, J.: Intel processors and FPGAs-better together, May 2018. https://itpeernetwork.intel.com/intel-processors-fpga-better-together/
Hussain, H.M., Benkrid, K., Seker, H., Erdogan, A.T.: FPGA implementation of k-means algorithm for bioinformatics application: an accelerated approach to clustering microarray data. In: 2011 NASA/ESA Conference on Adaptive Hardware and Systems (AHS), pp. 248–255, June 2011
Karatsuba, A., Ofman, Y.: Multiplication of many-digital numbers by automatic computers. In: Dokl. Akad. Nauk SSSR, vol. 145, pp. 293–294 (1962). http://mi.mathnet.ru/dan26729
Karkooti, M., Cavallaro, J.R., Dick, C.: FPGA implementation of matrix inversion using QRD-RLS algorithm. In: Conference Record of the Thirty-Ninth Asilomar Conference on Signals, Systems and Computers 2005, pp. 1625–1629 (2005)
Ma, L., Agrawal, K., Chamberlain, R.D.: A memory access model for highly-threaded many-core architectures. Future Gener. Comput. Syst. 30, 202–215 (2014). http://www.sciencedirect.com/science/article/pii/S0167739X13001349, special Issue on Extreme Scale Parallel Architectures and Systems, Cryptography in Cloud Computing and Recent Advances in Parallel and Distributed Systems, ICPADS 2012 Selected Papers
Mahram, A.: FPGA acceleration of sequence analysis tools in bioinformatics (2013). https://open.bu.edu/handle/2144/11126
Reed, B.: The height of a random binary search tree. J. ACM 50(3), 306–332 (2003). https://doi.org/10.1145/765568.765571
Romoth, J., Porrmann, M., Rückert, U.: Survey of FPGA applications in the period 2000–2015 (Technical report) (2017)
van Rooij, J.M., Bodlaender, H.L.: Exact algorithms for dominating set. Discrete Appl. Math. 159(17), 2147–2164 (2011). http://www.sciencedirect.com/science/article/pii/S0166218X11002393
Sklavos, D.: DDR3 vs. DDR4: raw bandwidth by the numbers, September 2015. https://www.techspot.com/news/62129-ddr3-vs-ddr4-raw-bandwidth-numbers.html
Strassen, V.: Gaussian elimination is not optimal. Numer. Math. 13(4), 354–356 (1969). https://doi.org/10.1007/BF02165411
Vitter, J.S.: Algorithms and data structures for external memory. Found. Trends Theor. Comput. Sci. 2(4), 54–63 (2008). https://doi.org/10.1561/0400000014
Vollmer, H.: Introduction to Circuit Complexity: A Uniform Approach. Springer, Heidelberg (1999). https://doi.org/10.1007/978-3-662-03927-4
Woeginger, G.J.: Exact algorithms for NP-hard problems: a survey. In: Jünger, M., Reinelt, G., Rinaldi, G. (eds.) Combinatorial Optimization - Eureka, You Shrink!. LNCS, vol. 2570, pp. 185–207. Springer, Heidelberg (2003). https://doi.org/10.1007/3-540-36478-1_17. http://dl.acm.org/citation.cfm?id=885909
Zwick, U., Gupta, A.: Concrete complexity lecture notes, lecture 3 (1996). www.cs.tau.ac.il/~zwick/circ-comp-new/two.ps
A Simulation of Word-RAM
Theorem 10
A word-RAM with word size w running in time t(n) using m(n) memory can be simulated by a circuit of size \(\mathcal {O}(t(n)m(n)w\log (m(n)w))\) and depth \(\mathcal {O}(t(n) \log (m(n)w))\).
Proof
We first construct an asynchronous circuit. In the proof, we use the following two subcircuits for reading from and writing to the RAM’s memory.
The memory read subcircuit gets as input m(n)w bits consisting of m(n) words of length w, together with a number k that fits into one word when represented in binary. It returns the k-th word. Such a circuit exists with \(\mathcal {O}(m(n)w)\) gates and depth \(\mathcal {O}(\log (m(n)w))\).
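Such a read subcircuit is essentially a balanced tree of word-level multiplexers. The following Python sketch is a software analogue of that tree, not hardware; the function name and the list-based memory encoding are our own illustrative choices. Each loop iteration corresponds to one level of 2:1 word multiplexers and consumes one bit of the address k:

```python
def mux_tree_read(words, k):
    """Software analogue of the memory-read multiplexer tree: each
    iteration halves the candidate list, steered by one bit of k,
    just as one level of 2:1 word multiplexers would."""
    level = list(words)
    bit = 0
    while len(level) > 1:
        b = (k >> bit) & 1  # address bit steering this level
        level = [level[i + b] if i + b < len(level) else level[i]
                 for i in range(0, len(level), 2)]
        bit += 1
    return level[0]
```

With m(n) words the loop runs about \(\log_2 m(n)\) times, mirroring the logarithmic depth of the subcircuit.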
The memory write subcircuit gets as input m(n)w bits consisting of m(n) words of length w, together with two additional numbers k and v, each represented in binary and fitting into a word. The circuit outputs the m(n)w input bits, except that the k-th word is replaced by the value v. Such a circuit exists with \(\mathcal {O}(m(n)w)\) gates and depth \(\mathcal {O}(\log w)\).
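Unlike the read subcircuit, the write subcircuit needs no selection tree: every word position compares its hard-wired index against k independently and keeps either its old content or v, which is why the depth is only \(\mathcal {O}(\log w)\). A hypothetical Python analogue (the function name and encoding are ours):

```python
def demux_write(words, k, v):
    """Software analogue of the memory-write subcircuit: every position
    performs an independent equality test against k (all comparisons
    run in parallel in hardware) and selects v or its old word."""
    return [v if i == k else w for i, w in enumerate(words)]
```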
The circuit consists of t(n) layers, each of depth \(\mathcal {O}(\log (m(n)w))\), and each layer executes one step of the word-RAM. A layer gets as input the memory of the RAM after the execution of the previous instruction, together with the instruction pointer to the instruction to be executed; it outputs the memory after the execution of that instruction, together with a pointer to the next instruction. Each layer works in two phases.
In the first phase, we retrieve from memory the values necessary for execution of the instruction (including the address where the result is to be saved, in case of indirect access). We do this using the memory read subcircuit (or possibly two of them coupled together in case of indirect addressing). This can be done since the addresses from which the program reads can be inferred from the instruction pointer.
In the second phase, we execute all possible instructions on the values retrieved in phase 1. Note that every instruction of the word-RAM can be implemented by a circuit of depth \(\mathcal {O}(\log w)\). Each instruction has an output and optional wires for outputting the next instruction pointer (used only by jumps and conditional jumps – all other instructions output zeros). The correct instruction can be inferred from the instruction pointer, so we can use a binary selection tree to route the output of the correct instruction to the specified wires. This output value is then stored in memory using the memory write subcircuit.
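The two phases can be made concrete with a hypothetical Python sketch of a single layer for a toy instruction set. The instruction names (ADD, JMP), the operand layout, and the "evaluate everything, then select" dictionary are illustrative assumptions, not the model's actual instruction set:

```python
def simulate_layer(memory, program, ip):
    """One layer of the simulation on a toy two-instruction word-RAM.
    Phase 1 reads the operands; phase 2 evaluates every instruction,
    and a selector keeps only the result of the instruction at ip."""
    op, a, b, dst = program[ip]
    x, y = memory[a], memory[b]     # phase 1: memory read subcircuits
    results = {                     # phase 2: evaluate all instructions
        "ADD": (x + y, ip + 1),     # store x + y at dst, fall through
        "JMP": (None, a),           # jump to the (immediate) address a
    }
    value, next_ip = results[op]    # selector picks the live result
    if value is not None:           # memory write subcircuit
        memory = memory[:dst] + [value] + memory[dst + 1:]
    return memory, next_ip
```

A real layer evaluates every instruction unconditionally and selects afterwards, since a circuit cannot branch; the dictionary lookup above stands in for that selection tree.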
The first layer takes as input the input of the RAM. The last layer outputs the output of the RAM.
Finally, to make the circuit synchronous, every signal has to be delayed by at most \(\mathcal {O}(\log (m(n)w))\) steps. The number of gates therefore increases by a factor of at most \(\mathcal {O}(\log (m(n)w))\). \(\square \)
Copyright information
© 2019 Springer Nature Switzerland AG
Hora, M., Končický, V., Tětek, J. (2019). Theoretical Model of Computation and Algorithms for FPGA-Based Hardware Accelerators. In: Gopal, T., Watada, J. (eds) Theory and Applications of Models of Computation. TAMC 2019. Lecture Notes in Computer Science(), vol 11436. Springer, Cham. https://doi.org/10.1007/978-3-030-14812-6_18
Print ISBN: 978-3-030-14811-9
Online ISBN: 978-3-030-14812-6