Theoretical Model of Computation and Algorithms for FPGA-Based Hardware Accelerators

Conference paper in: Theory and Applications of Models of Computation (TAMC 2019)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11436)

Abstract

While FPGAs have been used extensively as hardware accelerators in industrial computation [20], no theoretical model of computation has been devised for the study of FPGA-based accelerators. In this paper, we present a theoretical model of computation on a system with a conventional CPU and an FPGA, based on the word-RAM model. We show several algorithms in this model that are asymptotically faster than their word-RAM counterparts. Specifically, we give an algorithm for sorting, an algorithm for evaluating an associative operation, and general techniques for speeding up some recursive algorithms and some dynamic programs. We also derive lower bounds on the running times needed to solve some problems.

This work was carried out while the authors were participants in the DIMATIA-DIMACS REU exchange program at Rutgers University.

The work was supported by the grant SVV–2017–260452.

References

  1. Ajtai, M., Komlós, J., Szemerédi, E.: An O(n log n) sorting network. In: Proceedings of the Fifteenth Annual ACM Symposium on Theory of Computing, STOC 1983, pp. 1–9. ACM, New York (1983). http://doi.acm.org/10.1145/800061.808726

  2. Alam, N.: Implementation of genetic algorithms in FPGA-based reconfigurable computing systems. Master’s thesis, Clemson University (2009). https://tigerprints.clemson.edu/all_theses/618/

  3. Batcher, K.E.: Sorting networks and their applications. In: Proceedings of the Spring Joint Computer Conference, 30 April–2 May 1968, AFIPS 1968 (Spring), pp. 307–314. ACM, New York (1968). http://doi.acm.org/10.1145/1468075.1468121

  4. Che, S., Li, J., Sheaffer, J.W., Skadron, K., Lach, J.: Accelerating compute-intensive applications with GPUs and FPGAs. In: 2008 Symposium on Application Specific Processors, pp. 101–107, June 2008

  5. Chodowiec, P., Gaj, K.: Very compact FPGA implementation of the AES algorithm. In: Walter, C.D., Koç, Ç.K., Paar, C. (eds.) CHES 2003. LNCS, vol. 2779, pp. 319–333. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-45238-6_26

  6. Chrysos, G., et al.: Opportunities from the use of FPGAs as platforms for bioinformatics algorithms. In: 2012 IEEE 12th International Conference on Bioinformatics Bioengineering (BIBE), pp. 559–565, November 2012

  7. Cormen, T.H., Leiserson, C.E.: Introduction to Algorithms, 3rd edn. MIT Press, Cambridge (2009)

  8. Demaine, E.: Cache-oblivious algorithms and data structures. EEF Summer Sch. Massive Data Sets 8(4), 1–249 (2002)

  9. Grozea, C., Bankovic, Z., Laskov, P.: FPGA vs. multi-core CPUs vs. GPUs: hands-on experience with a sorting application. In: Keller, R., Kramer, D., Weiss, J.-P. (eds.) Facing the Multicore-Challenge. LNCS, vol. 6310, pp. 105–117. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-16233-6_12

  10. Guo, Z., Najjar, W., Vahid, F., Vissers, K.: A quantitative analysis of the speedup factors of FPGAs over processors. In: Proceedings of the 2004 ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays, FPGA 2004, pp. 162–170. ACM, New York (2004). http://doi.acm.org/10.1145/968280.968304

  11. Hagerup, T.: Sorting and searching on the word RAM. In: Morvan, M., Meinel, C., Krob, D. (eds.) STACS 1998. LNCS, vol. 1373, pp. 366–398. Springer, Heidelberg (1998). https://doi.org/10.1007/BFb0028575

  12. Harper, L.H.: An \(n \log n\) lower bound on synchronous combinational complexity. Proc. Am. Math. Soc. 64(2), 300–306 (1977). http://www.jstor.org/stable/2041447

  13. Huffstetler, J.: Intel processors and FPGAs-better together, May 2018. https://itpeernetwork.intel.com/intel-processors-fpga-better-together/

  14. Hussain, H.M., Benkrid, K., Seker, H., Erdogan, A.T.: FPGA implementation of k-means algorithm for bioinformatics application: an accelerated approach to clustering microarray data. In: 2011 NASA/ESA Conference on Adaptive Hardware and Systems (AHS), pp. 248–255, June 2011

  15. Karatsuba, A., Ofman, Y.: Multiplication of many-digital numbers by automatic computers. In: Dokl. Akad. Nauk SSSR, vol. 145, pp. 293–294 (1962). http://mi.mathnet.ru/dan26729

  16. Karkooti, M., Cavallaro, J.R., Dick, C.: FPGA implementation of matrix inversion using QRD-RLS algorithm. In: Conference Record of the Thirty-Ninth Asilomar Conference on Signals, Systems and Computers 2005, pp. 1625–1629 (2005)

  17. Ma, L., Agrawal, K., Chamberlain, R.D.: A memory access model for highly-threaded many-core architectures. Future Gener. Comput. Syst. 30, 202–215 (2014). http://www.sciencedirect.com/science/article/pii/S0167739X13001349. Special issue on Extreme Scale Parallel Architectures and Systems, Cryptography in Cloud Computing and Recent Advances in Parallel and Distributed Systems (ICPADS 2012 selected papers)

  18. Mahram, A.: FPGA acceleration of sequence analysis tools in bioinformatics (2013). https://open.bu.edu/handle/2144/11126

  19. Reed, B.: The height of a random binary search tree. J. ACM 50(3), 306–332 (2003). https://doi.org/10.1145/765568.765571

  20. Romoth, J., Porrmann, M., Rückert, U.: Survey of FPGA applications in the period 2000–2015 (Technical report) (2017)

  21. van Rooij, J.M., Bodlaender, H.L.: Exact algorithms for dominating set. Discrete Appl. Math. 159(17), 2147–2164 (2011). http://www.sciencedirect.com/science/article/pii/S0166218X11002393

  22. Sklavos, D.: DDR3 vs. DDR4: raw bandwidth by the numbers, September 2015. https://www.techspot.com/news/62129-ddr3-vs-ddr4-raw-bandwidth-numbers.html

  23. Strassen, V.: Gaussian elimination is not optimal. Numer. Math. 13(4), 354–356 (1969). https://doi.org/10.1007/BF02165411

  24. Vitter, J.S.: Algorithms and data structures for external memory. Found. Trends Theor. Comput. Sci. 2(4), 54–63 (2008). https://doi.org/10.1561/0400000014

  25. Vollmer, H.: Introduction to Circuit Complexity: A Uniform Approach. Springer, Heidelberg (1999). https://doi.org/10.1007/978-3-662-03927-4

  26. Woeginger, G.J.: Exact algorithms for NP-hard problems: a survey. In: Jünger, M., Reinelt, G., Rinaldi, G. (eds.) Combinatorial Optimization - Eureka, You Shrink!. LNCS, vol. 2570, pp. 185–207. Springer, Heidelberg (2003). https://doi.org/10.1007/3-540-36478-1_17. http://dl.acm.org/citation.cfm?id=885909

  27. Zwick, U., Gupta, A.: Concrete complexity lecture notes, lecture 3 (1996). www.cs.tau.ac.il/~zwick/circ-comp-new/two.ps

Author information

Corresponding author: Jakub Tětek

A Simulation of Word-RAM

Theorem 10

A word-RAM with word size w running in time t(n) and using m(n) words of memory can be simulated by a circuit of size \(\mathcal {O}(t(n)m(n)w\log (m(n)w))\) and depth \(\mathcal {O}(t(n) \log (m(n)w))\).
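For a sense of scale, the following instantiation is illustrative only (the parameter choices are ours, not the paper's): for a computation with \(t(n) = \mathcal {O}(n \log n)\), \(m(n) = \mathcal {O}(n)\) and \(w = \varTheta (\log n)\), the theorem yields

\[
\text{size} = \mathcal {O}\bigl(t(n)\,m(n)\,w \log (m(n)w)\bigr) = \mathcal {O}\bigl(n \log n \cdot n \cdot \log n \cdot \log (n \log n)\bigr) = \mathcal {O}(n^2 \log ^3 n),
\]
\[
\text{depth} = \mathcal {O}\bigl(t(n) \log (m(n)w)\bigr) = \mathcal {O}\bigl(n \log n \cdot \log (n \log n)\bigr) = \mathcal {O}(n \log ^2 n).
\]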

Proof

We first construct an asynchronous circuit; at the end of the proof we turn it into a synchronous one by delaying signals. The proof uses the following two subcircuits for reading from and writing to the RAM’s memory.

The memory read subcircuit gets as input m(n)w bits consisting of m(n) words of length w, together with a number k which fits into one word when represented in binary. It returns the k’th word. Such a circuit exists with \(\mathcal {O}(m(n)w)\) gates and depth \(\mathcal {O}(\log (m(n)w))\).
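As a concrete illustration (our own sketch, not necessarily the construction the authors have in mind), one standard way to realize such a read subcircuit is a multiplexer tree over the m(n) words, selected level by level by the bits of k. The Python model below builds the tree out of AND/OR/NOT gates and reports its gate count and depth; all class and function names are ours.

# Illustrative gate-level model of the memory-read subcircuit (assumption: a
# multiplexer tree). Selecting the k-th of m words of w bits uses O(m*w) gates
# and depth O(log m) <= O(log(m*w)).

GATES = []

class Gate:
    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs
        self.depth = 0 if op == "in" else 1 + max(g.depth for g in inputs)
        GATES.append(self)

def mux(sel, word0, word1):
    """Bitwise 2-to-1 multiplexer: returns word1 if sel is 1, else word0."""
    not_sel = Gate("not", sel)
    return [Gate("or", Gate("and", not_sel, a), Gate("and", sel, b))
            for a, b in zip(word0, word1)]

def read_subcircuit(words, k_bits):
    """Halve the list of candidate words once per address bit (LSB first)."""
    level = list(words)
    for bit in k_bits:
        level = [mux(bit, level[i], level[i + 1] if i + 1 < len(level) else level[i])
                 for i in range(0, len(level), 2)]
    return level[0]

if __name__ == "__main__":
    m, w = 16, 8
    words = [[Gate("in") for _ in range(w)] for _ in range(m)]
    k_bits = [Gate("in") for _ in range(4)]          # 4 = log2(m), LSB first
    out = read_subcircuit(words, k_bits)
    gates = sum(1 for g in GATES if g.op != "in")
    print(gates, max(g.depth for g in out))          # O(m*w) gates, O(log m) depth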

The memory write subcircuit gets as input m(n)w bits consisting of m(n) words of length w, together with numbers k and v, each represented in binary and fitting into one word. The circuit outputs the m(n)w input bits, except that the k’th word is replaced by the value v. Such a circuit exists with \(\mathcal {O}(m(n)w)\) gates and depth \(\mathcal {O}(\log w)\).
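A matching sketch for the write subcircuit (again our illustration, with names of our choosing): every word is guarded by an equality test of its index against k, built as a balanced AND tree over the w address bits, and the word is then either kept or replaced by v. The equality tests dominate the depth, matching the \(\mathcal {O}(\log w)\) bound above.

# Illustrative gate-level model of the memory-write subcircuit (our sketch).
# Word i of the output equals v if k == i and the old word otherwise; the
# equality test is a balanced AND tree over the w address bits (LSB first),
# so the depth is O(log w).

class Gate:
    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs
        self.depth = 0 if op == "in" else 1 + max(g.depth for g in inputs)

def and_tree(wires):
    """Balanced AND over the given wires, depth O(log(len(wires)))."""
    while len(wires) > 1:
        wires = [Gate("and", wires[i], wires[i + 1]) if i + 1 < len(wires) else wires[i]
                 for i in range(0, len(wires), 2)]
    return wires[0]

def equals_const(k_bits, i):
    """Single wire that is 1 iff the address k equals the constant index i."""
    literals = [bit if (i >> j) & 1 else Gate("not", bit) for j, bit in enumerate(k_bits)]
    return and_tree(literals)

def write_subcircuit(words, k_bits, v):
    out = []
    for i, word in enumerate(words):
        sel = equals_const(k_bits, i)      # 1 exactly for the k-th word
        not_sel = Gate("not", sel)
        out.append([Gate("or", Gate("and", not_sel, old), Gate("and", sel, new))
                    for old, new in zip(word, v)])
    return out

Per word this sketch uses \(\mathcal {O}(w)\) gates, so the total over m(n) words is \(\mathcal {O}(m(n)w)\) gates, consistent with the stated bound.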

The circuit consists of t(n) layers, each of depth \(\mathcal {O}(\log (m(n)w))\), and each layer executes one step of the word-RAM. A layer gets as input the memory of the RAM after the execution of the previous instruction together with the instruction pointer of the instruction to be executed, and it outputs the memory after the execution of this instruction together with a pointer to the next instruction. Each layer works in two phases.

In the first phase, we retrieve from memory the values needed to execute the instruction (including, in the case of indirect access, the address where the result is to be stored). We do this using the memory read subcircuit (or two of them coupled together in the case of indirect addressing). This is possible because the addresses from which the program reads can be inferred from the instruction pointer.

In the second phase, we execute all possible instructions on the values retrieved in the first phase. Note that every instruction of the word-RAM can be implemented by a circuit of depth \(\mathcal {O}(\log w)\). Each instruction has an output and optional wires for outputting the next instruction pointer (these are used only by jump and conditional jump; all other instructions output zeros on them). Which instruction is the correct one can be inferred from the instruction pointer, so we can use a binary tree to route the output of the correct instruction to a fixed set of wires. This output value is then stored in memory using the memory write subcircuit. A small software analogue of one layer is sketched below.
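To make the two phases concrete, the following software analogue (ours, with a hypothetical two-instruction set) shows what a single layer computes: given the memory and the instruction pointer, it produces the updated memory and the next instruction pointer. In the actual circuit all operations are evaluated in parallel and the instruction pointer selects among their results; in software we simply branch.

# Software analogue of one layer (illustrative; the instruction set is hypothetical).
def run_layer(program, memory, ip):
    op, dst, a, b = program[ip]            # the instruction executed is fixed by ip
    x, y = memory[a], memory[b]            # phase 1: fetch operands (memory reads)
    # phase 2: the circuit evaluates every possible instruction and routes the
    # result selected by the instruction pointer to the memory write subcircuit
    if op == "add":
        memory[dst] = x + y                # write result, continue with next instruction
        return memory, ip + 1
    if op == "jzero":                      # conditional jump: writes nothing ("outputs zeros")
        return memory, (dst if x == 0 else ip + 1)
    raise ValueError("unknown instruction")

# Example: memory[2] = memory[0] + memory[1], then jump to 0 if memory[2] == 0.
program = [("add", 2, 0, 1), ("jzero", 0, 2, 2)]
memory, ip = [3, 4, 0], 0
for _ in range(2):
    memory, ip = run_layer(program, memory, ip)
print(memory, ip)                          # [3, 4, 7] 2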

The first layer takes as input the input of the RAM. The last layer outputs the output of the RAM.

To turn the circuit into a synchronous one, every signal has to be delayed by at most \(\mathcal {O}(\log (m(n)w))\) steps. The number of gates is therefore increased by a factor of at most \(\mathcal {O}(\log (m(n)w))\), which gives the claimed size bound.    \(\square \)

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Hora, M., Končický, V., Tětek, J. (2019). Theoretical Model of Computation and Algorithms for FPGA-Based Hardware Accelerators. In: Gopal, T., Watada, J. (eds) Theory and Applications of Models of Computation. TAMC 2019. Lecture Notes in Computer Science, vol. 11436. Springer, Cham. https://doi.org/10.1007/978-3-030-14812-6_18

  • DOI: https://doi.org/10.1007/978-3-030-14812-6_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-14811-9

  • Online ISBN: 978-3-030-14812-6

  • eBook Packages: Computer Science, Computer Science (R0)
