Profile Guided Dataflow Transformation for FPGAs and CPUs
- 259 Downloads
This paper proposes a new high-level approach for optimising field programmable gate array (FPGA) designs. FPGA designs are commonly implemented in low-level hardware description languages (HDLs), which lack the abstractions necessary for identifying opportunities for significant performance improvements. Using a computer vision case study, we show that modelling computation with dataflow abstractions enables substantial restructuring of FPGA designs before lowering to the HDL level, and also improve CPU performance. Using the CPU transformations, runtime is reduced by 43 %. Using the FPGA transformations, clock frequency is increased from 67MHz to 110MHz. Our results outperform commercial low-level HDL optimisations, showcasing dataflow program abstraction as an amenable computation model for highly effective FPGA optimisation.
KeywordsDataflow Profiling Transformations FPGA CPU
We acknowledge the support of the Engineering and Physical Research Council, grant references EP/K009931/1 (Programmable embedded platforms for remote and compute intensive image processing applications). The authors thank Blair Archibald for helpful feedback.
- 1.Adl-Tabatabai, A., Cierniak, M., Lueh, G., Parikh, V.M., & Stichnoth, J.M. (1998). Fast, effective code generation in a just-in-time java compiler. In Proceedings of the ACM SIGPLAN ’98 Conference on programming language design and implementation (PLDI), Montreal, Canada, June 17-19, 1998, pp. 280–290. ACM.Google Scholar
- 3.Bezati, E., Mattavelli, M., & Janneck, J.W. (2013). High-level synthesis of dataflow programs for signal processing systems. In International symposium on image and signal processing and analysis (ISPA), Trieste, Italy September 4-6, pp. 750–754. IEEE.Google Scholar
- 4.Bhowmik, D., Wallace, A.M., Stewart, R., Qian, X., & Michaelson, G.J. (2014). Profile driven dataflow optimisation of mean shift visual tracking. In IEEE Global conference on signal and information processing, GlobalSIP 2014, Atlanta, GA, USA, December 3-5, pp. 1–5.Google Scholar
- 5.Bonenfant, A., Chen, Z., Hammond, K., Michaelson, G., Wallace, A., & Wallace, I. (2007). Towards Resource-certified software: A formal Cost Model for Time and Its Application to an Image-Processing Example. In Proceedings ACM symposium on applied computing, pp. 1307–1314.Google Scholar
- 7.Brown, C., Loidl, H., & Hammond, K. (2011). ParaForming: Forming parallel haskell programs using novel refactoring techniques. In Peña, R., & Page, R.L. (Eds.) Trends in functional programming, 12th international symposium, TFP 2011, Madrid, Spain, May 16-18, 2011, revised selected papers, lecture notes in computer science, vol. 7193, pp. 82–97. Springer.Google Scholar
- 8.Brunet, S.C., Alberti, C., Mattavelli, M., & Janneck, J.W. (2013). Turnus: A unified dataflow design space exploration framework for heterogeneous parallel systems. In Conference on design and architectures for signal and image processing, Cagliari, Italy, October 8-10, 2013, pp. 47–54. IEEE.Google Scholar
- 12.Eker, J., & Janneck, J.W. (2003). CAL language report specification of the CAL actor language. Tech. Rep. UCB/ERL M03/48, EECS Department. Berkeley: University of California. http://www.eecs.berkeley.edu/Pubs/TechRpts/2003/4186.html.Google Scholar
- 13.Floating-point working group, IEEE computer society: IEEE standard for binary floating-point arithmetic (1985). Note: Standard 754–1985.Google Scholar
- 14.Gordon, M.I., Thies, W., Karczmarek, M., Lin, J., Meli, A.S., Lamb, A.A., Leger, C., Wong, J., Hoffmann, H., Maze, D., & Amarasinghe, S.P. (2002). A stream compiler for communication-exposed architectures. In Proceedings of the 10th international conference on architectural support for programming languages and operating systems (ASPLOS-X), San Jose, California, USA, October 5-9, 2002., pp. 291–303.Google Scholar
- 15.Govindu, G., Zhuo, L., Choi, S., & Prasanna, V.K. (2004). Analysis of High-Performance Floating-Point Arithmetic on FPGAs. In 18th International parallel and distributed processing symposium (IPDPS 2004), CD-ROM / abstracts proceedings, 26-30 April, Santa Fe, New Mexico, USA. IEEE Computer Society.Google Scholar
- 17.Intel: Intel VTune performance analyzer. https://software.intel.com/en-us/intel-vtune-amplifier-xe.
- 18.Janneck, J.W., Mattavelli, M., Raulet, M., & Wipliez, M. (2010). Reconfigurable video coding: A stream programming approach to the specification of new video coding standards. In Feng, W., & Mayer-Patel, K. (Eds.) Proceedings of the first annual ACM SIGMM conference on multimedia systems, MMSys 2010, Phoenix, Arizona, USA, February 22-23, 2010, pp. 223–234. ACM.Google Scholar
- 21.Marathe, J., & Mueller, F. (2006). Hardware Profile-guided automatic page placement for ccnuma systems. In J.Torrellas, & S.Chatterjee (Eds.) Proceedings of the ACM SIGPLAN symposium on principles and practice of parallel programming, PPOPP 2006, New York, New York, USA, March 29-31, pp. 90–99. ACM.Google Scholar
- 22.of Reading, U.: Performance evaluation of tracking and surveillance (PETS 2009) dataset (2009). http://www.cvg.rdg.ac.uk/PETS2009/.
- 24.Stewart, R., Bhowmik, D., Michaelson, G., & Wallace, A. (2015). Open access dataset for profile guided dataflow transformation for FPGAs and CPUs. doi: 10.17861/7925c541-42d9-4ded-9a01-5ac652d51353.
- 26.Underwood, K.D. (2004). FPGAs vs. CPUs: Trends in peak floating-point performance. In R. Tessier, & H. Schmit (Eds.) Proceedings of the ACM/SIGDA 12th international symposium on field programmable gate arrays, FPGA 2004, Monterey, California, USA, February 22–24, 2004, pp. 171–180. ACM.Google Scholar
- 27.Xilinx: ISE design suite. http://www.xilinx.com/products/design-tools/ise-design-suite.
- 28.Yviquel, H., Lorence, A., Jerbi, K., Cocherel, G., Sanchez, A., & Raulet, M. (2013). Orcc: Multimedia development made easy. In ACM multimedia conference, MM ’13, Barcelona, Spain, October 21–25, 2013, pp. 863–866. ACM.Google Scholar