Skip to main content
Log in

Profile Guided Dataflow Transformation for FPGAs and CPUs

  • Published:
Journal of Signal Processing Systems Aims and scope Submit manuscript


This paper proposes a new high-level approach for optimising field programmable gate array (FPGA) designs. FPGA designs are commonly implemented in low-level hardware description languages (HDLs), which lack the abstractions necessary for identifying opportunities for significant performance improvements. Using a computer vision case study, we show that modelling computation with dataflow abstractions enables substantial restructuring of FPGA designs before lowering to the HDL level, and also improve CPU performance. Using the CPU transformations, runtime is reduced by 43 %. Using the FPGA transformations, clock frequency is increased from 67MHz to 110MHz. Our results outperform commercial low-level HDL optimisations, showcasing dataflow program abstraction as an amenable computation model for highly effective FPGA optimisation.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8
Figure 9
Figure 10
Figure 11

Similar content being viewed by others




  1. Adl-Tabatabai, A., Cierniak, M., Lueh, G., Parikh, V.M., & Stichnoth, J.M. (1998). Fast, effective code generation in a just-in-time java compiler. In Proceedings of the ACM SIGPLAN ’98 Conference on programming language design and implementation (PLDI), Montreal, Canada, June 17-19, 1998, pp. 280–290. ACM.

  2. Bacon, D.F., Graham, S.L., & Sharp, O.J. (1994). Compiler transformations for high-performance computing. ACM Computing Surveys, 26(4), 345–420.

    Article  Google Scholar 

  3. Bezati, E., Mattavelli, M., & Janneck, J.W. (2013). High-level synthesis of dataflow programs for signal processing systems. In International symposium on image and signal processing and analysis (ISPA), Trieste, Italy September 4-6, pp. 750–754. IEEE.

  4. Bhowmik, D., Wallace, A.M., Stewart, R., Qian, X., & Michaelson, G.J. (2014). Profile driven dataflow optimisation of mean shift visual tracking. In IEEE Global conference on signal and information processing, GlobalSIP 2014, Atlanta, GA, USA, December 3-5, pp. 1–5.

  5. Bonenfant, A., Chen, Z., Hammond, K., Michaelson, G., Wallace, A., & Wallace, I. (2007). Towards Resource-certified software: A formal Cost Model for Time and Its Application to an Image-Processing Example. In Proceedings ACM symposium on applied computing, pp. 1307–1314.

  6. Brown, C., Danelutto, M., Hammond, K., Kilpatrick, P., & Elliott, A. (2014). Cost-directed refactoring for parallel erlang programs. International Journal of Parallel Programming, 42(4), 564– 582.

    Article  Google Scholar 

  7. Brown, C., Loidl, H., & Hammond, K. (2011). ParaForming: Forming parallel haskell programs using novel refactoring techniques. In Peña, R., & Page, R.L. (Eds.) Trends in functional programming, 12th international symposium, TFP 2011, Madrid, Spain, May 16-18, 2011, revised selected papers, lecture notes in computer science, vol. 7193, pp. 82–97. Springer.

  8. Brunet, S.C., Alberti, C., Mattavelli, M., & Janneck, J.W. (2013). Turnus: A unified dataflow design space exploration framework for heterogeneous parallel systems. In Conference on design and architectures for signal and image processing, Cagliari, Italy, October 8-10, 2013, pp. 47–54. IEEE.

  9. Chang, P.P., Mahlke, S.A., Chen, W.Y., mei, W., & Hwu, W. (1992). Profile-guided automatic inline expansion for C programs. Software, Practice Experience, 22(5), 349–369.

    Article  Google Scholar 

  10. Comaniciu, D., Ramesh, V., & Meer, P. (2003). Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(5), 564–577.

    Article  Google Scholar 

  11. Dagum, L., & Menon, R. (1998). OpenMP: An industry-standard api for shared-memory programming. IEEE Computational Science and Engineering, 5(1), 46–55.

    Article  Google Scholar 

  12. Eker, J., & Janneck, J.W. (2003). CAL language report specification of the CAL actor language. Tech. Rep. UCB/ERL M03/48, EECS Department. Berkeley: University of California.

    Google Scholar 

  13. Floating-point working group, IEEE computer society: IEEE standard for binary floating-point arithmetic (1985). Note: Standard 754–1985.

  14. Gordon, M.I., Thies, W., Karczmarek, M., Lin, J., Meli, A.S., Lamb, A.A., Leger, C., Wong, J., Hoffmann, H., Maze, D., & Amarasinghe, S.P. (2002). A stream compiler for communication-exposed architectures. In Proceedings of the 10th international conference on architectural support for programming languages and operating systems (ASPLOS-X), San Jose, California, USA, October 5-9, 2002., pp. 291–303.

  15. Govindu, G., Zhuo, L., Choi, S., & Prasanna, V.K. (2004). Analysis of High-Performance Floating-Point Arithmetic on FPGAs. In 18th International parallel and distributed processing symposium (IPDPS 2004), CD-ROM / abstracts proceedings, 26-30 April, Santa Fe, New Mexico, USA. IEEE Computer Society.

  16. Grov, G., & Michaelson, G. (2010). Hume box calculus: Robust system development through software transformation. Higher-Order and Symbolic Computation, 23(2), 191–226.

    Article  MathSciNet  MATH  Google Scholar 

  17. Intel: Intel VTune performance analyzer.

  18. Janneck, J.W., Mattavelli, M., Raulet, M., & Wipliez, M. (2010). Reconfigurable video coding: A stream programming approach to the specification of new video coding standards. In Feng, W., & Mayer-Patel, K. (Eds.) Proceedings of the first annual ACM SIGMM conference on multimedia systems, MMSys 2010, Phoenix, Arizona, USA, February 22-23, 2010, pp. 223–234. ACM.

  19. Janneck, J.W., Miller, I.D., Parlour, D.B., Roquier, G., Wipliez, M., & Raulet, M. (2011). Synthesizing hardware from dataflow programs - An MPEG-4 simple profile decoder case study. Signal Processing Systems, 63 (2), 241–249.

    Article  Google Scholar 

  20. Kuck, D.J. (1977). A survey of parallel machine organization and programming. ACM Computing Surveys, 9 (1), 29–59.

    Article  MathSciNet  MATH  Google Scholar 

  21. Marathe, J., & Mueller, F. (2006). Hardware Profile-guided automatic page placement for ccnuma systems. In J.Torrellas, & S.Chatterjee (Eds.) Proceedings of the ACM SIGPLAN symposium on principles and practice of parallel programming, PPOPP 2006, New York, New York, USA, March 29-31, pp. 90–99. ACM.

  22. of Reading, U.: Performance evaluation of tracking and surveillance (PETS 2009) dataset (2009).

  23. Scholz, S. (2003). Single Assignment C: Efficient support for high-level array operations in a functional setting. Journal of Functional Programming, 1(6), 1005—1059.

    MathSciNet  MATH  Google Scholar 

  24. Stewart, R., Bhowmik, D., Michaelson, G., & Wallace, A. (2015). Open access dataset for profile guided dataflow transformation for FPGAs and CPUs. doi:10.17861/7925c541-42d9-4ded-9a01-5ac652d51353.

  25. Trinder, P.W., Hammond, K., Loidl, H.W., & Peyton Jones, S.L. (1998). Algorithm + Strategy = Parallelism. Journal of Functional Programming, 8(1), 23–60.

    Article  MathSciNet  MATH  Google Scholar 

  26. Underwood, K.D. (2004). FPGAs vs. CPUs: Trends in peak floating-point performance. In R. Tessier, & H. Schmit (Eds.) Proceedings of the ACM/SIGDA 12th international symposium on field programmable gate arrays, FPGA 2004, Monterey, California, USA, February 22–24, 2004, pp. 171–180. ACM.

  27. Xilinx: ISE design suite.

  28. Yviquel, H., Lorence, A., Jerbi, K., Cocherel, G., Sanchez, A., & Raulet, M. (2013). Orcc: Multimedia development made easy. In ACM multimedia conference, MM ’13, Barcelona, Spain, October 21–25, 2013, pp. 863–866. ACM.

Download references


We acknowledge the support of the Engineering and Physical Research Council, grant references EP/K009931/1 (Programmable embedded platforms for remote and compute intensive image processing applications). The authors thank Blair Archibald for helpful feedback.

Author information

Authors and Affiliations


Corresponding author

Correspondence to Robert Stewart.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Stewart, R., Bhowmik, D., Wallace, A. et al. Profile Guided Dataflow Transformation for FPGAs and CPUs. J Sign Process Syst 87, 3–20 (2017).

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: