International Journal of Parallel Programming, Volume 44, Issue 4, pp 735–771

Performance Estimation of Task Graphs Based on Path Profiling

  • Marco Lattuada
  • Christian Pilato
  • Fabrizio Ferrandi


Correctly estimating the speed-up of a parallel embedded application is crucial for efficiently comparing different parallelization techniques, task graph transformations, or mapping and scheduling solutions. Unfortunately, especially in the case of control-dominated applications, correlations among tasks may heavily affect the execution time of these solutions, and this is usually not properly taken into account during performance analysis. We propose a methodology that combines a single profiling run of the initial sequential specification with different partitioning, mapping, and scheduling decisions in order to better estimate the actual speed-up of these solutions. We validated our approach on a multi-processor simulation platform: experimental results show that our methodology, by effectively identifying the correlations among tasks, significantly outperforms existing approaches for speed-up estimation. Indeed, we obtained an absolute error below 5% on average, even when compiling the code with different optimization levels.
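To make the role of task correlations concrete, the following minimal sketch compares two ways of estimating the speed-up of two tasks that run in parallel. It is an illustration only and not the methodology of the paper: the `PathRecord` type, the function names, and the cycle counts are all hypothetical, and the model assumes just two tasks with no communication or scheduling overhead. Each record stands for one control path observed while profiling the sequential code, together with the time each task takes along that path.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathRecord:
    """One profiled control path (hypothetical structure, for illustration)."""
    count: int      # how many times the path was observed
    time_a: float   # cycles spent in task A along this path
    time_b: float   # cycles spent in task B along this path


def sequential_time(profile):
    # Sequentially, task A and task B run one after the other on every path.
    return sum(p.count * (p.time_a + p.time_b) for p in profile)


def parallel_time_path_aware(profile):
    # In parallel, each execution lasts as long as its slower task.
    # Taking the max per path keeps the correlation between the two tasks.
    return sum(p.count * max(p.time_a, p.time_b) for p in profile)


def parallel_time_average_based(profile):
    # Correlation-blind estimate: average each task on its own first,
    # then take the max of the two averages.
    runs = sum(p.count for p in profile)
    mean_a = sum(p.count * p.time_a for p in profile) / runs
    mean_b = sum(p.count * p.time_b for p in profile) / runs
    return runs * max(mean_a, mean_b)


# Made-up profile: whenever A takes its slow path, B takes its fast one,
# and vice versa (the two tasks are negatively correlated).
profile = [
    PathRecord(count=50, time_a=100.0, time_b=10.0),
    PathRecord(count=50, time_a=10.0, time_b=100.0),
]

seq = sequential_time(profile)
speedup_path_aware = seq / parallel_time_path_aware(profile)
speedup_average_based = seq / parallel_time_average_based(profile)

print(f"path-aware speed-up:    {speedup_path_aware:.2f}")
print(f"average-based speed-up: {speedup_average_based:.2f}")
```

With these made-up numbers the per-path estimate yields a speed-up of 1.10, while the estimate built from independent averages yields 2.00, because averaging hides the fact that one of the two tasks is always on its slow path.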


Keywords: Performance estimation, Path profiling, Hierarchical Task Graph



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Marco Lattuada (1)
  • Christian Pilato (1, 2)
  • Fabrizio Ferrandi (1)

  1. Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milano, Italy
  2. Department of Computer Science, Columbia University, New York, USA
