Abstract
In recent years, the term "portability" has acquired new meanings: research communities are discussing how to measure the degree to which an application (or library, programming model, algorithm implementation, etc.) is "performance portable". The term "performance portability" has been used informally in computing communities to refer, in substance, to: (1) the ability to run one application across multiple hardware platforms; and (2) achieving some reasonable level of performance on those platforms [1, 2]. Among the efforts related to the performance portability issue, we note the annual performance portability workshops organized by the US Department of Energy [3]. This article adds a new, more theoretical point of view to the performance portability issue: it shows the convenience of separating the algorithm proper from the overhead, and explores the different factors that introduce different kinds of overhead. The paper develops a theoretical framework leading to a definition of the execution time of a software, but that definition is not the point. The aim is to show and understand the link between that execution time and the beginning of the design, to expose which part of any program is really environment-sensitive and to exclude from performance portability formulas everything that, as theoretically shown, is not going to change.
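As a concrete illustration of point (2), the metric proposed in [1] defines performance portability as the harmonic mean of an application's performance efficiencies across a set of platforms, and as zero if any platform in the set is unsupported. A minimal sketch, assuming each efficiency is given as a fraction in (0, 1] of some architectural peak or best-known performance:

```python
def performance_portability(efficiencies):
    """Harmonic-mean performance portability metric in the spirit of [1].

    `efficiencies` holds one performance efficiency per platform in the
    chosen set H (a fraction of peak or of best-known performance).
    If the application fails to run on any platform (efficiency None or 0),
    the metric is 0 by definition.
    """
    if any(e is None or e == 0 for e in efficiencies):
        return 0.0
    # Harmonic mean: |H| divided by the sum of reciprocal efficiencies.
    return len(efficiencies) / sum(1.0 / e for e in efficiencies)
```

Note how the harmonic mean penalizes a single poorly performing platform far more than an arithmetic mean would, which is why it was chosen in [1].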
Notes
- 1.
Decomposition matrix is the name we preferred in this work, but in [8] it is referred to as the dependency matrix.
- 2.
These can be basic operations (arithmetic, \(\ldots \)), special function evaluations (\(\sin ,\cos ,\ldots \)), solvers (integrals, systems of equations, non-linear equations, \(\ldots \)).
- 3.
For the general case, look at [12].
- 4.
This assumption is necessary to compare two algorithms.
- 5.
Scale Up is defined in [8] as the ratio \(SC(D_{k_i},D_{k_j}):=\frac{k_i}{k_j}\), and it measures the difference between the two algorithms with respect to the number of operations they perform to solve the same problem.
- 6.
This is an initial, not realistic, assumption.
- 7.
There is no loss of generality because any operator can be rewritten as a number of elementary operators with execution time \(t_{calc}\).
- 8.
This is a simplified and very general logical description of the behavior of a memory hierarchy, useful for the aims of the framework. Of course it could be adapted to an actual architecture, but the following definitions still hold.
- 9.
Level 0 is the fastest one.
- 10.
In general \(c_{AM}\le nd\), but we can assume \(c_{AM}=nd\) without loss of generality.
- 11.
On average.
- 12.
On average.
- 13.
For example: in the case of an algorithm like the one in [13], where the architecture is a heterogeneous GPU- and multicore-based system, we can build different matrices for different parts of the algorithm.
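The Scale Up ratio recalled in note 5 can be sketched directly; the operation counts \(k_i\) and \(k_j\) are assumed to be known for the two algorithm decompositions being compared:

```python
def scale_up(k_i, k_j):
    """Scale Up SC(D_ki, D_kj) := k_i / k_j, as defined in [8]:
    the ratio of the operation counts of two algorithms that
    solve the same problem."""
    if k_j <= 0:
        raise ValueError("operation count k_j must be positive")
    return k_i / k_j
```

For example, `scale_up(2000, 1000)` returns `2.0`, meaning the first algorithm performs twice as many operations as the second to solve the same problem.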
References
Pennycook, S.J., Sewall, J.D., Lee, V.W.: Implications of a metric for performance portability. Future Gener. Comput. Syst. 92, 947–958 (2017). https://doi.org/10.1016/j.future.2017.08.007
Kwack, J., et al.: Evaluating performance portability of HPC applications and benchmarks across diverse HPC architectures. Exascale Computing Project (ECP) Webinar. https://www.exascaleproject.org/event/performance-portability-evaluation/. Accessed 20 May 2020
DOE centres of excellence performance portability meeting: post-meeting report. Technical report LLNL-TR-700962. Lawrence Livermore National Laboratory, Livermore (2016). https://asc.llnl.gov/sites/asc/files/2020-09/COE-PP-Meeting-2016-FinalReport_0.pdf
Carracciuolo, L., Mele, V., Szustak, L.: About the granularity portability of block-based Krylov methods in heterogeneous computing environments. Concurr. Comput. Pract. Exp. 33(4), e6008 (2021). https://doi.org/10.1002/cpe.6008
Neely, J.R.: DOE centers of excellence performance portability meeting. Technical report LLNL-TR-700962, 4. Lawrence Livermore National Laboratory (2016). https://doi.org/10.2172/1332474
Edwards, H.C., Trott, C.R., Sunderland, D.: Kokkos: enabling manycore performance portability through polymorphic memory access patterns. J. Parallel Distrib. Comput. 74(12), 3202–3216 (2014). https://doi.org/10.1016/j.jpdc.2014.07.003
Pennycook, J., Sewall, J., Jacobsen, D.W., Deakin, T., McIntosh-Smith, S.N.: Navigating performance, portability and productivity. Comput. Sci. Eng. 23(5), 28–38 (2021). https://doi.org/10.1109/MCSE.2021.3097276
Mele, V., Romano, D., Constantinescu, E.M., Carracciuolo, L., D’Amore, L.: Performance evaluation for a PETSc parallel-in-time solver based on the MGRIT algorithm. In: Mencagli, G., et al. (eds.) Euro-Par 2018. LNCS, vol. 11339, pp. 716–728. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-10549-5_56
D’Amore, L., Mele, V., Laccetti, G., Murli, A.: Mathematical approach to the performance evaluation of matrix multiply algorithm. In: Wyrzykowski, R., Deelman, E., Dongarra, J., Karczewski, K., Kitowski, J., Wiatr, K. (eds.) PPAM 2015. LNCS, vol. 9574, pp. 25–34. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-32152-3_3
Mele, V., Constantinescu, E.M., Carracciuolo, L., D'Amore, L.: A PETSc parallel-in-time solver based on MGRIT algorithm. Concurr. Comput. Pract. Exp. 30(24), e4928 (2018). https://doi.org/10.1002/cpe.4928
D'Amore, L., Mele, V., Romano, D., Laccetti, G.: Multilevel algebraic approach for performance analysis of parallel algorithms. Comput. Inform. 38(4), 817–850 (2019). https://doi.org/10.31577/cai_2019_4_817
Romano, D., Lapegna, M., Mele, V., Laccetti, G.: Designing a GPU-parallel algorithm for raw SAR data compression: a focus on parallel performance estimation. Future Gener. Comput. Syst. 112(6), 695–708 (2020). https://doi.org/10.1016/j.future.2020.06.027
Laccetti, G., Lapegna, M., Mele, V., Romano, D.: A study on adaptive algorithms for numerical quadrature on heterogeneous GPU and multicore based systems. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Waśniewski, J. (eds.) PPAM 2013. LNCS, vol. 8384, pp. 704–713. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-642-55224-3_66
Laccetti, G., Lapegna, M., Mele, V.: A loosely coordinated model for heap-based priority queues in multicore environments. Int. J. Parallel Prog. 44(4), 901–921 (2015). https://doi.org/10.1007/s10766-015-0398-x
Laccetti, G., Lapegna, M., Mele, V., Montella, R.: An adaptive algorithm for high-dimensional integrals on heterogeneous CPU-GPU systems. Concurr. Comput. Pract. Exp. 31(19), e4945 (2019). https://doi.org/10.1002/cpe.4945
Montella, R., Giunta, G., Laccetti, G.: Virtualizing high-end GPGPUs on ARM clusters for the next generation of high performance cloud computing. Cluster Comput. 17(1), 139–152 (2014). https://doi.org/10.1007/s10586-013-0341-0
Marcellino, L., et al.: Using GPGPU accelerated interpolation algorithms for marine bathymetry processing with on-premises and cloud based computational resources. In: Wyrzykowski, R., Dongarra, J., Deelman, E., Karczewski, K. (eds.) PPAM 2017. LNCS, vol. 10778, pp. 14–24. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-78054-2_2
D’Amore, L., Campagna, R., Mele, V., Murli, A., Rizzardi, M.: ReLaTIve. An Ansi C90 software package for the Real Laplace Transform Inversion. Numerical Algorithms 63(1), 187–211 (2013). https://doi.org/10.1007/s11075-012-9636-0
D'Amore, L., Campagna, R., Mele, V., Murli, A.: Algorithm 946: ReLIADiff. A C++ software package for real Laplace transform inversion based on automatic differentiation. ACM Trans. Math. Softw. 40(4), Article 31, 31:1–31:20 (2014). https://doi.org/10.1145/2616971
D’Amore, L., Mele, V., Campagna, R.: Quality assurance of Gaver’s formula for multi-precision Laplace transform inversion in real case. Inverse Probl. Sci. Eng. 26(4), 553–580 (2018). https://doi.org/10.1080/17415977.2017.1322963
Tjaden, G.S., Flynn, M.J.: Detection and parallel execution of independent instructions. IEEE Trans. Comput. C-19(10), 889–895 (1970). https://doi.org/10.1109/T-C.1970.222795
Flatt, H.P., Kennedy, K.: Performance of parallel processors. Parallel Comput. 12(1), 1–20 (1989). https://doi.org/10.1016/0167-8191(89)90003-3
Maddalena, L., Petrosino, A., Laccetti, G.: A fusion-based approach to digital movie restoration. Pattern Recogn. 42(7), 1485–1495 (2009). https://doi.org/10.1016/j.patcog.2008.10.026
Hockney, R.W.: The Science of Computer Benchmarking. SIAM (1996)
Ballard, G., Demmel, J., Knight, N.: Avoiding communication in successive band reduction. ACM Trans. Parallel Comput. 1(2), 37 (2015). Article 11. https://doi.org/10.1145/2686877
Koanantakool, P., et al.: Communication-avoiding parallel sparse-dense matrix-matrix multiplication. In: IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 842–853 (2016). https://doi.org/10.1109/IPDPS.2016.117
Sao, P., Kannan, R., Li, X.S., Vuduc, R.: A communication-avoiding 3D sparse triangular solver. In: Proceedings of the ACM International Conference on Supercomputing (ICS 2019), pp. 127–137. Association for Computing Machinery, New York (2019). https://doi.org/10.1145/3330345.3330357
Kennedy, K., McKinley, K.S.: Optimizing for parallelism and data locality. In: Proceedings of the 6th International Conference on Supercomputing (ICS 1992), pp. 323–334. Association for Computing Machinery, New York (1992). https://doi.org/10.1145/143369.143427
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Mele, V., Laccetti, G. (2023). Algorithm and Software Overhead: A Theoretical Approach to Performance Portability. In: Wyrzykowski, R., Dongarra, J., Deelman, E., Karczewski, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2022. Lecture Notes in Computer Science, vol 13827. Springer, Cham. https://doi.org/10.1007/978-3-031-30445-3_8
Print ISBN: 978-3-031-30444-6
Online ISBN: 978-3-031-30445-3