Towards Reversible Basic Linear Algebra Subprograms: A Performance Study

Part of the Lecture Notes in Computer Science book series (LNCS, volume 8911)


Problems such as fault tolerance and scalable synchronization can be solved efficiently by exploiting the reversibility of applications. Making applications reversible by relying on computation rather than on memory is attractive for large-scale parallel computing, especially for the next generation of supercomputers, in which memory is expensive in terms of latency, energy, and price. As a step in this direction, a case study is presented in reversing a widely used computational core, the Basic Linear Algebra Subprograms (BLAS). A new Reversible BLAS (RBLAS) library interface has been designed, and a prototype has been implemented with two modes: (1) a memory mode, in which reversibility is obtained by checkpointing to memory, and (2) a computational mode, in which nothing is saved and restoration is performed entirely via inverse computation. The article focuses on detailed performance benchmarking to evaluate runtime dynamics and performance effects, comparing reversible computation with checkpointing on both traditional CPU platforms and recent GPU accelerator platforms. For BLAS Level-1 subprograms, the data indicate over an order of magnitude speedup of reversible computation relative to checkpointing. For BLAS Level-2 and Level-3 subprograms, a more complex tradeoff is observed between reversible computation and checkpointing, depending on the computational and memory complexities of the subprograms.


Keywords: Reversible computation · Linear algebra · Checkpointing · Runtime performance · Memory effects



Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  1. Oak Ridge National Laboratory, Oak Ridge, USA
