
MATE, a Unified Model for Communication-Tolerant Scientific Applications

  • Conference paper
Languages and Compilers for Parallel Computing (LCPC 2018)


We present MATE, a model for developing communication-tolerant scientific applications. MATE employs a combination of mechanisms to reduce or hide the cost of network and intra-node data movement. While previous approaches have been proposed to reduce both sources of communication overhead separately, the contribution of MATE is demonstrating the symbiotic effect of reducing both forms of data movement taken together. Furthermore, MATE provides these benefits within a single unified model, as opposed to hybrid (e.g., MPI+X) approaches. We demonstrate MATE’s effectiveness in reducing the cost of communication in three scientific computing motifs on up to 32k cores of the NERSC Cori Phase I supercomputer.



  1. Other approaches include communication reordering [27], concurrency optimizations [16], and communication avoiding algorithms [13].

  2. Sam Williams, private conversation, 2018.






  5. Cray MPI.

  6. Intel MPI library.

  7. MPICH library.

  8. MVAPICH library.

  9. Open MPI library.

  10. Arvind, K., Nikhil, R.S.: Executing a program on the MIT tagged-token dataflow architecture. IEEE Trans. Comput. 39(3), 300–318 (1990).


  11. Babb, R.G.: Parallel processing with large-grain data flow technique. Computer 17(7), 55–61 (1984)


  12. Bachan, J., et al.: The UPC++ PGAS library for exascale computing: extended abstract. In: PAW17: Second Annual PGAS Applications Workshop, p. 4. ACM, New York, 12–17 November 2017.

  13. Ballard, G., Carson, E., Demmel, J., Hoemmen, M., Knight, N., Schwartz, O.: Communication lower bounds and optimal algorithms for numerical linear algebra. Acta Numerica 23, 1–155 (2014)


  14. Barrett, R.F., Stark, D.T., Vaughan, C.T., Grant, R.E., Olivier, S.L., Pedretti, K.T.: Toward an evolutionary task parallel integrated MPI + X programming model. In: Proceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores, PMAM 2015, pp. 30–39. ACM, New York (2015).

  15. Cannon, L.E.: A cellular computer to implement the Kalman filter algorithm. Ph.D. thesis, Montana State University, Bozeman, MT, USA (1969)

  16. Chaimov, N., Ibrahim, K.Z., Williams, S., Iancu, C.: Exploiting communication concurrency on high performance computing systems. In: Proceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores, PMAM 2015, pp. 132–143. ACM, New York (2015).

  17. Debudaj-Grabysz, A., Rabenseifner, R.: Nesting OpenMP in MPI to implement a hybrid communication method of parallel simulated annealing on a cluster of SMP nodes. In: Di Martino, B., Kranzlmüller, D., Dongarra, J. (eds.) EuroPVM/MPI 2005. LNCS, vol. 3666, pp. 18–27. Springer, Heidelberg (2005).


  18. Dennis, J.: Data flow supercomputers. IEEE Comput. 13(11), 48–56 (1980)


  19. Hoefler, T., et al.: MPI + MPI: a new hybrid approach to parallel programming with MPI plus shared memory. Computing 95, 1121–1136 (2013).


  20. Huang, C., Lawlor, O., Kalé, L.V.: Adaptive MPI. In: Rauchwerger, L. (ed.) LCPC 2003. LNCS, vol. 2958, pp. 306–322. Springer, Heidelberg (2004).


  21. Iancu, C., Hofmeyr, S., Blagojević, F., Zheng, Y.: Oversubscription on multicore processors. In: 2010 IEEE International Symposium on Parallel and Distributed Processing (IPDPS), pp. 1–11, April 2010.

  22. Quinlan, D.: ROSE: compiler support for object-oriented frameworks. Parallel Process. Lett. 10, 215–226 (2000)


  23. Kalé, L.V.: The virtualization approach to parallel programming: runtime optimizations and the state of the art. In: Los Alamos Computer Science Institute Symposium-LACSI (2002)


  24. Kale, L.V., Krishnan, S.: CHARM++: a portable concurrent object oriented system based on C++. In: Proceedings of the Eighth Annual Conference on Object-oriented Programming Systems, Languages, and Applications, OOPSLA 1993, pp. 91–108. ACM, New York (1993).

  25. Kamal, H., Wagner, A.: FG-MPI: fine-grain MPI for multicore and clusters. In: 2010 IEEE International Symposium on Parallel and Distributed Processing, Workshops and PhD Forum (IPDPSW), pp. 1–8, April 2010.

  26. Krishnamurthy, A., et al.: Parallel programming in split-C. In: Proceedings of the 1993 ACM/IEEE Conference on Supercomputing, Supercomputing 1993, pp. 262–273. ACM, New York (1993).

  27. Lavrijsen, W., Iancu, C.: Application level reordering of remote direct memory access operations. In: 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 988–997, May 2017.

  28. Lu, H., Seo, S., Balaji, P.: MPI+ULT: overlapping communication and computation with user-level threads. In: 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems, pp. 444–454, August 2015.

  29. Marjanović, V., Labarta, J., Ayguadé, E., Valero, M.: Overlapping communication and computation by using a hybrid MPI/SMPSS approach. In: Proceedings of the 24th ACM International Conference on Supercomputing, ICS 2010, pp. 5–16. ACM, New York (2010).

  30. Martin, S.M., Berger, M.J., Baden, S.B.: Toucan - a translator for communication tolerant MPI applications. In: 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 998–1007, May 2017.

  31. NERSC: National Energy Research Scientific Computing Center.

  32. Nguyen, T., Cicotti, P., Bylaska, E., Quinlan, D., Baden, S.B.: Bamboo - translating MPI applications to a latency-tolerant, data-driven form. In: 2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp. 1–11, November 2012.

  33. OpenMP ARB: OpenMP 4.0 specification (2013)

  34. Perez, J.M., Badia, R.M., Labarta, J.: A dependency-aware task-based programming environment for multi-core architectures. In: 2008 IEEE International Conference on Cluster Computing, pp. 142–151, September 2008.

  35. Tang, H., Yang, T.: Optimizing threaded MPI execution on SMP clusters. In: Proceedings of the 15th International Conference on Supercomputing, ICS 2001, pp. 381–392. ACM, New York (2001).

  36. Terpstra, D., Jagode, H., You, H., Dongarra, J.: Collecting performance data with PAPI-C. In: Müller, M.S., Resch, M.M., Schulz, A., Nagel, W.E. (eds.) Tools for High Performance Computing 2009, pp. 157–173. Springer, Heidelberg (2010).


  37. Tomasulo, R.M.: An efficient algorithm for exploiting multiple arithmetic units. IBM J. Res. Dev. 11(1), 25–33 (1967).


  38. Valiant, L.G.: A bridging model for parallel computation. Commun. ACM 33(8), 103–111 (1990).


  39. Zhang, Q., Johansen, H., Colella, P.: A fourth-order accurate finite-volume method with structured adaptive mesh refinement for solving the advection-diffusion equation. SIAM J. Sci. Comput. 34(2), B179–B201 (2012).



This research was supported by the Advanced Scientific Computing Research office of the U.S. Department of Energy under contracts No. DE-FC02-12ER26118 and DE-FG02-88ER25053. It was also supported in part by the Fulbright Foreign Student Program grant from the U.S. Department of State. Scott Baden dedicates his contributions to this paper to the memory of William Miles Tubbiola (1934–2018).


Correspondence to Sergio M. Martin.


© 2019 Springer Nature Switzerland AG


Martin, S.M., Baden, S.B. (2019). MATE, a Unified Model for Communication-Tolerant Scientific Applications. In: Hall, M., Sundar, H. (eds) Languages and Compilers for Parallel Computing. LCPC 2018. Lecture Notes in Computer Science, vol 11882. Springer, Cham.


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-34626-3

  • Online ISBN: 978-3-030-34627-0

  • eBook Packages: Computer Science (R0)
