An Out-of-Core Task-based Middleware for Data-Intensive Scientific Computing

  • Erik Saule
  • Hasan Metin Aktulga
  • Chao Yang
  • Esmond G. Ng
  • Ümit V. Çatalyürek


In datacenters, non-volatile memory storage is being adopted rapidly because it offers higher bandwidth and lower latency than traditional disk-based storage systems for the management and analysis of large datasets. These drastic changes in system architecture will require rethinking systems software as well. Specifically, as hardware performance improves, software efficiency will become the next bottleneck. Here, we present an out-of-core task-based middleware together with a domain-specific application interface, which will increase programmer productivity while still ensuring good performance and scalability by separating computation from data movement.


Task Graph, Storage Service, Solid State Drive, Local Scheduler, Global Scheduler
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Erik Saule (1, corresponding author)
  • Hasan Metin Aktulga (2)
  • Chao Yang (2)
  • Esmond G. Ng (2)
  • Ümit V. Çatalyürek (3)

  1. Department of Computer Science, University of North Carolina at Charlotte, Charlotte, USA
  2. Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, USA
  3. Department of Biomedical Informatics, The Ohio State University, Columbus, USA
