The Journal of Supercomputing, Volume 63, Issue 1, pp 191–217

The Nornir run-time system for parallel programs using Kahn process networks on multi-core machines—a flexible alternative to MapReduce

  • Željko Vrba
  • Pål Halvorsen
  • Carsten Griwodz
  • Paul Beskow
  • Håvard Espeland
  • Dag Johansen
Open Access Article

Abstract

Even though shared-memory concurrency is a paradigm frequently used for developing parallel applications on small- and middle-sized machines, experience has shown that it is hard to use. This is largely caused by synchronization primitives which are low-level, inherently non-deterministic, and, consequently, non-intuitive to use. In this paper, we present the Nornir run-time system. Nornir is comparable to well-known frameworks such as MapReduce and Dryad that are recognized for their efficiency and simplicity. Unlike these frameworks, Nornir also supports process structures containing branches and cycles. Nornir is based on the formalism of Kahn process networks, which is a shared-nothing, message-passing model of concurrency. We deem this model a simple and deterministic alternative to shared-memory concurrency. Experiments with real and synthetic benchmarks on up to 8 CPUs show that performance in most cases scales almost linearly with the number of CPUs, when not limited by data dependencies. We also show that the modeling flexibility allows Nornir to outperform its MapReduce counterparts using well-known benchmarks.
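To make the programming model concrete: in a Kahn process network, independent sequential processes share no state and communicate only through FIFO channels with blocking reads, which is what makes the network's output deterministic regardless of how processes are scheduled. The following sketch illustrates the model using Python threads and queues; it is not Nornir's actual API (the names `source`, `square`, and `run_network` are illustrative only).

```python
import threading
import queue

# Illustrative sketch of a Kahn process network (KPN): sequential
# processes share no state and communicate only over FIFO channels
# with blocking reads. This is NOT Nornir's API, just the model.

SENTINEL = object()  # end-of-stream marker on a channel

def source(n, out):
    """Process that emits the tokens 1..n, then signals end of stream."""
    for i in range(1, n + 1):
        out.put(i)
    out.put(SENTINEL)

def square(inp, out):
    """Process that reads tokens (blocking), squares them, forwards them."""
    while (v := inp.get()) is not SENTINEL:
        out.put(v * v)
    out.put(SENTINEL)

def run_network(n):
    """Wire source -> square into a pipeline and collect its output."""
    a, b = queue.Queue(), queue.Queue()
    procs = [
        threading.Thread(target=source, args=(n, a)),
        threading.Thread(target=square, args=(a, b)),
    ]
    for p in procs:
        p.start()
    results = []
    while (v := b.get()) is not SENTINEL:
        results.append(v)
    for p in procs:
        p.join()
    return results

print(run_network(4))  # same result on every run, whatever the schedule
```

Because each process reads from its inputs with a blocking `get` and never inspects shared state, the sequence of tokens on every channel is fixed by the network's structure alone; this determinism is the property the paper contrasts with shared-memory synchronization primitives.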

Keywords

Parallel processing · Kahn process networks

References

  1. Allen G, Zucknick P, Evans B (2007) A distributed deadlock detection and resolution algorithm for process networks. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), vol 2, April 2007, pp II-33–II-36
  2. Apache Hadoop, accessed July 2009. http://hadoop.apache.org/
  3. Armstrong J (2007) A history of Erlang. In: HOPL III: proceedings of the 3rd ACM SIGPLAN conference on history of programming languages. ACM, New York, pp 6-1–6-26
  4. Arora NS, Blumofe RD, Plaxton CG (1998) Thread scheduling for multiprogrammed multiprocessors. In: Proceedings of ACM symposium on parallel algorithms and architectures (SPAA). ACM, New York, pp 119–129
  5. Brooks C, Lee EA, Liu X, Neuendorffer S, Zhao Y, Zheng H (2008) Heterogeneous concurrent modeling and design in Java (vol 1: Introduction to Ptolemy II). Tech rep UCB/EECS-2008-28, EECS Department, University of California, Berkeley, April 2008
  6. Buhr PA, Stroobosscher RA (1990) The μSystem: providing light-weight concurrency on shared-memory multiprocessor computers running UNIX. Softw Pract Exp 20(9):929–964
  7. Catalyurek U, Boman E, Devine K, Bozdag D, Heaphy R, Riesen L (2007) Hypergraph-based dynamic load balancing for adaptive scientific computations. In: Proceedings of 21st international parallel and distributed processing symposium (IPDPS'07). IEEE Press, New York. Also available as Sandia National Labs tech report SAND2006-6450C
  8. Chaiken R, Jenkins B, Larson P-Å, Ramsey B, Shakib D, Weaver S, Zhou J (2008) SCOPE: easy and efficient parallel processing of massive data sets. Proc VLDB Endow 1(2):1265–1276
  9. Chih Yang H, Dasdan A, Hsiao R-L, Parker DS (2007) Map-Reduce-Merge: simplified relational data processing on large clusters. In: Proceedings of ACM international conference on management of data (SIGMOD), pp 1029–1040
  10. de Kock EA, Essink G, Smits WJM, van der Wolf P, Brunel J-Y, Kruijtzer WM, Lieverse P, Vissers KA (2000) YAPI: application modeling for signal processing systems. In: Proceedings of design automation conference (DAC), pp 402–405
  11. de Kruijf M, Sankaralingam K (2007) MapReduce for the Cell BE architecture. University of Wisconsin computer sciences technical report CS-TR-2007-1625
  12. Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. In: Proceedings of symposium on operating systems design & implementation (OSDI). USENIX Association, Berkeley, p 10
  13. Dean J, Ghemawat S (2010) System and method for efficient large-scale data processing. US Patent No 7650331, January 2010
  14. Gedik B, Andrade H, Wu K-L, Yu PS, Doo M (2008) SPADE: the System S declarative stream processing engine. In: SIGMOD '08: proceedings of the 2008 ACM SIGMOD international conference on management of data. ACM, New York, pp 1123–1134
  15. Geilen M, Basten T (2003) Requirements on the execution of Kahn process networks. In: Programming languages and systems, European symposium on programming (ESOP). Springer, Berlin, pp 319–334
  16. Giacomoni J, Moseley T, Vachharajani M (2008) FastForward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue. In: PPoPP: proceedings of the ACM SIGPLAN symposium on principles and practice of parallel programming. ACM, New York, pp 43–52
  17. Gordon MI, Thies W, Amarasinghe S (2006) Exploiting coarse-grained task, data, and pipeline parallelism in stream programs. In: ASPLOS-XII: proceedings of the 12th international conference on architectural support for programming languages and operating systems. ACM, New York, pp 151–162
  18. He B, Fang W, Luo Q, Govindaraju NK, Wang T (2008) Mars: a MapReduce framework on graphics processors. In: PACT '08: proceedings of the 17th international conference on parallel architectures and compilation techniques. ACM, New York, pp 260–269
  19. Hudak P, Hughes J, Jones SP, Wadler P (2007) A history of Haskell: being lazy with class. In: HOPL III: proceedings of the 3rd ACM SIGPLAN conference on history of programming languages. ACM, New York, pp 12-1–12-55
  20. Intel Corporation, Threading Building Blocks. http://www.threadingbuildingblocks.org
  21. Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. In: Proceedings of the ACM SIGOPS/EuroSys European conference on computer systems. ACM, New York, pp 59–72
  22. Kahn G (1974) The semantics of a simple language for parallel programming. In: Information Processing 74: proceedings of the IFIP congress. North-Holland, Amsterdam
  23. Knuth DE (1997) Fundamental algorithms. The art of computer programming, vol 1. Addison-Wesley, Reading
  24. Lämmel R (2007) Google's MapReduce programming model—revisited. Sci Comput Program 68(3):208–237
  25. Lee EA, Parks T (1995) Dataflow process networks. Proc IEEE 83(5):773–801
  26. Message Passing Interface Forum, accessed July 2009. http://www.mpi-forum.org/
  27. Olson A, Evans B (2005) Deadlock detection for distributed process networks. In: ICASSP: proceedings of IEEE international conference on acoustics, speech, and signal processing, March 2005, vol 5, pp 73–76
  28. Olston C, Reed B, Srivastava U, Kumar R, Tomkins A (2008) Pig Latin: a not-so-foreign language for data processing. In: SIGMOD '08: proceedings of the 2008 ACM SIGMOD international conference on management of data. ACM, New York, pp 1099–1110
  29. Pike R, Dorward S, Griesemer R, Quinlan S (2005) Interpreting the data: parallel analysis with Sawzall. Sci Program 13(4):277–298
  30. PVM (Parallel Virtual Machine), accessed August 2010. http://www.csm.ornl.gov/pvm/
  31. Ranger C, Raghuraman R, Penmetsa A, Bradski G, Kozyrakis C (2007) Evaluating MapReduce for multi-core and multiprocessor systems. In: Proceedings of the IEEE international symposium on high performance computer architecture (HPCA). IEEE Computer Society, Washington, pp 13–24
  32. Richardson IEG: H.264/MPEG-4 part 10 white paper. Available online: http://www.vcodex.com/files/h264_overview_orig.pdf
  33. The OpenMP API specification for parallel programming, accessed July 2009. http://openmp.org/wp/
  34. Thompson M, Pimentel A (2007) Towards multi-application workload modeling in Sesame for system-level design space exploration. In: Embedded computer systems: architectures, modeling, and simulation, vol 4599/2007, pp 222–232
  35. Valvåg SV, Johansen D (2008) Oivos: simple and efficient distributed data processing. In: Proceedings of IEEE international conference on high performance computing and communications (HPCC), pp 113–122
  36. Valvåg SV, Johansen D (2009) Cogset: a unified engine for reliable storage and parallel processing. In: Proceedings of IFIP international conference on network and parallel computing workshops (NPC), pp 174–181
  37. Vrba Ž (2009) Implementation and performance aspects of Kahn process networks. PhD thesis, Department of Informatics, University of Oslo, Norway, December 2009. Dissertation No 903
  38. Vrba Ž, Halvorsen P, Griwodz C (2009) Evaluating the run-time performance of Kahn process network implementation techniques on shared-memory multiprocessors. In: International conference on complex, intelligent and software intensive systems (CISIS)—international workshop on multi-core computing systems (MuCoCoS), pp 639–644
  39. Vrba Ž, Halvorsen P, Griwodz C, Beskow P (2009) Kahn process networks are a flexible alternative to MapReduce. In: IEEE international conference on high performance computing and communications (HPCC), pp 154–162
  40. Vrba Ž, Halvorsen P, Griwodz C, Beskow P, Johansen D (2009) The Nornir run-time system for parallel programs using Kahn process networks. In: 6th international conference on network and parallel computing (NPC), October 2009. IEEE Computer Society, Los Alamitos, pp 1–8
  41. Vrba Ž, Halvorsen P, Griwodz C (2010) A simple improvement of the work-stealing scheduling algorithm. In: International conference on complex, intelligent and software intensive systems (CISIS)—international workshop on multi-core computing systems (MuCoCoS), pp 925–930

Copyright information

© The Author(s) 2010

Authors and Affiliations

  • Željko Vrba (1, 2)
  • Pål Halvorsen (1, 2)
  • Carsten Griwodz (1, 2)
  • Paul Beskow (1, 2)
  • Håvard Espeland (1, 2)
  • Dag Johansen (3)
  1. Simula Research Laboratory, Oslo, Norway
  2. Department of Informatics, University of Oslo, Oslo, Norway
  3. Department of Computer Science, University of Tromsø, Tromsø, Norway
