Experimental Algorithmics pp 1-23

Part of the Lecture Notes in Computer Science book series (LNCS, volume 2547)

Algorithm Engineering for Parallel Computation

  • David A. Bader
  • Bernard M. E. Moret
  • Peter Sanders

Abstract

The emerging discipline of algorithm engineering has primarily focused on transforming pencil-and-paper sequential algorithms into robust, efficient, well-tested, and easily used implementations. As parallel computing becomes ubiquitous, we need to extend algorithm engineering techniques to parallel computation. Such an extension adds significant complications. After a short review of algorithm engineering achievements for sequential computing, we review the various complications caused by parallel computing, present some examples of successful efforts, and give a personal view of possible future research.

Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • David A. Bader (1)
  • Bernard M. E. Moret (1)
  • Peter Sanders (2)
  1. Departments of Electrical and Computer Engineering, and Computer Science, University of New Mexico, Albuquerque, USA
  2. Max-Planck-Institut für Informatik, Saarbrücken, Germany
