SmartApps: An Application Centric Approach to High Performance Computing

  • Lawrence Rauchwerger
  • Nancy M. Amato
  • Josep Torrellas
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2017)

Abstract

State-of-the-art run-time systems are a poor match for diverse, dynamic distributed applications because they are designed to support a wide variety of applications without customization to the specific requirements of any one of them. Little or no guiding information flows directly from the application to the run-time system, so the run-time system cannot tailor its services to the application, and performance suffers. To address this problem, we propose application-centric computing, or smart applications (SmartApps). The compiler embeds most run-time system services directly in the executable of a SmartApp, together with a performance-optimizing feedback loop that monitors the application's performance and adaptively reconfigures both the application and the OS/hardware platform. At run time, after taking into account the code's input and the system's resources and state, the SmartApp performs a global optimization. Because this optimization is instance-specific, it is much more tractable than a generic global optimization across application, OS, and hardware. The resulting customization of code and resources should yield major speedups. In this paper, we first describe the overall architecture of SmartApps and then present the achievements to date: run-time optimizations, performance modeling, and moderately reconfigurable hardware.
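The performance-optimizing feedback loop described above can be illustrated with a minimal, self-contained sketch. All names below (Strategy, serial_pass, blocked_pass) are hypothetical and chosen for illustration; this is not the SmartApps framework itself, only a toy version of the instance-specific selection the abstract describes: time candidate implementations of a hot region on the actual input, then commit to the fastest.

```cpp
// Minimal sketch of an adaptive feedback loop: measure candidate
// strategies on the real input instance, then select the fastest.
// All names are hypothetical illustrations, not the authors' API.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// A candidate configuration the embedded run-time can switch between,
// e.g., a loop scheduling policy or a speculative parallelization scheme.
struct Strategy {
    const char* name;
    void (*run)(std::vector<double>& data);  // one execution of the hot region
};

static void serial_pass(std::vector<double>& d) {
    for (auto& x : d) x = x * 1.0001 + 1.0;
}

static void blocked_pass(std::vector<double>& d) {
    const std::size_t B = 1024;  // block size chosen arbitrarily for the sketch
    for (std::size_t i = 0; i < d.size(); i += B)
        for (std::size_t j = i; j < i + B && j < d.size(); ++j)
            d[j] = d[j] * 1.0001 + 1.0;
}

int main() {
    std::vector<double> data(1 << 20, 1.0);
    Strategy candidates[] = {{"serial", serial_pass}, {"blocked", blocked_pass}};
    Strategy* best = &candidates[0];
    double best_time = 1e30;

    // Feedback loop: time each candidate on the actual input instance.
    for (auto& s : candidates) {
        auto t0 = std::chrono::steady_clock::now();
        s.run(data);
        double dt = std::chrono::duration<double>(
                        std::chrono::steady_clock::now() - t0).count();
        std::printf("%s: %.4f s\n", s.name, dt);
        if (dt < best_time) { best_time = dt; best = &s; }
    }

    // Commit to the winner for the remaining (steady-state) iterations.
    for (int iter = 0; iter < 8; ++iter) best->run(data);
    std::printf("selected strategy: %s\n", best->name);
}
```

A real system would re-enter the measurement phase when the input or system state changes; the sketch selects once, which is enough to show the monitor/decide/reconfigure structure.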

Copyright information

© Springer-Verlag Berlin Heidelberg 2001

Authors and Affiliations

  • Lawrence Rauchwerger (1)
  • Nancy M. Amato (1)
  • Josep Torrellas (2)

  1. Department of Computer Science, Texas A&M University, USA
  2. Department of Computer Science, University of Illinois, Urbana, Illinois, USA