Scalasca evolved from its predecessor, KOJAK.
Scalasca is an open-source software tool that supports the performance optimization of parallel programs by measuring and analyzing their runtime behavior. The analysis identifies potential performance bottlenecks – in particular those concerning communication and synchronization – and offers guidance in exploring their causes. Scalasca targets mainly scientific and engineering applications based on the programming interfaces MPI and OpenMP, including hybrid applications based on a combination of the two. The tool has been specifically designed for use on large-scale systems including IBM Blue Gene and Cray XT, but is also well suited for small- and medium-scale HPC platforms.
Driven by growing application requirements and accelerated by current trends in microprocessor design, the number of processor cores on modern supercomputers is expanding from generation to generation. As a consequence, supercomputing applications are required to harness much higher degrees of parallelism in order to satisfy their enormous demand for computing power. However, with today’s leadership systems featuring more than a hundred thousand cores, writing efficient codes that exploit all the available parallelism becomes increasingly difficult. Performance optimization is therefore expected to become an even more essential software-process activity, critical for the success of many simulation projects. The situation is exacerbated by the fact that the growing number of cores imposes scalability demands not only on applications but also on the software tools needed for their development.
Making applications run efficiently on larger scales is often thwarted by excessive communication and synchronization overheads. Especially during simulations of irregular and dynamic domains, these overheads are often enlarged by wait states that appear in the wake of load or communication imbalance when processes fail to reach synchronization points simultaneously. Even small delays of single processes may spread wait states across the entire machine, and their accumulated duration can constitute a substantial fraction of the overall resource consumption. In particular, when trying to scale communication-intensive applications to large processor counts, such wait states can result in substantial performance degradation.
To address these challenges, Scalasca has been designed as a diagnostic tool to support application optimization on highly scalable systems. Although also covering single-node performance via hardware-counter measurements, Scalasca mainly targets communication and synchronization issues, whose understanding is critical for scaling applications to performance levels in the petaflops range. A distinctive feature of Scalasca is its ability to identify wait states that occur, for example, as a result of unevenly distributed workloads.
To evaluate the behavior of parallel programs, Scalasca takes performance measurements at runtime to be analyzed postmortem (i.e., after program termination). The user of Scalasca can choose between two different analysis modes:
Performance overview on the call-path level via profiling (called runtime summarization in Scalasca terminology)
In-depth study of application behavior via event tracing
In profiling mode, Scalasca generates aggregate performance metrics for individual function call paths, which are useful for identifying the most resource-intensive parts of the program and assessing process-local performance via hardware-counter analysis. In tracing mode, Scalasca goes one step further and records individual performance-relevant events, allowing the automatic identification of call paths that exhibit wait states. This core feature is the reason why Scalasca is classified as an automatic tool. As an alternative, the resulting traces can be visualized in a traditional time-line browser such as VAMPIR to study the detailed interactions among different processes or threads. While providing more behavioral detail, traces also consume significantly more storage space and therefore have to be generated with care.
Preparation of a target application executable for measurement and analysis requires it to be instrumented to notify the measurement library, which is linked to the executable, of performance-relevant execution events whenever they occur at runtime. On all systems, a mix of manual and automatic instrumentation mechanisms is offered. Instrumentation configuration and processing of source files are achieved by prefixing selected compilation commands and the final link command with the Scalasca instrumenter, without requiring other changes to optimization levels or the build process, as in the following example for the file foo.c:
scalasca -instrument mpicc -c foo.c
Scalasca follows a direct instrumentation approach. In contrast to interrupt-based sampling, which takes periodic measurements whenever a timer expires, Scalasca takes measurements when the control flow reaches certain points in the code. These points mark performance-relevant events, such as entering/leaving a function or sending/receiving a message. Although instrumentation points have to be chosen with care to minimize intrusion, direct instrumentation offers advantages for the global analysis of communication and synchronization operations. In addition to pure direct instrumentation, future versions of Scalasca will combine direct instrumentation with sampling in profiling mode to better control runtime dilation, while still supporting global communication analyses.
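The difference from sampling can be illustrated with a minimal Python sketch of direct instrumentation. It is not Scalasca's measurement library: the `instrument` decorator and the `events` buffer are hypothetical names chosen for illustration. Each instrumented function records an enter and a leave event at the exact point where the control flow passes through it, rather than at timer-driven intervals.

```python
import time
from functools import wraps

# Hypothetical in-memory event buffer: (timestamp, kind, region)
events = []

def instrument(func):
    """Record an event whenever control flow enters or leaves func."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        events.append((time.perf_counter(), "enter", func.__name__))
        try:
            return func(*args, **kwargs)
        finally:
            # The leave event is recorded even if func raises.
            events.append((time.perf_counter(), "leave", func.__name__))
    return wrapper

@instrument
def bar():
    return sum(range(1000))

@instrument
def foo():
    return bar()

foo()
# Sequence of recorded events without their time stamps
kinds = [(kind, region) for _, kind, region in events]
```

Because every function entry and exit is captured, the event sequence is complete and ordered, which is what makes a later global analysis of matching communication events possible. The price is that frequently called small functions generate many events.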
Measurements are collected and analyzed under the control of a workflow manager that determines how the application should be run, and then configures measurement and analysis accordingly. When tracing is requested, it automatically configures and executes the parallel trace analyzer with the same number of processes as used for measurement. The following examples demonstrate how to request measurements from MPI application bar to be executed with 65,536 ranks, once in profiling and once in tracing mode (distinguished by the use of the “-t” option).
scalasca -analyze mpiexec \
-np 65536 bar <arglist>
scalasca -analyze -t mpiexec \
-np 65536 bar <arglist>
Scalasca can efficiently calculate many execution performance metrics by accumulating statistics during measurement, avoiding the cost of storing them with events for later analysis. For example, elapsed times and hardware-counter metrics for source regions (e.g., routines or loops) can be immediately determined and the differences accumulated. Whereas trace storage requirements increase in proportion to the number of events (dependent on the measurement duration), summarized statistics for a call-path profile per thread have a fixed storage requirement (dependent on the number of threads and executed call paths).
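The storage argument above can be made concrete with a small Python sketch, which is an illustration under simplifying assumptions rather than Scalasca's implementation. The class `CallPathProfile` and its methods are hypothetical. It accumulates visit counts and inclusive times per call path as events occur and never stores the events themselves, so its memory use depends on the number of distinct call paths and not on the run length.

```python
class CallPathProfile:
    """Accumulate per-call-path statistics without storing events."""

    def __init__(self):
        self._stack = []   # (region, enter_time) of regions not yet left
        self.metrics = {}  # call path -> [visits, inclusive_time]

    def enter(self, region, now):
        self._stack.append((region, now))

    def leave(self, region, now):
        # The call path is the list of regions entered but not yet left.
        path = tuple(name for name, _ in self._stack)
        name, entered = self._stack.pop()
        assert name == region, "unbalanced enter/leave"
        entry = self.metrics.setdefault(path, [0, 0.0])
        entry[0] += 1               # visit count
        entry[1] += now - entered   # accumulated inclusive time

profile = CallPathProfile()
profile.enter("main", 0.0)
for step in range(3):
    profile.enter("foo", 1.0 + step)
    profile.leave("foo", 1.5 + step)
profile.leave("main", 4.0)
```

After three iterations the profile still holds only two entries, one for `main` and one for `main → foo`. A trace of the same run would already contain eight events.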
In addition to call-path visit counts, execution times, and optional hardware counter metrics, Scalasca profiles include various MPI statistics, such as the numbers of synchronization, communication and file I/O operations along with the associated number of bytes transferred. Each metric is broken down into collective versus point-to-point/individual, sends/writes versus receives/reads, and so on. Call-path execution times separate MPI message-passing and OpenMP multithreading costs from purely local computation, and break them down further into initialization/finalization, synchronization, communication, file I/O and thread management overheads (as appropriate). For measurements using OpenMP, additional thread idle time and limited parallelism metrics are derived, assuming a dedicated core for each thread.
Scalasca provides accumulated metric values for every combination of call path and thread. A call path is defined as the list of code regions entered but not yet left on the way to the currently active one, typically starting from the main function, such as in the call chain main() → foo() → bar(). Which regions actually appear on a call path depends on which regions have been instrumented. When execution is complete, all locally executed call paths are combined into a global dynamic call tree for interactive exploration (as shown in the middle of Fig. 2, although the screen shot actually visualizes a wait-state report).
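As a rough sketch of the final merging step, the following Python fragment combines the call paths observed on individual processes into one nested call tree. The function name `merge_call_paths` and the representation of the tree as nested dictionaries are assumptions made for illustration. The sketch does not attempt to reproduce how Scalasca unifies call trees across processes.

```python
def merge_call_paths(per_process_paths):
    """Combine the call paths of all processes into one nested call tree."""
    tree = {}
    for paths in per_process_paths:
        for path in paths:
            node = tree
            for region in path:
                # Create the child node on first sight, reuse it afterwards.
                node = node.setdefault(region, {})
    return tree

# Call paths executed locally on two hypothetical ranks
rank0 = [("main",), ("main", "foo"), ("main", "foo", "bar")]
rank1 = [("main",), ("main", "baz")]

tree = merge_call_paths([rank0, rank1])
# tree == {"main": {"foo": {"bar": {}}, "baz": {}}}
```

The global tree contains every call path executed by at least one process, so a path such as main() → baz() appears even though rank 0 never executed it.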
In message-passing applications, processes often require access to data provided by remote processes, making the progress of a receiving process dependent upon the progress of a sending process. If a rendezvous protocol is used, this relationship also applies in the opposite direction. Collective synchronization is similar in that its completion requires each participating process to have reached a certain point. As a consequence, a significant fraction of the time spent in communication and synchronization routines can often be attributed to wait states that occur when processes fail to reach implicit or explicit synchronization points in a timely manner. Scalasca provides a diagnostic method that allows the localization of wait states by automatically searching event traces for characteristic patterns.
Figure 3 shows several examples of wait states that can occur in message-passing programs. The first one is the Late Sender pattern (Fig. 3a), where a receiver is blocked while waiting for a message to arrive. That is, the receive operation is entered by the destination process before the corresponding send operation has been entered by the source process. The waiting time lost as a consequence is the time difference between entering the send and the receive operations. Conversely, the Late Receiver pattern (Fig. 3b) describes a sender that is likely to be blocked while waiting for the receiver when a rendezvous protocol is used. This can happen for several reasons. Either the MPI implementation is working in synchronous mode by default, or the size of the message to be sent exceeds the available MPI-internal buffer space and the operation is blocked until the data is transferred to the receiver. The Late Sender / Wrong Order pattern (Fig. 3c) describes a receiver waiting for a message, although an earlier message is ready to be received by the same destination process (i.e., messages received in the wrong order). Finally, the Wait at N×N pattern (Fig. 3d) quantifies the waiting time due to the inherent synchronization in n-to-n operations, such as MPI_Allreduce(). A full list of the wait-state types supported by Scalasca, including explanatory diagrams, can be found online in the Scalasca documentation.
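The waiting times of two of these patterns can be written down in a few lines of Python. The function names are hypothetical. The Late Sender formula follows the definition given above. The Wait at N×N formula rests on an assumption that the text does not spell out, namely that each process waits from the moment it enters the operation until the last participant has entered.

```python
def late_sender_wait(send_enter, recv_enter):
    """Time a receiver loses when it enters the receive before the send starts."""
    return max(0.0, send_enter - recv_enter)

def wait_at_nxn(enter_times):
    """Per-process waiting time in an n-to-n collective.

    Assumption: no process can complete before the last one has entered.
    """
    last = max(enter_times)
    return [last - t for t in enter_times]

# Receiver enters at t=2.0, sender only at t=5.0: 3.0 time units are lost.
blocked = late_sender_wait(send_enter=5.0, recv_enter=2.0)

# Sender is early: no Late Sender wait state.
not_blocked = late_sender_wait(send_enter=1.0, recv_enter=2.0)

# Three processes entering a collective such as MPI_Allreduce()
collective = wait_at_nxn([1.0, 4.0, 2.5])
```

In the collective example the process entering at 4.0 is the last to arrive and therefore waits for nobody, while the other two accumulate 3.0 and 1.5 time units of waiting time.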
Parallel Wait-State Search
To accomplish the search in a scalable way, Scalasca exploits both distributed memory and parallel processing capabilities available on the target system. After the target application has terminated and the trace data have been flushed to disk, the trace analyzer is launched with one analysis process per (target) application process and loads the entire trace data into its distributed memory address space. Future versions of Scalasca may exploit persistent memory segments available on systems such as Blue Gene/P to pass the trace data to the analysis stage without involving any file I/O. While traversing the traces in parallel, the analyzer performs a replay of the application’s original communication. During the replay, the analyzer identifies wait states in communication and synchronization operations by measuring temporal differences between local and remote events after their time stamps have been exchanged using an operation of similar type. Detected wait-state instances are classified and quantified according to their significance for every call path and system resource involved. Since trace processing capabilities (i.e., processors and memory) grow proportionally with the number of application processes, good scalability can be achieved even at previously intractable scales. Recent scalability improvements allowed Scalasca to complete trace analyses of runs with up to 294,912 cores on a 72-rack IBM Blue Gene/P system.
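The replay idea can be sketched in sequential Python, with queues standing in for the messages that the analysis processes would exchange. This is a simplified model under stated assumptions, not the parallel analyzer: the trace format, the function `replay`, and the `channels` dictionary are hypothetical, only point-to-point Late Sender wait states are detected, and messages between a pair of ranks are assumed to be matched in order.

```python
from collections import defaultdict, deque

def replay(traces):
    """Detect Late Sender waiting time by replaying communication.

    traces maps each rank to a list of (enter_time, op, peer) tuples,
    where op is "send" or "recv".
    """
    # (source, destination) -> forwarded enter time stamps of send events
    channels = defaultdict(deque)

    # Replaying a send: forward the local enter time stamp to the receiver.
    for rank, trace in traces.items():
        for enter, op, peer in trace:
            if op == "send":
                channels[(rank, peer)].append(enter)

    # Replaying a receive: compare the local enter time stamp with the
    # remote one that arrived through the replayed message.
    waits = defaultdict(float)
    for rank, trace in traces.items():
        for enter, op, peer in trace:
            if op == "recv":
                send_enter = channels[(peer, rank)].popleft()
                waits[rank] += max(0.0, send_enter - enter)
    return dict(waits)

traces = {
    0: [(5.0, "send", 1), (6.0, "recv", 1)],
    1: [(2.0, "recv", 0), (7.0, "send", 0)],
}
waits = replay(traces)
```

Each rank only ever touches its own trace plus the time stamps it receives from communication partners, which is why the work and memory of the real analysis can grow with the number of application processes.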
A modified form of the replay-based trace analysis scheme is also applied to detect wait states occurring in MPI-2 RMA operations. In this case, RMA communication is used to exchange the required information between processes. Finally, Scalasca also provides the ability to process traces from hybrid MPI/OpenMP and pure OpenMP applications. However, the parallel wait-state search does not yet recognize OpenMP-specific wait states, such as barrier waiting time or lock contention, previously supported by its predecessor.
Wait-State Search on Clusters without Global Clock
To allow accurate trace analyses on systems without globally synchronized clocks, Scalasca can synchronize inaccurate time stamps postmortem. Linear interpolation based on clock offset measurements during program initialization and finalization already accounts for differences in offset and drift, assuming that the drift of an individual processor is not time dependent. This step is mandatory on all systems without a global clock, such as Cray XT and most PC or compute blade clusters. However, inaccuracies and drifts varying over time can still cause violations of the logical event order that are harmful to the accuracy of the analysis. For this reason, Scalasca compensates for such violations by shifting communication events in time as much as needed to restore the logical event order while trying to preserve the length of intervals between local events. This logical synchronization is currently optional and should be performed if the trace analysis reports (too many) violations of the logical event order.
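The linear interpolation step can be sketched as follows, assuming that one offset to a reference clock is measured at initialization and one at finalization. The function `make_corrector` and its parameter names are hypothetical. The subsequent logical synchronization that repairs remaining order violations is not shown.

```python
def make_corrector(t_begin, offset_begin, t_end, offset_end):
    """Build a mapping from local time stamps to the reference clock.

    offset_begin and offset_end are the measured differences between the
    reference clock and the local clock at local times t_begin and t_end.
    The drift is assumed to be constant in between.
    """
    drift = (offset_end - offset_begin) / (t_end - t_begin)

    def to_global(local_time):
        return local_time + offset_begin + drift * (local_time - t_begin)

    return to_global

# Local clock lags the reference by 2.0 at start and by 4.0 at the end.
to_global = make_corrector(t_begin=0.0, offset_begin=2.0,
                           t_end=8.0, offset_end=4.0)
midpoint = to_global(4.0)  # halfway through the run, the offset is 3.0
```

If the drift itself varies during the run, time stamps corrected this way can still place a receive before its matching send, which is the kind of violation the optional logical synchronization addresses.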
Future enhancements of Scalasca will aim at further improving both its functionality and its scalability. In addition to supporting the more advanced features of OpenMP such as nested parallelism and tasking as an immediate priority, Scalasca is expected to evolve toward emerging programming models and architectures including partitioned global address space (PGAS) languages and heterogeneous systems. Moreover, optimized data management and analysis workflows including in-memory trace analysis will allow Scalasca to master even larger processor configurations than it does today. A recent example in this direction is the substantial reduction of the trace-file creation overhead that was achieved by mapping large numbers of logical process-local trace files onto a small number of physical files, a feature that will become available in future releases of Scalasca.
In addition to keeping up with the rapid new developments in parallel hardware and software, research is also undertaken to expand the general understanding of parallel performance in simulation codes. The examples below summarize two ongoing projects aimed at increasing the expressive power of the analyses supported by Scalasca. The description reflects the status of March 2011.
Time-Series Call-Path Profiling
However, even generating call-path profiles (as opposed to traces) separately for thousands of iterations to identify the call paths responsible may exceed the available buffer space – especially when the call tree is large and more than one metric is collected. For this reason, a runtime approach for the semantic compression of a series of call-path profiles based on incrementally clustering single-iteration profiles was developed that scales in terms of the number of iterations without sacrificing important performance details. This method, which will be integrated in future versions of Scalasca, offers low runtime overhead by using only a condensed version of the profile data when calculating distances and accounts for process-dependent variations by making all clustering decisions locally.
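One way to picture incremental clustering with a fixed memory budget is the following Python sketch. It is a generic illustration, not the published algorithm: the distance function, the merge rule (weighted mean of the two closest clusters), and the names `distance` and `add_iteration` are all assumptions, and the condensed profile representation mentioned above is not modeled.

```python
def distance(a, b):
    """Manhattan distance between two metric vectors (an assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def add_iteration(clusters, profile, max_clusters):
    """Add one single-iteration profile, keeping at most max_clusters.

    clusters is a list of [mean_vector, iteration_count] entries.
    """
    clusters.append([list(profile), 1])
    if len(clusters) <= max_clusters:
        return
    # Budget exceeded: merge the two closest clusters into their weighted mean.
    pairs = [(distance(clusters[i][0], clusters[j][0]), i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))]
    _, i, j = min(pairs)
    (mean_i, n_i), (mean_j, n_j) = clusters[i], clusters[j]
    total = n_i + n_j
    merged = [(x * n_i + y * n_j) / total for x, y in zip(mean_i, mean_j)]
    clusters[i] = [merged, total]
    del clusters[j]  # j > i, so index i stays valid

# Four iterations, each summarized by two metric values
iterations = [[1.0, 1.0], [1.0, 1.0], [9.0, 9.0], [1.0, 1.0]]
clusters = []
for profile in iterations:
    add_iteration(clusters, profile, max_clusters=2)
```

The three similar iterations collapse into a single cluster, while the outlier iteration keeps a cluster of its own, so the unusual behavior remains visible although only two profiles are stored.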
Identifying the Root Causes of Wait-State Formation
However, excess workload identified as the root cause of wait states usually cannot simply be removed. To achieve a better balance, optimization hypotheses drawn from such an analysis typically propose the redistribution of the excess load to other processes instead. Unfortunately, redistributing workloads in complex message-passing applications can have surprising side effects that may compromise the expected reduction of waiting times. Given that balancing the load statically or even introducing a dynamic load-balancing scheme constitutes a major code change, such procedures should ideally be performed only if the prospective performance gain is likely to materialize. Other recent work therefore concentrated on determining the savings we can realistically hope for when redistributing a given delay – before altering the application itself. Since the effects of such changes are hard to quantify analytically, they are simulated in a scalable manner via a parallel real-time replay of event traces after they have been modified to reflect the redistributed load.
Bibliographic Notes and Further Reading
Scalasca is available for download including documentation under the New BSD license at http://www.scalasca.org.
Scalasca emerged from the KOJAK project, which was started in 1998 at the Jülich Supercomputing Centre in Germany to study the automatic evaluation of parallel performance data, and in particular, the automatic detection of wait states in event traces of parallel applications. The wait-state analysis first concentrated on MPI and later on OpenMP and hybrid codes, motivating the definition of the POMP profiling interface for OpenMP, which is still used today even beyond Scalasca by OpenMP-enabled profilers such as ompP (see OpenMP Profiling with OmpP) and TAU. A comprehensive description of the initial trace-analysis toolset resulting from this effort, which was publicly released for the first time in 2003 under the name KOJAK, is given in [19].
During the following years, KOJAK’s wait-state search was optimized for speed, refined to exploit virtual process topologies, and extended to support MPI-2 RMA communication. In addition to the detection of wait states occurring during a single run, KOJAK also introduced a framework for comparing the analysis results of different runs, for example, to judge the effectiveness of optimization measures. An extensive snapshot of this more advanced version of KOJAK, including further literature references, has been published separately. However, KOJAK still analyzed the traces sequentially after the process-local trace data had been merged into a single global trace file, an undesired scalability limitation in view of the dramatically rising number of cores employed on modern parallel architectures.
In 2006, after the acquisition of a major grant from the Helmholtz Association of German Research Centres, the Scalasca project was started in Jülich as the successor to KOJAK with the objective of improving the scalability of the trace analysis by parallelizing the search for wait states. A detailed discussion of the parallel replay underlying the parallel search is available in the literature. Variations of the scalable replay mechanism were applied to correct event time stamps taken on clusters without global clock, to simulate the effects of optimizations such as balancing the load of a function more evenly across the processes of a program, and to identify wait states in MPI-2 RMA communication in a scalable manner. Moreover, the parallel trace analysis was also demonstrated to run on computational grids consisting of multiple geographically dispersed clusters that are used as a single coherent system. Finally, a very recent replay-based method attributes the costs of wait states in terms of resource waste to their original cause.
Since the enormous data volume sometimes makes trace analysis challenging, runtime summarization capabilities were added to Scalasca both as a simple means to obtain a performance overview and as a basis to optimally configure the measurement for later trace generation. Scalasca integrates both measurement options in a unified tool architecture, whose details are described elsewhere. Recently, a semantic compression algorithm was developed that will allow Scalasca to take time-series profiles in a space-efficient manner even if the target application performs large numbers of timesteps.
Major application studies with Scalasca include a survey of using it on leadership systems, a comprehensive analysis of how the performance of the SPEC MPI2007 benchmarks evolves as their execution progresses, and the investigation of a gradually developing communication imbalance in the PEPC particle simulation. Finally, a recent study of the Sweep3D benchmark demonstrated performance measurements and analyses with up to 294,912 processes.
From 2003 until 2008, KOJAK and later Scalasca were jointly developed with the Innovative Computing Laboratory at the University of Tennessee. During their lifetime, the two projects received funding from the Helmholtz Association of German Research Centres, the US Department of Energy, the US Department of Defense, the US National Science Foundation, the German Science Foundation, the German Federal Ministry of Education and Research, and the European Union. Today, Scalasca is a joint project between Jülich and the German Research School for Simulation Sciences in nearby Aachen.
The following individuals have contributed to Scalasca and its predecessor: Erika Ábrahám, Daniel Becker, Nikhil Bhatia, David Böhme, Jack Dongarra, Dominic Eschweiler, Sebastian Flott, Wolfgang Frings, Karl Fürlinger, Christoph Geile, Markus Geimer, Marc-André Hermanns, Michael Knobloch, David Krings, Guido Kruschwitz, André Kühnal, Björn Kuhlmann, John Linford, Daniel Lorenz, Bernd Mohr, Shirley Moore, Ronal Muresano, Jan Mußler, Andreas Nett, Christian Rössel, Matthias Pfeifer, Peter Philippen, Farzona Pulatova, Divya Sankaranarayanan, Pavel Saviankou, Marc Schlütter, Christian Siebert, Fengguang Song, Alexandre Strube, Zoltán Szebenyi, Felix Voigtländer, Felix Wolf, and Brian Wylie.