Refactoring GrPPI: Generic Refactoring for Generic Parallelism in C++

The Generic Reusable Parallel Pattern Interface (GrPPI) is a very useful abstraction over different parallel pattern libraries, allowing the programmer to write generic patterned parallel code that can easily be compiled to different backends such as FastFlow, OpenMP, Intel TBB and C++ threads. However, rewriting legacy code to use GrPPI still involves code transformations that can be highly non-trivial, especially for programmers who are not experts in parallelism. This paper describes software refactorings to semi-automatically introduce instances of GrPPI patterns into sequential C++ code, as well as safety checking static analysis mechanisms which verify that introducing patterns into the code does not introduce concurrency-related bugs such as race conditions. We demonstrate the refactorings and safety-checking mechanisms on four simple benchmark applications, showing that we are able to obtain, with little effort, GrPPI-based parallel versions that accomplish good speedups (comparable to those of manually-produced parallel versions) using different pattern backends.


Introduction
The scale of parallelism in modern hardware systems is increasing at a very fast rate, with 72-core systems available off-the-shelf even in the embedded market. At the same time, such systems are becoming increasingly heterogeneous, integrating GPUs, FPGAs, DSPs and other specialised processors within the same chip. The large scale of parallelism and heterogeneity of systems make programming modern parallel hardware very difficult, often requiring a combination of different (and usually complex) programming models (e.g. POSIX threads for the central multicore processor and OpenCL for GPUs), coupled with careful manual tuning, to achieve good performance. Parallel patterns [4] have been recognised as an excellent compromise between the ease of programming and the ability to generate efficient code for large-scale heterogeneous parallel architectures. They have been endorsed by several major IT companies, such as Intel [42] and Microsoft [14], giving rise to a multitude of parallel pattern libraries, most of which are incompatible with one another and each of which usually has specific advantages (and disadvantages) over the others. GrPPI [16] represents one of the first attempts at a uniform interface to parallel patterns, based on C++ template programming, that allows the generation of target code for different pattern libraries. Listings 1 and 2 show code for a farm pattern (a typical embarrassingly parallel pattern) that targets the OpenMP and Intel TBB pattern libraries, respectively. We can observe that the only difference is in the declaration of the par variable, in Line 1, where the back-end for implementation is specified. While GrPPI makes the patterned code easier to write [16], and also minimizes the cost of switching between different implementation pattern libraries, transforming sequential code to use GrPPI is still a highly non-trivial task.
The programmer needs to first ensure that it is safe (in terms of unexpected side effects and race conditions) to transform the sequential code into its equivalent parallel implementation, and, secondly, to transform loops of the sequential code into calls to the appropriate patterns. An equivalent version of the code in Listing 2 is given in Listing 12.
This paper describes refactorings to introduce GrPPI patterns into sequential code. These refactorings, implemented in the ParaFormance toolset for developing and maintaining parallel programs, provide a semi-automatic way of transforming sequential C++ code into its parallel patterned counterpart. The programmer is only required to insert simple annotations in the code to denote the parts that are, possibly, amenable to patterned parallelisation. We also describe safety checking mechanisms that ensure the parts of the code annotated by the programmer are, indeed, safe to be transformed into patterns. Safety checking is based on static analyses of the loops in the application to ensure that refactoring the code does not introduce any undesired behaviour to the execution once parallelised. The specific research contributions of this paper are:
1. Novel refactorings to introduce farm and pipeline parallelism, based on the GrPPI interface, into sequential C++ code, implemented as part of the ParaFormance refactoring tool-set;
2. A study into the process of refactoring sequential C++ applications into their safe parallel equivalents, on a range of different real-world examples, for different parallel implementations, using a fully-automated tool-supported refactoring framework;
3. Demonstrations of the effectiveness of the performance of refactored examples, showing speedups of up to 23.93 on a 28-core machine over the sequential code;
4. The transformations of four applications into their GrPPI equivalents, together with a discussion of the transformation methods.

Patterns, GrPPI and Refactoring
Parallel patterns are a high-level abstraction for representing classes of computations that are similar in terms of their parallel structure, but different in terms of problem-specific operations. A typical example of a parallel pattern is a parallel map, where the same operation is applied to disjoint subsets of the input in parallel. Regardless of whether the actual operation is, for example, multiplying a matrix by a vector or processing a pixel of an image, the parallel structure of the computation will be the same. Parallel patterns are typically implemented as library functions, which handle creation, synchronisation and communication between parallel threads, while the problem-specific (and often sequential) computations are provided as the pattern parameters. In this paper, we restrict ourselves to two classical parallel patterns, which can be further generalised to include a broader set of parallel patterns.
- The pipeline pattern models a parallel pipeline. Here, a sequence of functions, $f_1, \ldots, f_m$, is applied to a stream of independent inputs, $x_1, \ldots, x_n$. The output of $f_i$ becomes the input to $f_{i+1}$, so that the parallelism arises from executing different stages $f_i$ in parallel while items progress through the pipeline.
- The farm pattern models a task parallel computation for a stage $f_i$ in a pipeline that can be applied to a stream of independent inputs, $x_1, \ldots, x_n$. For each item $x_j$ in the input stream, the farm delivers to the output stream the value $f_i(x_j)$. Multiple applications of the operation to different input stream elements may be processed in parallel.
Details and semantics of these and other patterns are described in [16]. Note that the technique described in this paper can be further generalised to include the full set of commonly-used parallel patterns.
Refactoring is the process of changing the structure of a program while preserving its functional semantics in order, for example, to increase code quality, programming productivity and code reuse. The term refactoring was first introduced by Opdyke in his PhD thesis in 1992 [38], and the concept goes at least as far back as the fold/unfold system proposed by Burstall and Darlington in 1977 [13]. In our case, refactorings are source-to-source transformations of the code that are performed semi-automatically, under the programmer's guidance, and possibly with their input. ParaFormance is a refactoring tool-suite developed at the University of St Andrews that refactors C and C++ programs into parallel versions. It targets a number of different back-ends, including FastFlow, OpenMP, GrPPI and Intel Threading Building Blocks (TBB).
GrPPI [16] is a parallel pattern interface that uses C++ template metaprogramming to provide implementations of a number of parallel patterns. The patterns in GrPPI are generic over a number of different parallelism models, currently including support for ISO C++ Threads, OpenMP, Intel TBB, and FastFlow, as well as sequential execution. The ability of decoupling patterns from the concrete execution model is a key idea in GrPPI. With GrPPI, the application source code can be the same independently of the concrete execution policy. This approach reduces the conceptual load for programmers as they can focus on how computations are composed without paying attention to details that are specific to a programming model [16]. Moreover, this increases portability as moving from one parallel framework to a different one has no significant impact on source code. Additionally, some models might not be available (or allowed by coding standards or certification policies) in given platforms which is solved in GrPPI by selecting a different one. The ISO C++ Threads back-end is provided as a fallback and is always guaranteed to be present in any ISO C++ compliant platform. In [23] additional evidence on the negligible overhead of GrPPI over manual implementations is provided. GrPPI offers a set of patterns that can be classified in three groups: data patterns, task patterns and stream patterns. In this paper we focus on stream patterns. Data patterns perform a transformation on one or more data sets and give as a result a new data set (map, stencil) or a value (reduce or map/reduce). In all those patterns the input dimensionality is unbounded, meaning that the transformation can be applied to any number of data sets. This is different to the C++ standard library, which takes one or two data sets as an input, but similar to the approach of SkePU 2 [21]. 
The only task pattern included in GrPPI is divide-and-conquer, which allows the expression of computations where a given problem is split (possibly recursively) into smaller sub-problems which are then solved and combined. GrPPI is able to apply parallelism during the three stages of the pattern. The basic GrPPI model for stream parallelism is a pipeline which processes a stream of data items: items are produced by a generator, processed by some intermediate components, and then disposed of by a sink. The individual stages can all be executed in parallel. A pipeline stage can be any callable entity. However, the most common case is a C++ lambda expression. As a simple example, consider a pipeline that reads a number of integers from a file, squares each one, and writes the results to standard output. The first lambda reads a value from the input file and returns it. If the read operation returns false (meaning end-of-file), the empty optional is returned. The second stage is a farm replicating its lambda in four independent tasks. Each of them receives a number and returns its square. Finally, the third lambda receives a number and prints it to standard output. Note that all the communication and synchronization between stages is managed internally by GrPPI. The parallel_execution_native definition introduces a GrPPI object that describes the execution model to be used by the parallel pipeline. In this case, the native object is used to indicate that the native execution model (C++ threads) should be used; in order to use, e.g., TBB instead, one would replace this with parallel_execution_tbb. Further details on the interfaces and semantics of patterns available in GrPPI can be found in [16].
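The stream structure just described can be approximated in standard C++ alone. The following sketch is not GrPPI code: run_stream and square_all are hypothetical names introduced purely for illustration, and the sketch assumes a C++17 compiler. It models a generator that signals end-of-stream with an empty optional, a farm of four workers, and a sink. Unlike a real farm, it processes items in batches and always delivers results in input order.

```cpp
#include <future>
#include <optional>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of GrPPI): pulls items from `generate` until
// it returns an empty optional, applies `work` to up to `workers` items at a
// time on separate threads, and hands results to `consume` in input order.
template <typename Gen, typename Work, typename Sink>
void run_stream(Gen generate, int workers, Work work, Sink consume) {
    using In = typename decltype(generate())::value_type;
    using Out = decltype(work(std::declval<In>()));
    bool more = true;
    while (more) {
        std::vector<std::future<Out>> batch;
        for (int w = 0; w < workers && more; ++w) {
            auto item = generate();
            if (item) {
                batch.push_back(std::async(std::launch::async, work, *item));
            } else {
                more = false;  // an empty optional marks the end of the stream
            }
        }
        for (auto& result : batch) consume(result.get());
    }
}

// Reads whitespace-separated integers from `text`, squares each one in a
// four-worker farm stage, and returns the results, one per line.
std::string square_all(const std::string& text) {
    std::istringstream in(text);
    std::ostringstream out;
    run_stream(
        [&in]() -> std::optional<int> {  // generator stage
            int n;
            if (in >> n) return n;
            return std::nullopt;
        },
        4,                               // number of farm workers
        [](int x) { return x * x; },     // farmed stage
        [&out](int y) { out << y << '\n'; });  // sink stage
    return out.str();
}
```

Only the generator and the sink touch the streams, and both run on the calling thread, so the farmed lambda needs no synchronisation of its own.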

Refactoring for Introducing a GrPPI Pipeline
We define a refactoring that converts a C++ for loop into a GrPPI pipeline pattern, containing one or more stages that are executed concurrently to process a sequence of data items indexed by the loop. We currently support both sequential stages that process a single data item at a time, and farm stages that process multiple items in parallel. For a single farm pattern, our current refactoring approach would transform the code into a pipeline with a single stage that is farmed.

Refactoring Strategy
To refactor a for loop, the loop must be of the form for (T x=e1; e2; e3) {...} for some variable, x, and type, T, where T may be omitted. As is standard, e1 is an initial value for x, e2 is some bounding condition, and e3 is an expression that updates the value of x. When e2 is empty, the pipeline will run forever. The refactoring requires that e3 must not be empty, and that the loop must declare or initialise a single variable in the initialiser; this variable represents data items passing through the pipeline. These patterns enable the refactoring of common types of loop.
The refactoring requires that the body of the for loop is a compound statement enclosed in braces, i.e. {...}, and is dependent upon the existence of pragmas in the loop body that indicate the pipeline stages. These pragmas may be introduced manually by the programmer, or automatically via some tool, which we intend to investigate as part of future work. We note that these pragmas are not part of GrPPI, and do not affect the functional behaviour of the program in any way, but are instead introduced here as an aid to the refactoring. These pragmas include:
- #pragma grppi seq stage
- #pragma grppi farm stage n
A GrPPI farm stage pragma must specify the number of threads used to execute the farm, given by n, where n is a literal integer or a variable name bound to a literal integer. It is possible to create a pipeline that comprises a single farm stage: this is equivalent to running multiple (and possibly all) iterations of the loop concurrently. Once invoked, the refactoring requires the programmer to provide: a name for the pattern to be inserted; the model of parallelism, e.g. C++ threads or TBB; and any additional headers for other modes. The refactoring procedure creates a GrPPI pipeline object and inserts a source object that returns consecutive values for x in the form of optional items, with an empty value when there are no data items left. The pipeline stages in the loop body are converted into a sequence of lambda expressions following the source. For example, a loop that iterates over the array xs, applying the composition of s1 and s2, is transformed into a pipeline whose source yields the successive data items and whose stages apply s1 and then s2. If a variable is declared inside a loop stage but required in a later stage, it will be returned from the corresponding lambda and passed as an argument to the following lambda: if there is more than one such variable then they will be returned packed into a tuple which will then be unpacked into variables in the next stage. 
Variables that are declared outside the loop are captured in the lambda expressions by reference, which means that the lambdas can modify them. This may lead to race conditions. They can also be captured by value when no modification is needed. This option improves thread safety.
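The shape of this transformation can be illustrated without GrPPI itself. In the sketch below, the stage bodies s1 and s2, the function names, and the sequential driver loop are all assumptions made for illustration (C++17 is assumed); the driver merely stands in for the pattern's runtime. The source lambda yields consecutive loop indices as optional values, and the two variables declared in the first stage are packed into a tuple and unpacked in the second.

```cpp
#include <cstddef>
#include <optional>
#include <tuple>
#include <vector>

// Hypothetical stage bodies standing in for s1 and s2.
int s1(int x) { return x + 1; }
int s2(int x, int y) { return x * y; }

// Original form: for (T x = e1; e2; e3) { stage 1; stage 2 }.
std::vector<int> loop_version(const std::vector<int>& xs) {
    std::vector<int> out;
    for (std::size_t i = 0; i < xs.size(); i++) {
        int a = xs[i];            // stage 1 declares a and b
        int b = s1(a);
        out.push_back(s2(a, b));  // stage 2 needs both of them
    }
    return out;
}

// Refactored shape: a source lambda followed by one lambda per stage.
std::vector<int> staged_version(const std::vector<int>& xs) {
    std::vector<int> out;
    std::size_t i = 0;  // e1: initial value of the loop variable
    auto source = [&]() -> std::optional<std::size_t> {
        if (i < xs.size()) return i++;  // e2 holds: yield the item, then apply e3
        return std::nullopt;            // no data items left
    };
    auto stage1 = [&xs](std::size_t idx) {
        int a = xs[idx];
        int b = s1(a);
        return std::make_tuple(a, b);   // both are needed later: pack them
    };
    auto stage2 = [&out](std::tuple<int, int> packed) {
        auto [a, b] = packed;           // unpack in the next stage
        out.push_back(s2(a, b));
    };
    // Sequential driver standing in for the pipeline runtime.
    while (auto item = source()) stage2(stage1(*item));
    return out;
}
```

Note that stage2 captures out by reference and modifies it. That is harmless in a sequential stage, but it is exactly the kind of capture that would need checking before the stage could be farmed.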

Refactoring Observations
This paper presents a full framework for tool-supported parallel programming in C++. GrPPI provides a high-level unified interface to the underlying skeleton implementation, providing a palette of parallelisations to execute. Refactoring provides the user with the choice and guidance to ensure a correct and safe implementation of the chosen skeleton. Although GrPPI provides a high-level, easily accessible interface to skeletal programming in C++, refactoring is still a necessary step and provides a number of unique benefits over manual parallelisation. Refactoring helps the user make decisions about their program and the appropriate skeletal configuration to choose for their application. Our refactoring support, together with its safety checking, avoids common pitfalls in parallel programming, such as introducing deadlocks and race conditions, which are notoriously difficult and subtle errors to find and repair. Refactoring follows very precise (and well-understood) transformation rules that are based on well-known semantics-preserving program rewriting techniques, ensuring that only correct programs can be derived, saving the programmer time and effort in fixing bugs. Annotating the source program with pragmas differs from previous refactoring approaches to introducing farms and pipelines [29]. Identifying the stages and/or components of the skeletons using pragmas also allows for further tool support to discover candidates for parallelism [10]. Static analysis techniques (such as those in [17]) can identify the components of the skeletons and insert annotations into the source code that the refactorings can then use.

Safety Checking
The refactorings presented in this paper should preserve the correctness of the functional semantics of the C++ program. This means that when given the same input value(s), the program should produce the same output value(s) before and after a refactoring, up to a given ordering. This is ensured by safety checking, a fundamental feature in the ParaFormance tool. It gives confidence that the code being refactored is safe for parallelisation, meaning that it is free from dependencies and side effects (such as writing to a global state) and that it does not contain code that may interfere with the parallel logic. In the ParaFormance tool, we centred the safety checking around Array Race Detection Analysis, a common technique that is based on Pugh's Omega Test Library [39] and standard compiler techniques, such as data dependency analysis and those described in [2] and [37]. We give some details of our implementation in this section. The analysis is conservative. For example, when the bounds of a loop mean that it will only execute with i=0 and i=1, there cannot be a race with A[2]. However, our test assumes that i can take any positive value, and so will report a possible data race at i=2 (even though the user may be able to see that the loop never reaches i=2). The loop increment also matters: with one increment there will not be a race condition, but there will be one if i+=9 is put instead. Other safety checks that we implement as part of this framework will be described in future work.
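To give a flavour of such a conservative check, the sketch below uses the classic GCD dependence test, a much simpler stand-in for the Omega test and not the analysis implemented in ParaFormance. It assumes array accesses of the form A[stride * i + offset]; the Access type and the may_race function are hypothetical names. Like the test described above, it ignores the loop bounds, so a result of true means only that a race is possible. C++17 is assumed for std::gcd.

```cpp
#include <cstdlib>
#include <numeric>

// An array access of the form A[stride * i + offset] made in iteration i.
struct Access {
    long stride;
    long offset;
};

// Conservative check: can a write made in one iteration touch the same
// element as another access made in a *different* iteration? Loop bounds
// are ignored, so "true" means "possible race", not "definite race".
bool may_race(Access write, Access other) {
    const long diff = other.offset - write.offset;
    if (write.stride == other.stride) {
        // Every iteration hits one fixed cell: a conflict iff it is the same cell.
        if (write.stride == 0) return diff == 0;
        // stride * (i - j) == diff must hold for some i != j.
        return diff != 0 && diff % write.stride == 0;
    }
    // GCD test: write.stride * i - other.stride * j == diff has an integer
    // solution exactly when gcd(strides) divides diff.
    const long g = std::gcd(std::labs(write.stride), std::labs(other.stride));
    return diff % g == 0;
}
```

For instance, a loop that writes A[i] and reads A[i+2] is flagged, whereas one that writes A[2*i] and reads A[2*i+1] is not, since even and odd indices never coincide.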

Evaluation
In this section, we present an evaluation of the refactorings to introduce GrPPI patterns into sequential code. We consider four benchmark applications: Mandelbrot, Matrix Multiplication, Ant Colony Optimisation and Image Convolution. As described in Table 1, the benchmark applications belong to different domains and also contain different compositions of parallel patterns, hence the refactorings applied for their parallelisation are different. In addition, refactoring some of them exposes safety problems that are caught by the safety analysis, whereas the refactoring process for others is more straightforward. For each, we start with a given sequential version of a benchmark. We then use the refactorings described in Sect. 3 to introduce GrPPI patterns, using interfaces to C++ threads and TBB (GrPPI Native and GrPPI TBB, respectively, in the graphs below). This produces the refactored parallel versions. To measure the performance of these versions, we compare them with the manually-produced parallel versions of the baseline applications (Par Manual in the graphs below). These versions have been written by hand using C++ threads and are highly optimised. Our goal is to verify how the execution time of the refactored parallel versions compares with that of good hand-produced parallel code. All of our execution experiments are conducted on a server with a 28-core Intel Xeon E5-2690 CPU running at 2.6 GHz, with 256 GB of RAM, and the Scientific Linux 6.2 operating system.

Mandelbrot
Mandelbrot is a simple benchmark that calculates a Mandelbrot set for a set of points in the complex plane and visualises it. A point $C$ from the complex plane is in the Mandelbrot set if the orbit $z_n$ of the point, obtained using the recurrence relation $z_{n+1} = z_n^2 + C$, does not tend to infinity. The set can be visualised by colouring the points in the complex plane based on the number of steps of the recurrence relation required to reach the maximum radius. The relevant part of the annotated original sequential version is given in Listing 3. Note that the pragma at Line 7 denotes the loop on Lines 8-13 as a candidate for parallelisation using the farm pattern. However, if we try to refactor this loop by replacing it with an instance of the farm pattern, the safety checking analysis recognises that there is a global variable k that is incremented in each iteration of the loop and that, therefore, straightforward parallelisation of this loop using the farm pattern would introduce race conditions. The ParaFormance tool also suggests rewriting the code into an equivalent version which avoids this problem by induction variable substitution, i.e. replacing the global variable k inside the loop with a local one that is calculated based on the loop index. Listing 4 shows this version. As explained in Sect. 3, the GrPPI TBB version of the code can easily be obtained from the refactored GrPPI Native version in Listing 5 by replacing parallel_execution_native with parallel_execution_tbb on Line 4. Finally, Listing 6 shows the manually produced parallel version of the code, using the TBB library. Figure 1a shows the speedups obtained for the GrPPI Native, GrPPI TBB and Manual Parallelisation versions of the code, with respect to the number of workers used in the farm pattern of the GrPPI and manual TBB versions. 
We can observe that all the versions give very good and comparable results, so in this case the refactored version of the code produced semi-automatically is as good in terms of performance as is the manually produced parallel version.
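Induction variable substitution, as suggested by the tool for this benchmark, can be sketched as follows. The function names, the row-major image layout and the placeholder per-pixel value are assumptions made for illustration and are not taken from the benchmark source. In the first version the counter k is carried from one iteration to the next; in the second it is recomputed from the loop indices, so no iteration depends on another.

```cpp
#include <vector>

// Before: a counter declared outside the loop is incremented in every
// iteration, so iterations cannot safely run in parallel.
std::vector<int> fill_with_counter(int rows, int cols) {
    std::vector<int> image(rows * cols);
    int k = 0;  // shared induction variable
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            image[k] = i + j;  // placeholder for the per-pixel computation
            k++;
        }
    }
    return image;
}

// After: k is local to the iteration and derived from the loop indices,
// which removes the loop-carried dependence.
std::vector<int> fill_with_index(int rows, int cols) {
    std::vector<int> image(rows * cols);
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            const int k = i * cols + j;  // closed form of the old counter
            image[k] = i + j;
        }
    }
    return image;
}
```

Both functions fill the image identically; only the second is a candidate for a farm, because each iteration now writes to a location that it can compute on its own.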

Matrix Multiplication
Matrix multiplication is one of the most commonly used simple parallel benchmarks that demonstrates the use of the map or farm pattern. Listing 7 shows the sequential version of the benchmark that multiplies matrix a with matrix b and stores the result in res. We can parallelise this version by assigning a separate task/thread to each call to the multiply_row_by_column function, as these calls are completely independent of each other. This would, however, create n × n tasks/threads for multiplying two n × n matrices, making the parallelisation too fine-grained (and, indeed, infeasible if the C++ threads are used for parallelisation). Therefore, both in our hand-tuned baseline parallelisation using C++ threads and in the GrPPI version, we parallelise only the loop in the matrix_multiply function (Line 16 in Listing 7), assigning separate tasks for each call to the multiply_row_by_matrix function. In the baseline version, shown in Listing 8, we further use chunking to increase granularity, grouping multiple calls to the function into a single thread, so that we have exactly as many threads as there are, for example, cores on a multicore machine where the application is executed, and each of them executes a series of calls to the multiply_row_by_matrix function. As explained in Sect. 2, to derive the OpenMP or TBB versions, we would just need to replace the parallel_execution_native with parallel_execution_openmp or parallel_execution_tbb, respectively. This assigns a separate task to each call to the multiply_row_by_matrix function. Note that we do not use chunking in this version, as this would require use of the map pattern which is present in GrPPI, but this is outside of the scope of this paper. Note also that this example does not introduce any safety problems, as there are no race conditions. 
Figure 1 shows the speedups obtained for the Baseline, GrPPI Native and GrPPI TBB versions of the code, with respect to the number of workers used in the farm pattern of the GrPPI versions and the number of threads used in the baseline version. We can note that all versions give very good speedups, comparable to those of the GrPPI Native version, with the GrPPI TBB version being slightly faster than the Baseline version. However, it is worth noting that the GrPPI Native and GrPPI TBB versions use one thread more than the Baseline version, because there is a separate thread assigned to the first stage of the pipeline (Lines 6-10 in Listing 9). Therefore, speedups when the same number of threads is used would be approximately the same.
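The chunking used in the baseline version can be sketched with plain C++ threads. This is an illustrative reconstruction under stated assumptions, not the code of Listing 8: the Matrix alias, the body of multiply_row_by_matrix and the function matrix_multiply_chunked are hypothetical. Rows are split into contiguous chunks, one per thread, and each thread makes a series of calls to the row function.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Computes one row of a * b into res. The name follows the text; the body
// is an assumption.
void multiply_row_by_matrix(const Matrix& a, const Matrix& b, Matrix& res,
                            std::size_t row) {
    for (std::size_t col = 0; col < b[0].size(); col++) {
        double sum = 0.0;
        for (std::size_t k = 0; k < b.size(); k++) sum += a[row][k] * b[k][col];
        res[row][col] = sum;
    }
}

// Splits the rows into contiguous chunks and gives each chunk to one thread.
Matrix matrix_multiply_chunked(const Matrix& a, const Matrix& b,
                               std::size_t nthreads) {
    const std::size_t n = a.size();
    Matrix res(n, std::vector<double>(b.empty() ? 0 : b[0].size(), 0.0));
    if (n == 0 || b.empty()) return res;
    nthreads = std::max<std::size_t>(nthreads, 1);
    const std::size_t chunk = (n + nthreads - 1) / nthreads;  // rows per thread, rounded up
    std::vector<std::thread> pool;
    for (std::size_t begin = 0; begin < n; begin += chunk) {
        const std::size_t end = std::min(begin + chunk, n);
        pool.emplace_back([&a, &b, &res, begin, end] {
            for (std::size_t row = begin; row < end; row++)
                multiply_row_by_matrix(a, b, res, row);
        });
    }
    for (auto& worker : pool) worker.join();
    return res;
}
```

Each thread writes only to its own rows of res, so no synchronisation is needed beyond the final join, which is consistent with the observation that this example raises no safety problems.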

Image Convolution
Image convolution is a technique widely used in image processing applications for blurring, smoothing and edge detection. We consider an instance of the image convolution from video processing applications, where we are given a list of images that are first read from a file and then processed by applying a filter. Applying a filter to an image consists of computing a scalar product of the filter weights with the input pixels within a window surrounding each of the output pixels. An obvious parallelisation of this algorithm is to set up a pipeline where the first stage reads images from a file and the second stage applies the filter to the read images. Each of the stages can be further farmed, so that we can read and process multiple images at the same time. An alternative parallelisation is to set up a farm, where each worker first reads an image from a file and then processes that image. The former parallelisation is better if we need to have a different number of workers in the two farms, i.e. if reading images is notably slower than their processing, whereas the latter one is better if the two operations are of approximately the same computational cost. To demonstrate the refactoring of this example into parallelised GrPPI versions, we start with the original sequential version in Listing 10, annotated with a GrPPI pragma (Line 2) indicating a single candidate for parallelisation using the farm pattern, and refactor it to a single farm. Running this through the ParaFormance tool, we are able to refactor the code into the GrPPI parallelisation shown in Listing 11. In this example, the code passes the safety-checking phase of the refactoring process, since Image Convolution is a classical parallelisation example and a relatively straightforward application to parallelise. Speedup results for the GrPPI farm version of Image Convolution are shown in Fig. 
3a, where we show speedups for GrPPI versions utilising both the native backend and the TBB backend. Furthermore, we produce a native TBB version, shown in Listing 12. Figure 3a compares all versions of the farmed application, with comparable results. One benefit of refactoring over manual parallelisation is that it easily offers a choice of different parallelisations. For example, instead of the typical parallelisation of Image Convolution using a farm, we can parallelise it with a pipeline, farming different stages in an attempt to increase the parallelism. This can be achieved by returning to the sequential application, by choosing undo from the ParaFormance refactoring menu, and then adjusting the GrPPI pragmas so that the stages of the pipeline are properly outlined, as shown in Listing 13. Here, we annotate the sequential version with two pragmas: one, at Line 4, indicating a farm stage for the computation read_image_and_mask, and a further one, at Line 6, for the computation process_image. For both of these pragmas, we use defined variables, farm1 and farm2, indicating the number of farm workers for each stage of the pipeline. The refactored code is shown in Listing 14. Speedup results for Image Convolution are shown in Figs. 2a (for the GrPPI Native pipeline version), 2b (for the GrPPI TBB pipeline version) and 3a (for the farm version, including the baseline TBB parallelisation). In Figs. 2a, b, one dimension of the graphs shows the varying number of workers for the farm in the first pipeline stage ($\Delta_1$) and the x-axis shows the increasing number of workers for the farm in the second stage ($\Delta_2$). We can observe good speedups of up to 13.98 for the GrPPI Native pipeline (obtained with 2 $\Delta_1$ workers and 16 $\Delta_2$ workers), 21.23 for the GrPPI TBB pipeline version and 21.43 for the GrPPI Native farm version. We can also observe in Fig. 3a that the GrPPI versions perform approximately the same as the native TBB version, giving almost the same speedups.
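The per-image filtering step described at the start of this section, a scalar product of the filter weights with the pixels in a window around each output pixel, can be written as a short function. The sketch below is an assumption-laden illustration rather than the benchmark's process_image: the Image alias and apply_filter are hypothetical names, the mask is assumed to be square with odd side length, and pixels outside the image are treated as zero.

```cpp
#include <cstddef>
#include <vector>

using Image = std::vector<std::vector<float>>;

// For each output pixel (r, c), sums mask(i, j) * in(r + i - half, c + j - half)
// over the mask window, where half = mask.size() / 2. Window positions that
// fall outside the image contribute nothing (an assumption of this sketch).
Image apply_filter(const Image& in, const Image& mask) {
    const long rows = static_cast<long>(in.size());
    const long cols = rows ? static_cast<long>(in[0].size()) : 0;
    const long k = static_cast<long>(mask.size());
    const long half = k / 2;
    Image out(rows, std::vector<float>(cols, 0.0f));
    for (long r = 0; r < rows; r++) {
        for (long c = 0; c < cols; c++) {
            float acc = 0.0f;
            for (long i = 0; i < k; i++) {
                for (long j = 0; j < k; j++) {
                    const long rr = r + i - half;
                    const long cc = c + j - half;
                    if (rr >= 0 && rr < rows && cc >= 0 && cc < cols)
                        acc += mask[i][j] * in[rr][cc];
                }
            }
            out[r][c] = acc;
        }
    }
    return out;
}
```

Each call reads its inputs and writes only its own output image, which is why whole images can be handed to farm workers independently.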

Ant Colony Optimisation
Ant Colony Optimisation (ACO) [20] is a metaheuristic used for solving NP-hard combinatorial optimisation problems. In this paper, we apply ACO to the Single Machine Total Weighted Tardiness Problem (SMTWTP), where we are given $n$ jobs and each job, $i$, is characterised by its processing time, $p_i$, deadline, $d_i$, and weight, $w_i$. The goal is to find the schedule of jobs that minimises the total weighted tardiness, defined as $\sum_{i=1}^{n} w_i \cdot \max\{0, C_i - d_i\}$, where $C_i$ is the completion time of the job, $i$. The ACO solution to the SMTWTP problem consists of a number of iterations, where in each iteration each ant independently computes a schedule, and is biased by a pheromone trail that is stronger along previously successful routes. After all the ants have finished computing solutions in one iteration, results are gathered, the new best one is picked, the pheromone trail is updated accordingly and the next iteration starts. The relevant part of the original sequential code is given in Listing 15, with the addition of a pragma on Line 3. One of the parallelisations of this algorithm is to set up a sequential pipeline, where ants compute solutions in the first stage, and in the second stage, the new running best solution is picked and the pheromone trail is updated. Note that the second stage is inherently sequential, but the first stage can be farmed, assigning a separate task for each ant. This is the parallelisation that we used; its speedup results are shown in Fig. 3. We can observe a speedup of up to 23.93 for the GrPPI Native version with 28 workers and a comparable speedup of 23.90 with the GrPPI TBB version, also for 28 workers. We compare this to a baseline TBB parallelisation that doesn't use GrPPI and that gives comparable speedups of 22
[Listing residue; only the tail of the code survives:]
    cost[j] = solve(j);
    }));

    best_t = pick_best(cost, &best_result);
    update(best_t, best_result);
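The objective that each ant evaluates, the sum over all jobs of the weight multiplied by the amount by which the completion time exceeds the deadline (or zero if the job is on time), can be computed as follows. The Job structure and the function name are hypothetical, and a schedule is assumed to be a permutation of job indices executed back to back from time zero.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One job of the scheduling problem (field names are assumptions).
struct Job {
    long processing_time;
    long deadline;
    long weight;
};

// Total weighted tardiness of running the jobs in the order given by
// `schedule`, starting at time zero with no idle time between jobs.
long total_weighted_tardiness(const std::vector<Job>& jobs,
                              const std::vector<std::size_t>& schedule) {
    long time = 0;
    long total = 0;
    for (std::size_t index : schedule) {
        const Job& job = jobs[index];
        time += job.processing_time;  // completion time of this job
        total += job.weight * std::max(0L, time - job.deadline);
    }
    return total;
}
```

With three jobs (processing time, deadline, weight) of (2, 2, 1), (3, 4, 2) and (1, 3, 3), the order 0, 1, 2 costs 11 while the order 0, 2, 1 costs 4, so the second schedule would be preferred.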

Related Work
Refactoring has roots in Burstall and Darlington's fold/unfold system [13], and has been applied to a wide range of applications as an approach to program transformation [35], with refactoring tools a feature of popular IDEs including, i.a., Eclipse [22] and Visual Studio [36]. Previous work on parallelisation via refactoring has primarily focussed on the introduction and manipulation of parallel pattern libraries in C++ [11,29] and Erlang [6,9]. Another approach has been the automated introduction of annotations in the form of C++ attributes [17]. Parallel design patterns, or algorithmic skeletons, were suggested as a solution to the difficulties presented by low-level approaches [4,24]. A range of pattern/skeleton implementations have been developed for a number of programming languages; these include: RPL [29]; Feldspar [5]; FastFlow [1]; Microsoft's Parallel Patterns Library [14]; and Intel's Threading Building Blocks (TBB) library [42]. Since patterns are well-defined, rewrites can be used to automatically explore the space of equivalent patterns, e.g. optimising for performance [26,34] or generating optimised code as part of a DSL [25]. Moreover, since patterns are architecture-agnostic, patterns have been similarly implemented for multiple architectures [28,41]. This introduces a level of specialisation, and the possibility of choice between pattern implementations. Conversely, GrPPI [16] is capable of invoking other libraries, and is thereby able to take advantage of the specialisations that they present without potentially laborious reimplementation. Elsewhere, approaches to automatic parallelisation have traditionally focussed on the transformation of loops. Examples include Lamport's early approaches in Fortran [30], Artigas' approach for Java [2], work on doall and doacross loops [12,33], the polyhedral model [3,7,8] and, more recently, the generation of pipelines [45,46]. 
Other approaches to automatic parallelism have included a focus on coarsely dividing programs into sections that can be run in parallel [32,43]; less abstractly, on exploiting potential parallelism at the instruction level [44]; and on exploiting specialised hardware such as GPUs for automatic parallelisation [27,31]. Whilst fully automatic approaches simplify the parallelisation process for the programmer by removing them from the process, such approaches can be very specific in both the parallelism they are able to introduce and the code to which they can be applied. Conversely, programmer-in-the-loop approaches, such as refactoring, allow the programmer to employ their knowledge about both code and parallelism. Similar to our approach, Dig et al. [18] use refactoring to introduce parallelism in Java. However, unlike our approach, Dig et al. introduce low-level Java concurrency primitives instead of patterns. More recently, Radoi and Dig consider data races in Java for parallelism, a key aspect of safety checking [40]. Other safety checking aspects are covered by work on deadlock detection [15]. PPAT [19] is a parallel pattern identification tool that uses static analysis and instrumentation to find pipeline and farm patterns. PPAT is designed as an offline (i.e. manual) refactoring framework with no support for interactive refactoring. Another notable difference from our approach is the integration of safety checks, which are not considered by PPAT. Additionally, while PPAT uses instrumentation to perform dynamic analysis and then update annotations used for refactoring, we avoid the need to instrument and rerun the application by allowing developers to insert annotations in the form of pragmas that will be understood by the refactoring framework.

Conclusions and Future Work
In this paper, we presented new refactorings for C++ that transform sequential code into equivalent parallel implementations using the GrPPI framework. These refactorings are implemented in the ParaFormance tool. Targeting GrPPI allows the programmer to refactor their sequential C++ code into one parallel version that targets many different backends, such as C++ threads, TBB, FastFlow, and OpenMP, without having to be a domain expert in parallel programming, or have expertise or knowledge in any of the available parallel libraries. We also presented safety checking mechanisms that ensure the applied refactorings are correct, i.e. that they do not break the semantics of the sequential code. We also demonstrated that we are able to derive good parallel code with the refactorings, achieving speedups similar to the hand-tuned parallel versions. This shows that we are able to produce, with little programming effort, scalable and portable parallel code. In future, we plan to extend our refactorings and safety checking techniques further, to support additional patterns, such as stencil, divide-and-conquer and reduce. We also plan to evaluate the refactorings on larger use-cases.