Application of Multi-Core Architecture to the MpdRoot Package for the ToF Event Reconstruction Task

  • Oleg Iakushkin
  • Anna Fatkina
  • Alexander Degtyarev
  • Valery Grishkin
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10408)

Abstract

In this article, we propose an approach that accelerates the implementation of the Time-of-Flight (ToF) event reconstruction algorithm, which is part of the MpdRoot application for the Multi-Purpose Detector (MPD).

Work on the algorithm was carried out in several stages: the program was built on the target devices (Intel Xeon E5-2690 v3 and E5-2695 v2); profiling was performed with Valgrind; we selected the code fragment whose execution takes the longest time; and several parallelization strategies were investigated, with the optimal code-enhancement strategy implemented for the hardware in question.

Modification of the selected code fragment was carried out using the OpenMP standard. It is widely used in scientific applications, including the reconstruction of events in the PANDA experiment, and has proven useful for work on multi-core architectures. The standard is supported by the GCC compiler used to build the MpdRoot framework, which makes it possible to integrate this technology into a fragment of the MpdRoot package without changing the structure or build options of the framework.

Owing to our optimizations, the algorithm was accelerated on the multi-core architectures at hand. The paper shows the direct dependence of the accelerated fragment's execution time on the number of available cores for a given amount of input data. Tests were conducted on nodes of the JINR heterogeneous cluster “HybriLIT” and on a Microsoft Azure NC12 cloud node. The paper also analyzes the possibilities of optimizing the code for Intel Xeon Phi coprocessors and the problems that we encountered while trying to implement these optimizations.

Keywords

ToF · MPD · Parallel computing · OpenMP · Reconstruction

1 Introduction

The experimental study of charged-particle collisions is carried out by means of particle accelerators. Accelerators are complex machines with many components; among them, a major role is played by particle collision detectors. The detectors have a module-based architecture, and each module requires its own piece of code that is part of the detector's software. The data obtained by the detector is analysed by a sequence of routines, which makes it necessary to enhance the performance of the basic code fragments [2, 4, 7, 8].

NICA (Nuclotron-based Ion Collider fAcility) is one of the research facilities in this field. It includes the Multi-Purpose Detector (MPD), which obtains experimental data to be further modelled and processed by MpdRoot, a framework developed for the MPD. MpdRoot is based on the FairRoot and ROOT projects, which are widely used in nuclear physics research by centres such as CERN and FAIR [1, 3].

There is an API available for ROOT that makes it possible to execute algorithms for MpdRoot. The algorithms are implemented as macros, that is, special script files that are fed to ROOT as input.

We focused our attention on the event reconstruction algorithm. After the algorithm is launched, it is executed as follows: first, it reads data modelling results obtained by the MPD; second, it uses the FairRoot task manager to start a sequence of jobs. Among these jobs, Time-of-Flight (ToF) matching consumes a major portion of the runtime.

This paper examines ways to optimize the implementation of the ToF algorithm in MpdRoot. We propose algorithm modifications based on a parallel programming approach.

2 Problem Statement

The algorithm in question has a code fragment whose runtime is a quadratic function of the input data volume, with the input data volume being unknown in advance. This code fragment consumes 47.3% of the algorithm's entire runtime. This means that an increase in the input will significantly slow down the processing of the data obtained by the detector.

We examined a number of technologies for optimizing the code on a variety of devices. The choice was made in favour of multi-core CPUs, because porting the code to GPU coprocessors would be complicated: the code employs many data types defined in MpdRoot and ROOT, which makes it difficult to copy the data to such devices [9, 10, 12, 19]. Furthermore, a CPU has much more memory available than a GPU, which may play a key role when the input data becomes especially large [11, 13].

The optimization was carried out on Intel Xeon processors using the OpenMP standard. This standard is widely employed for parallel execution of algorithms, including those in MpdRoot and other packages used to process data in various nuclear physics experiments (PANDA, CBM) [16]. In addition, OpenMP is supported by the same compiler that is used to build MpdRoot: we used the OpenMP implementation provided by GCC.

We should note that using any other compiler for the entire project and its dependencies would be complicated, because the source code repeatedly uses a directive containing GCC-specific instructions (“gcc diagnostic”). With a different compiler, these instructions would be ignored.

The OpenMP implementation provided by GCC makes it possible to integrate this technology without changing the framework's structure, which would be necessary if different compilers were used within one project.

3 Analysis

We used the Valgrind Callgrind profiler to analyze MpdRoot's performance in terms of the bottlenecks that require optimization. This tool makes it possible to build call graphs of functions, taking into account their runtime and number of calls. We obtained profiling data for the event reconstruction algorithm and then selected the fragment whose execution was the most time-consuming.
Fig. 1. Part of the profiling results for the “reco.C” macro.

The profiling results in Fig. 1 show that the longest time is spent in the aggregate of invocations of the FindNeighbourStrips function. We examined the source code and identified that this function contains a nested loop. We then examined the source code of FindNeighbourStrips in order to modify it for parallel processing.
Fig. 2. The work of the method to be optimized, including the calls of ROOT functions.

Strips1 and strips2 are local variables, defined anew at each iteration of the external and internal loops, respectively. Calls of methods on strips1 can result in a conflict. Fig. 2 also shows calls of ROOT's Fill methods; conflicts are possible here as well, because the source code was written to process the original input data, not copies of it. To prevent possible errors, the affected code was isolated by means of the C++ “std::mutex” and “std::lock_guard” constructs.

This means that the parallel execution on different cores will not corrupt the result of the method we seek to optimize. In other words, the code fragment in question will be modified for parallel execution, while its result will not be affected.
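To make the isolation concrete, the following minimal sketch shows the locking pattern; the histogram stand-in and the function name are illustrative rather than the actual MpdRoot code:

```cpp
#include <mutex>
#include <vector>

// Stand-in for a shared ROOT object whose Fill() is not thread-safe;
// in the real code these are ROOT histogram instances.
struct SharedHistogram {
    std::vector<double> entries;
    void Fill(double v) { entries.push_back(v); }
};

SharedHistogram histo;   // shared between all worker threads
std::mutex fillMutex;    // serializes access to histo

// Called concurrently from the parallel loop body: the lock_guard
// acquires fillMutex on construction and releases it at scope exit,
// so simultaneous Fill() calls cannot corrupt the shared state.
void FillGuarded(double value)
{
    std::lock_guard<std::mutex> lock(fillMutex);
    histo.Fill(value);
}
```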

3.1 Optimization

We used GCC version 4.9.3 to compile MpdRoot's source code. This version supports OpenMP 4.0 and C++11. OpenMP 4.0 places restrictions on the types of loop variables [17]: they must be integers, random-access iterators, or pointers.

The selected algorithm contains a code fragment that can be optimized for parallel execution. Its implementation uses instances of classes defined in MpdRoot. Specifically, the loop requiring optimization traverses instances of the MStripIT and MStripCIT classes. These types serve as wrappers for iterators and belong to the MpdToF class. This structure impedes the use of OpenMP parallel loops.

We decided to wrap the traversed elements into a std::vector. This allowed us to use the range-based for loop supported in C++. The following modifications were made to prepare the code for optimization (a sketch follows the list):
  • The code fragments run at each iteration were subdivided into two parts that can, in theory, compete. Each part is now represented by an inline function.

  • The critical code fragments executed within one iteration were supplemented with a “std::lock_guard” object to prevent errors that could emerge when several threads process the same data.

The source code of the loop under consideration was modified to accommodate these changes. We also added OpenMP directives that allow the code to run in parallel.
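The sketch below illustrates the vector wrapping and the split into inline functions; the Strip type, the container, and the function names are placeholders for the corresponding MpdRoot entities, not the actual code:

```cpp
#include <map>
#include <vector>

// Placeholder for the strip records traversed by FindNeighbourStrips.
struct Strip { int id; /* ... */ };
using StripMap = std::multimap<int, Strip>;  // iterated via wrapped iterators

// Copy pointers to the traversed elements into a vector, so the loop
// can be expressed as a range-based for (or an indexed loop) that the
// OpenMP 4.0 loop constructs can handle.
std::vector<const Strip*> WrapIntoVector(const StripMap& strips)
{
    std::vector<const Strip*> out;
    out.reserve(strips.size());
    for (const auto& kv : strips)
        out.push_back(&kv.second);
    return out;
}

// One of the inline parts of the former iteration body; the part that
// touches shared state is additionally protected by a lock_guard
// (see the FillGuarded sketch above).
inline void ProcessOuterIteration(const Strip* s)
{
    // ... neighbour-strip computation for s ...
}
```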
Fig. 3. Fragment of the “reco.C” macro profiling after modification.

We employed two ways of using OpenMP. The first used the range-based for loop: parallel regions were defined by means of the “parallel”, “single”, and “task” directives, with each “task” executing the code fragment that corresponded to the loop body in the unmodified version of the source code.
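A minimal sketch of this task-based variant, reusing the placeholder names introduced above:

```cpp
#include <omp.h>
#include <vector>

struct Strip;                                 // placeholder, as above
void ProcessOuterIteration(const Strip* s);   // former loop body

void FindNeighboursTaskBased(const std::vector<const Strip*>& strips)
{
    #pragma omp parallel     // create the thread team
    #pragma omp single       // one thread walks the range and spawns tasks
    {
        for (const Strip* s : strips) {
            #pragma omp task firstprivate(s)  // each task runs one iteration
            ProcessOuterIteration(s);
        }
    }   // implicit barrier: all tasks finish before the region ends
}
```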

The second way used an integer loop variable: since the traversed elements had been wrapped into a vector, they could be accessed by index. To allow parallel execution, we used the “parallel for” directive with the “schedule” clause, which describes the load distribution between threads. The best result was obtained with “schedule(auto)”.
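A corresponding sketch of the indexed variant, under the same placeholder assumptions:

```cpp
#include <omp.h>
#include <vector>

struct Strip;                                 // placeholder, as above
void ProcessOuterIteration(const Strip* s);

void FindNeighboursParallelFor(const std::vector<const Strip*>& strips)
{
    const long n = static_cast<long>(strips.size());

    // An integer loop variable satisfies the OpenMP canonical loop form;
    // schedule(auto) delegates the load distribution to the runtime,
    // which gave the best results in our tests.
    #pragma omp parallel for schedule(auto)
    for (long i = 0; i < n; ++i)
        ProcessOuterIteration(strips[i]);
}
```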

On average, these two methods of parallel processing yielded the same acceleration. Fig. 3 shows how the call tree of functions in the examined fragment changed after the described modifications.

After the optimization, we performed data validation by comparing the results returned by the “sequential” and the optimized versions: the results were fully identical.
Fig. 4. Dependence between data volume and acceleration.

3.2 Tests

The tests were performed on the heterogeneous cluster HybriLIT and on a Microsoft Azure NC12 node. The following processors were used: Intel Xeon E5-2690 v3, E5-2695 v2, and E5-2695 v3.

The parameters of the processors are provided in Table 1.

Table 1. Parameters of the processors used in testing.

  Processor   | Cores | Threads
  E5-2695 v2  | 12    | 24
  E5-2695 v3  | 14    | 28
  E5-2690 v3  | 12    | 24

The “libgomp” library makes it possible to set the number of threads that execute the parallel regions of the code; furthermore, it allows particular threads to be pinned to cores by their numbers. The diagram below shows the test results obtained on CPUs of the HybriLIT cluster and the Azure NC12 node; it also shows how the runtime of the optimized code depends on the number of threads.
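A small sketch of controlling the team size from code; OMP_NUM_THREADS is standard OpenMP, GOMP_CPU_AFFINITY is libgomp-specific, and the binary name in the comment is illustrative:

```cpp
#include <omp.h>
#include <cstdio>

int main()
{
    // Equivalent to setting OMP_NUM_THREADS=12 in the environment;
    // with libgomp, threads can additionally be pinned to particular
    // cores by number, e.g. GOMP_CPU_AFFINITY="0-11" ./reco
    omp_set_num_threads(12);

    #pragma omp parallel
    {
        #pragma omp single
        std::printf("team size: %d\n", omp_get_num_threads());
    }
    return 0;
}
```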

Fig. 4 illustrates the dependence between the achieved acceleration (relative to sequential processing) and the input data volume. The number of iterations of one loop is shown on the x-axis. The diagram shows that the modified algorithm yields consistently better performance.

Fig. 5 shows that the best result was obtained with 28 threads on the E5-2695 v3 processor: compared to the sequential version, the runtime improved 23-fold. The loop is nested, and the numbers of external and internal loop iterations are equal; therefore, the total number of calls to internal functions is N*N, where N is the number of iterations of one of the two loops. The curves in the diagram show how execution time depends on the number of cores; the maximum number of cores that yields additional performance has not yet been reached. This suggests that systems with more cores, for example Intel Xeon Phi processors and coprocessors, can be used without a drop in performance.
Fig. 5. Testing.

3.3 Other Parallel Programming Options

The tests revealed that runtime is inversely proportional to the number of cores. As described above, the optimized algorithm allows multithreaded processing; however, even after optimization the selected code fragment remains one of the most time-consuming regions of the algorithm.

Arguably, further acceleration can be achieved by offloading the code fragment in question to an Intel Xeon Phi coprocessor. There are two families of Intel Xeon Phi [5, 20]: Product Family x200 processors and Product Family x100 coprocessors. Both are characterized by a larger number of cores than Intel Xeon: Intel Xeon Phi products have up to 72 cores, while Intel Xeon products have no more than 24. At the same time, Intel Xeon processors run at higher frequencies than Intel Xeon Phi, which has an impact on software performance.

The icc compiler developed by Intel is used to work with Intel Xeon Phi coprocessors [15]. The compiler supports the OpenMP standard and provides tools for data exchange between the CPU and coprocessors. There are two ways to copy data to and from the coprocessor:
  • by means of “#pragma offload” directive;

  • by means of “Cilk” keywords allowing work with Shared Memory.

The use of the former data-exchange model is limited by restrictions on the data types that can be handled by the coprocessor: data can be copied only if it is represented by arrays, scalars, or user-defined structures without pointers [18], as in the sketch below.
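For reference, a minimal sketch that stays within these restrictions (icc-only syntax; the function and its parameters are illustrative):

```cpp
#include <cstddef>

// Compiles only with the Intel compiler targeting a Xeon Phi x100
// coprocessor. Only flat data (here a plain double array) can cross
// the host/coprocessor boundary, which is exactly what rules out the
// ROOT and MpdRoot class instances used by the ToF code.
void ScaleOnCoprocessor(double* data, std::size_t n, double factor)
{
    #pragma offload target(mic:0) inout(data : length(n))
    {
        for (std::size_t i = 0; i < n; ++i)
            data[i] *= factor;
    }
}
```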

The selected code fragment uses instances of ROOT's and MpdRoot's classes, which are not supported by this model. The latter model of data exchange provides better opportunities for working with user-defined classes. We should note, however, that the keywords must be applied in advance to the classes, members, and methods whose elements will be copied to the coprocessor [14].
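A sketch of what this marking looks like (icc-only keywords; the class and functions are hypothetical):

```cpp
// _Cilk_shared places data and code in memory visible to both host and
// coprocessor; _Cilk_offload executes a call on the coprocessor. Every
// class, member, and function that crosses the boundary must be marked
// in advance, which is why ROOT classes would need source changes.
class _Cilk_shared StripSet
{
public:
    int count;
};

_Cilk_shared StripSet strips;    // lives in the shared address space

_Cilk_shared void Process()      // function also marked as shared
{
    strips.count = 0;            // may run on either side
}

void Run()
{
    _Cilk_offload Process();     // executed on the coprocessor
}
```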

The algorithm in question uses data types and methods defined in ROOT. Therefore, this model cannot be used to transfer data between the processor and the coprocessor without changing ROOT's source code to include the additional keywords, which would, in turn, necessitate the use of the icc compiler.

However, the use of icc or any other compiler to work with MpdRoot and its dependencies is impeded by the directives that specifically address GCC. Therefore, the transfer of these data structures and their dependencies constitutes a problem that goes beyond the MpdRoot package considered in this paper.

We have analysed another alternative for optimizing the code in question: the OpenACC standard. OpenACC makes it possible to execute parallel code not only on a CPU but also on coprocessors. However, using OpenACC or Intel Shared Memory to parallelize loops that call external functions requires special directives to declare such functions, which entails modifying a substantial part of the source code of MpdRoot and its dependencies.

We have previously considered the possibility of using a GPU with CUDA to accelerate selected fragments of MpdRoot [6]. In particular, we described ways to optimize a portion of the Kalman filter using CUDA. In that case, the optimization was possible because the relevant ROOT data types could easily be adapted to standard C++, and each iteration required an amount of memory comfortably available on GPUs.

When optimizing the code fragment in the ToF algorithm, we, conversely, have to deal with input data of varying size, while the algorithm itself works with complex data structures.

3.4 Conclusions

This paper presents an optimization of the ToF event reconstruction algorithm's implementation for the MPD that allows its execution on multi-core CPUs. We analyzed the existing implementation of the fragment in question and described source-code modifications that resulted in an up to 23-fold acceleration by means of multithreaded processing on Intel Xeon processors. The optimization used the OpenMP implementation provided by the GCC compiler. We also considered ways to perform the optimization on GPUs and coprocessors and outlined the challenges that emerged when we attempted to port the source code to such devices.

Acknowledgments

This research was partially supported by Russian Foundation for Basic Research grant (projects no. 16-07-01113 and no. 16-07-00886). Microsoft Azure for Research Award (http://research.microsoft.com/en-us/projects/azure/) as well as the resource center “Computer Center of SPbU” (http://cc.spbu.ru/en) provided computing resources. The authors would like to acknowledge the Reviewers for the valuable recommendations that helped in the improvement of this paper.

References

  1. Al-Turany, M., Bertini, D., Karabowicz, R., Kresan, D., Malzacher, P., Stockmanns, T., Uhlig, F.: The FairRoot framework. J. Phys. Conf. Ser. 396, 022001 (2012)
  2. Bogdanov, A.V., Degtyarev, A., Stankova, E.N.: Example of a potential grid technology application in shipbuilding. In: 2007 International Conference on Computational Science and its Applications (ICCSA 2007), pp. 3–8 (2007)
  3. Brun, R., Rademakers, F.: ROOT - an object oriented data analysis framework. Nucl. Instrum. Methods Phys. Res. Sect. A 389(1), 81–86 (1997)
  4. Chao, A., Mess, K., Tigner, M., Zimmermann, F.: Handbook of Accelerator Physics and Engineering. World Scientific Publishing Company (2013)
  5. Chrysos, G.: Intel Xeon Phi coprocessor - the architecture. Intel Whitepaper 176 (2014)
  6. Fatkina, A., Iakushkin, O., Tikhonov, N.: Application of GPGPUs and multicore CPUs in optimization of some of the MpdRoot codes. In: 25th Russian Particle Accelerator Conference (RuPAC 2016), St. Petersburg, Russia, 21–25 November 2016, pp. 416–418. JACoW, Geneva (2017)
  7. Gankevich, I., Gaiduchok, V., Gushchanskiy, D., Tipikin, Y., Korkhov, V., Degtyarev, A., Bogdanov, A., Zolotarev, V.: Virtual private supercomputer: design and evaluation. In: Ninth International Conference on Computer Science and Information Technologies, Revised Selected Papers, pp. 1–6 (2013)
  8. Gankevich, I., Korkhov, V., Balyan, S., Gaiduchok, V., Gushchanskiy, D., Tipikin, Y., Degtyarev, A., Bogdanov, A.: Constructing virtual private supercomputer using virtualization and cloud technologies. In: ICCSA 2014. LNCS, vol. 8584, pp. 341–354. Springer, Cham (2014). doi:10.1007/978-3-319-09153-2_26
  9. Grishkin, V., Iakushkin, O.: Middleware transport architecture monitoring: topology service. In: 2014 20th International Workshop on Beam Dynamics and Optimization (BDO), pp. 1–2 (2014)
  10. Iakushkin, O.: Cloud middleware combining the functionalities of message passing and scaling control. In: EPJ Web of Conferences, vol. 108 (2016)
  11. Iakushkin, O., Grishkin, V.: Messaging middleware for cloud applications: extending brokerless approach. In: 2014 2nd International Conference on Emission Electronics (ICEE), pp. 1–4 (2014)
  12. Iakushkin, O., Sedova, O., Valery, G.: Application control and horizontal scaling in modern cloud middleware. In: Transactions on Computational Science XXVII. LNCS, vol. 9570, pp. 81–96. Springer, Heidelberg (2016). doi:10.1007/978-3-662-50412-3_6
  13. Iakushkin, O., Grishkin, V.: Unification of control in P2P communication middleware: towards complex messaging patterns. AIP Conf. Proc. 1648(1), 040004 (2015)
  14. Iakushkin, O., Shichkina, Y., Sedova, O.: Petri nets for modelling of message passing middleware in cloud computing environments. In: ICCSA 2016. LNCS, vol. 9787, pp. 390–402. Springer, Cham (2016). doi:10.1007/978-3-319-42108-7_30
  15. Jeffers, J., Reinders, J.: Intel Xeon Phi Coprocessor High Performance Programming. Elsevier Science, Boston (2013)
  16. Kisel, I.: Scientific and high-performance computing at FAIR. In: EPJ Web of Conferences, vol. 95, p. 01007. EDP Sciences (2015)
  17. OpenMP Architecture Review Board: OpenMP application program interface version 4.0 (2013). http://www.openmp.org/wp-content/uploads/OpenMP4.0.0.pdf
  18. Rahman, R.: Intel Xeon Phi Coprocessor Architecture and Tools: The Guide for Application Developers. Apress (2013)
  19. Shichkina, Y., Degtyarev, A., Gushchanskiy, D., Iakushkin, O.: Application of optimization of parallel algorithms to queries in relational databases. In: ICCSA 2016. LNCS, vol. 9787, pp. 366–378. Springer, Cham (2016). doi:10.1007/978-3-319-42108-7_28
  20. Sodani, A., Gramunt, R., Corbal, J., Kim, H.S., Vinod, K., Chinthamani, S., Hutsell, S., Agarwal, R., Liu, Y.C.: Knights Landing: second-generation Intel Xeon Phi product. IEEE Micro 36(2), 34–46 (2016)

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Oleg Iakushkin (1)
  • Anna Fatkina (1)
  • Alexander Degtyarev (1)
  • Valery Grishkin (1)

  1. Saint Petersburg State University, St. Petersburg, Russia
