1 Introduction

High performance computing (HPC) is a foundational information technology and a key enabler of networked information systems. With the diversified development of chip technology, many kinds of high-performance processors have emerged, including CPUs, GPUs, MICs, and FPGAs, each suited to different application scenarios or algorithms [1, 2]. The simple computing mode of a single processor can no longer meet complex workload requirements [3]. To improve hardware processing capacity, the CPU is typically used as the main controller, with GPU, MIC, or FPGA accelerators attached over the PCIe bus to offload computing tasks; this is the CPU+X heterogeneous computing mode. Among these, CPU+GPU heterogeneous computing is the most mature and delivers the best performance [4]: the peak performance of the NVIDIA Tesla V100 GPU reaches 15 TFlops, and at the same computational accuracy a GPU-accelerated server can be dozens of times faster than a traditional CPU [5, 6]. This paper therefore studies CPU+GPU heterogeneous computing clusters. However, CPU+GPU heterogeneous computing raises two new problems [7, 8]: the scheduling of distributed computing resources, and the scheduling of tasks between CPU and GPU. Both can be addressed with multi-core, multi-processor techniques [9], which involve multi-core resource scheduling, multi-task scheduling, inter-processor communication, and load balancing.

Optimal scheduling of parallel tasks on multiple processors has been proven NP-hard [10]. TDS (Task Duplication Scheduling) [11] divides all tasks into multiple paths according to the dependency topology, and the tasks on each path are executed as a group on one processor; although this reduces latency and shortens running time, it increases energy consumption. In addition, CPUs and GPUs differ in hardware structure, application domain, and development model, resulting in poor portability [12]. Sourouri et al. [13] statically partitioned the workload of a simple 3D 7-point stencil computation between CPU and GPU, achieving a 1.1-1.2x speedup. Pereira et al. [14] demonstrated simple static load balancing between CPU and GPU on a single stencil application, achieving up to a 1.28x speedup, and later [15] applied time tiling in the same PSkel framework to reduce CPU-GPU communication at the cost of redundant computation. Most of these approaches use static load balancing and consider only a single (often repeated) stencil, so they are difficult to extend to larger applications and have poor reusability.

In view of the above, this paper studies the component flow model, multi-core multi-processor scheduling, and the real-time computing process. First, based on the component flow model, the models and functions of components and component flows suitable for CPU+GPU heterogeneous parallel computing are defined. Then, for multi-core, multi-processor hardware, the task scheduling strategy, data distribution strategy, and multi-core parallel strategy are explored. Finally, taking radar signal-level simulation as the application, a CPU+GPU heterogeneous computing framework based on the simulation model is proposed and verified. The results show that the component-flow-based CPU+GPU heterogeneous framework makes full use of the computing resources of heterogeneous multiprocessors, improves the speed and efficiency of radar signal simulation, realizes automatic distribution and load balancing across multiple computers through components, and offers good portability, strong reusability, and fast computation.

2 Component Flow Model

2.1 Component Flow Model

Developing algorithms directly against CPU and GPU processors leads to poor reusability and portability. This paper therefore studies a component flow model to enable algorithm reuse. A component is an abstract model of a computing function, as shown in Fig. 1; the numbers on the left and right denote the serial numbers of the input and output ports, respectively. The component model also includes an initialization function and a processing function, which are called automatically at initialization time and when data arrive, respectively. A component container is a process running on the CPU that handles data communication between processors, dynamically loads and initializes local components, and isolates components from the underlying operating system version. The component flow diagram defines the data flow and temporal relationships between components and thus realizes the specific algorithm logic. As shown in Fig. 2, the component flow diagram of an application configures the data input/output relationships and data distribution rules among components, as well as the resources assigned to each component. Each output port can choose broadcast, balanced, or assigned distribution. Each component can be set to run one or more instances; with multiple instances, the instance count is adjusted adaptively and dynamically according to the component's runtime behavior, realizing data parallelism and load balancing among the instances of the same component.
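As a minimal sketch of this component model, assuming a C++ runtime, a component might expose its initialization and processing hooks as follows. The type and member names (IComponent, Packet) are illustrative, not the framework's actual API:

```cpp
// Illustrative sketch of the component abstraction described above.
#include <cstddef>
#include <vector>

struct Packet {
    const void* data;   // payload arriving on an input port
    std::size_t bytes;
};

class IComponent {
public:
    virtual ~IComponent() = default;
    // Called once by the component container at load time.
    virtual void initialize() = 0;
    // Called automatically whenever data arrive on an input port;
    // results are written to the numbered output ports.
    virtual void process(int inPort, const Packet& in,
                         std::vector<Packet>& outPorts) = 0;
};
```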

Fig. 1. Component diagram.

Fig. 2. Component flow diagram of an application.

2.2 Task Scheduling Strategy for Multi-core, Multi-processor Systems

The composition of the multi-core, multi-processor task scheduling framework is shown in Fig. 3.

Fig. 3. Multi-core, multi-processor task scheduling framework.

The framework consists of three parts: the component flow management software, the component container software, and the components themselves. In Fig. 3, items with the same fill color belong to the same component flow task; the system supports multiple tasks running simultaneously. Running a component flow requires a component flow driver for overall control and management, which parses the component flow, applies for resources, and controls the components. The components of a component flow on one computing node are controlled by a single component container, which loads components, splits tasks, distributes data, and invokes components without adding traffic or latency. In this component-based parallel computing framework, multiple cores are used in three ways: different cores run different serial component instances, different cores run multiple instances of the same serial component, and a single component parallelizes internally across cores. For these cases, the CPU establishes a thread pool for multi-core parallel processing, while the GPU uses multiple threads and multiple streams to overlap CPU-GPU data transfers and increase processing efficiency through parallelism.
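The following sketch illustrates the GPU side of this strategy, assuming a CUDA runtime: each CPU worker thread owns its own CUDA stream, so work issued by different component instances can overlap on the device. The kernel and names are placeholders, not the framework's code:

```cpp
// Sketch: one CUDA stream per CPU worker thread.
#include <cuda_runtime.h>
#include <thread>
#include <vector>

__global__ void scale(float* d, int n, float k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= k;
}

void worker(int id, float* dev, int n) {
    cudaStream_t s;
    cudaStreamCreate(&s);                           // one stream per thread
    scale<<<(n + 255) / 256, 256, 0, s>>>(dev + id * n, n, 2.0f);
    cudaStreamSynchronize(s);                       // wait on this stream only
    cudaStreamDestroy(s);
}

int main() {
    const int nThreads = 4, n = 1 << 20;
    float* dev;
    cudaMalloc(&dev, sizeof(float) * n * nThreads); // one slice per thread
    std::vector<std::thread> pool;
    for (int t = 0; t < nThreads; ++t) pool.emplace_back(worker, t, dev, n);
    for (auto& th : pool) th.join();
    cudaFree(dev);
}
```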

Depending on the instance counts of two adjacent components and the data distribution strategy of the upstream component's output port, several scenarios arise: 1-to-1, 1-to-N broadcast, 1-to-N balanced, N-to-1, M-to-N balanced, N-to-N balanced, and so on. Some of these data distribution scenarios are shown in Fig. 4. Two communicating components can be located in one of three ways: loaded by the same process, running on different cores of the same node, or running on different processors. Correspondingly, there are three communication modes: in-process communication, inter-core communication, and network communication, preferred in that order.
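A minimal sketch of this priority rule, with assumed names (CommMode, selectMode), could look like this:

```cpp
// Sketch of the communication-mode choice described above.
enum class CommMode { InProcess, InterCore, Network };

CommMode selectMode(bool sameProcess, bool sameNode) {
    if (sameProcess) return CommMode::InProcess;  // shared address space
    if (sameNode)    return CommMode::InterCore;  // e.g. shared memory / IPC
    return CommMode::Network;                     // cross-node transfer
}
```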

Fig. 4. Partial data distribution strategies.

3 Component Flow Framework

The component flow framework and its deployment are shown in Fig. 5; the framework comprises a hardware platform, a distributed computing platform, and an application layer.

Fig. 5. The component flow framework.

Fig. 6. System composition.

The hardware platform includes the heterogeneous hardware layer and the operating system layer above it. The former consists of CPU and GPU processors; the latter runs on the CPU and can be Windows or Linux. The distributed computing platform includes three parts. The virtualization layer shields components from the hardware platform through the component model, making the processor hardware uniform and simple to target, and automatically providing dynamic component reconfiguration and multi-core parallelism. The resource management layer is responsible for monitoring, scheduling, and managing CPU and GPU resources; it abstracts the CPU and GPU processors into a unified resource pool and provides automatic deployment, automatic startup, dynamic monitoring, and dynamic optimization of resources. The task management layer is responsible for task scheduling and management: it parses the component flow graph configuration, applies to the resource management layer for computing resources, calls processing functions for real-time parallel computing, and balances load among multiple instances of the same component. The application layer consists of the user components developed by users and the component flow diagrams used in actual scenarios.

The system composition is shown in Fig. 6. The computing cluster consists of multiple computing nodes and provides visual monitoring of resource status. CFSM is the system management module: it aggregates the resource information of all computing nodes, manages components, provides component upload, download, and delete functions, and stores component flow operation records. CFNA is the node agent module: it manages the component containers on its node, collects the node's resource information, and reports it to CFSM. CFDriver is the component flow driver, with four functions: (1) parse the component flow and apply to CFSM for computing resources; (2) deploy the components of the flow to the allocated computing nodes and start the CFContainers; (3) build the data transfer network between the CFContainers and start the component flow computation; (4) monitor the running status of the component flow. CFContainer is the component container: it loads and initializes components, receives data and calls the component processing functions, and periodically uploads each component's status to CFDriver. CFClient is the system client, with three functions: (1) provide a cluster status monitoring interface; (2) provide component management, allowing users to upload, download, or delete components; (3) provide component flow operation monitoring, allowing users to view real-time or historical component flow records.
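As an illustration only, a CFContainer dispatch loop might look like the sketch below, reusing the IComponent and Packet types from the earlier component sketch; the Message type and the queue are stand-ins for the real data transfer network:

```cpp
// Hypothetical sketch of a CFContainer dispatch loop.
#include <memory>
#include <queue>
#include <vector>
// (IComponent and Packet as defined in the earlier component sketch)

struct Message { int componentId; int port; Packet packet; };

void containerLoop(std::vector<std::unique_ptr<IComponent>>& components,
                   std::queue<Message>& inbox) {
    for (auto& c : components) c->initialize();     // load and initialize
    while (!inbox.empty()) {
        Message m = inbox.front(); inbox.pop();     // data arrives
        std::vector<Packet> outs;
        components[m.componentId]->process(m.port, m.packet, outs);
        // forward 'outs' per the flow graph; periodically report
        // component status to CFDriver (omitted in this sketch)
    }
}
```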

4 Results Analysis

Based on the above research, the component-flow framework is applied to radar signal processing, as shown in Fig. 7; the flow includes a display and control component, an amplification component, an IQ component, a sampling component, and others (Table 1).

Fig. 7. Flow diagram of radar signal processing.

Fig. 8. Performance of multi-channel data processing modes.

Table 1. Performance results of each sub-algorithm (4096-point segmented FFT).

The performance test results of each sub-algorithm in Fig. 7 are shown in Table 1. The performance metric is the elapsed time of each sub-algorithm from start to finish, averaged over 10000 iterations (unit: ms). This paper tests the performance of four modes: single-card single-thread, single-card multi-thread, single-card single-thread multi-stream, and single-card multi-thread multi-stream, as shown in Fig. 8. For convenience, each data channel takes an input signal of the same length (16 pulses). The performance metric is the time from the start to the end of processing all channel data, including interface initialization, transfer of the input signal to device memory, signal processing, and transfer of the results back to host memory. The "input + process + output" code is looped 10000 times and the average performance is recorded. As a baseline, the single-channel loop test takes 2.093 ms.
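A sketch of this timing methodology follows, assuming CUDA event timers (the paper does not specify its timing mechanism); processChannel() is a placeholder for one "input + process + output" iteration:

```cpp
// Sketch: averaging per-iteration time over 10000 loops with CUDA events.
#include <cuda_runtime.h>
#include <cstdio>

void processChannel() { /* H2D copy, signal processing, D2H copy (stub) */ }

int main() {
    const int iters = 10000;
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0);
    for (int i = 0; i < iters; ++i) processChannel();
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);                 // wait until timing is complete
    float ms = 0.f;
    cudaEventElapsedTime(&ms, t0, t1);
    std::printf("average per iteration: %.3f ms\n", ms / iters);
    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
}
```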

As Fig. 8 shows, the modes that use streams outperform the modes that do not, indicating that the GPU's underlying hardware execution model is decisive for data processing performance. Single-card single-thread and single-card multi-thread modes perform almost identically, because without explicit streams all API calls use the default null stream, in which all CUDA operations execute sequentially. Explicit streams are faster than the default null stream, which is likely due to pinned (non-pageable) host memory enabling the asynchronous data transfers that streams allow. Single-thread multi-stream performs almost the same as multi-thread multi-stream: each "input + processing + output" step in the multi-stream test is issued asynchronously, so a single thread does not noticeably limit how quickly CUDA operations are delivered. The latter is nonetheless slightly faster, because multiple CPU threads can compute and issue CUDA commands to their streams more efficiently.
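The following sketch shows the pinned-memory pattern this explanation refers to: cudaMemcpyAsync only overlaps transfers with computation when the host buffer is page-locked via cudaMallocHost. The buffer size and kernel are illustrative:

```cpp
// Sketch: pinned host memory + asynchronous transfers on one stream.
#include <cuda_runtime.h>

__global__ void addOne(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *host, *dev;
    cudaMallocHost(&host, sizeof(float) * n);   // pinned (page-locked) buffer
    cudaMalloc(&dev, sizeof(float) * n);
    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaMemcpyAsync(dev, host, sizeof(float) * n,
                    cudaMemcpyHostToDevice, s);  // truly asynchronous
    addOne<<<(n + 255) / 256, 256, 0, s>>>(dev, n);
    cudaMemcpyAsync(host, dev, sizeof(float) * n,
                    cudaMemcpyDeviceToHost, s);
    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
    cudaFree(dev);
    cudaFreeHost(host);
}
```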

5 Conclusion

To improve computing speed and efficiency, this research targets CPU+GPU heterogeneous computing clusters. This paper studies the component flow model, uses multi-core, multi-processor scheduling to achieve dynamic task scheduling, and builds a heterogeneous computing framework for real-time radar multi-signal simulation. Through components and component flows, the algorithms are abstracted away from the specific hardware environment and operating system, which accommodates both CPU and GPU processor types and makes the system scalable and reconfigurable. The results show that the component-flow-based CPU+GPU heterogeneous framework makes full use of heterogeneous multiprocessor computing resources, improves simulation efficiency, and offers good portability and reusability.