1 Introduction

High performance computing (HPC) is a foundational information technology and a key enabler of networked information systems. With the diversified development of chip technology, many kinds of high-performance processors have emerged, including CPUs, GPUs, MICs, and FPGAs, each suited to different application scenarios or algorithms [1, 2]. The simple computing mode of a single processor can no longer meet complex workload requirements [3]. To improve hardware processing capacity, the CPU is typically used as the main controller, with GPU, MIC, or FPGA accelerators attached over the PCIe bus to offload computing tasks; this is the CPU+X heterogeneous computing mode. Among these, CPU+GPU heterogeneous computing is the most mature and delivers the best performance [4]: the peak performance of the NVIDIA Tesla V100 GPU reaches 15 TFlops, and at the same computational accuracy a GPU-accelerated server can be dozens of times faster than a traditional CPU [5, 6]. This paper therefore studies CPU+GPU heterogeneous computing clusters. However, CPU+GPU heterogeneous computing raises two new problems [7, 8]: the scheduling of distributed computing resources, and the scheduling of tasks between CPU and GPU. Both can be addressed with multi-core, multi-processor techniques [9], which involve multi-core resource scheduling, multi-task scheduling, inter-processor communication, and load balancing.

Optimal scheduling of parallel tasks on multiple processors has been proven NP-hard [10]. TDS (Task Duplication Scheduling) [11] divides all tasks into multiple paths according to the dependency topology, and the tasks on each path are executed as a group on one processor; although this reduces latency and shortens running time, it increases energy consumption. In addition, CPUs and GPUs differ in hardware structure, application domain, and development model, resulting in poor portability [12]. Sourouri et al. [13] statically partitioned the workload of a simple 3D 7-point stencil computation between CPU and GPU, achieving a 1.1-1.2x speedup. Pereira et al. [14] demonstrated simple static load balancing between CPU and GPU on a single stencil application, achieving up to a 1.28x speedup, and later [15] applied time tiling in the same PSkel framework to reduce CPU-GPU communication at the cost of redundant computation. Most of these approaches use static load balancing and consider only a single (often repeated) stencil, so they are difficult to extend to larger applications and have poor reusability.

In view of the above, this paper studies the component flow model, multi-core multi-processor scheduling, and the real-time computing process. First, based on the component flow model, the models and functions of components and component flows suitable for CPU+GPU heterogeneous parallel computing are defined. Then, for multi-core, multi-processor hardware, the task scheduling strategy, data distribution strategy, and multi-core parallel strategy are explored. Finally, taking radar signal-level simulation as the application, a CPU+GPU heterogeneous computing framework based on the simulation model is proposed and verified. The results show that the component-flow-based CPU+GPU heterogeneous framework makes full use of the computing resources of heterogeneous multiprocessors, improves the speed and efficiency of radar signal simulation, realizes automatic distribution and load balancing across multiple computers through components, and offers good portability, strong reusability, and fast computation.

2 Component Flow Model

2.1 Component Flow Model

Developing algorithms directly against CPU and GPU processors leads to poor reusability and portability. This paper therefore studies a component flow model to enable algorithm reuse. A component is an abstract model of a computing function, as shown in Fig. 1; the numbers on the left and right denote the serial numbers of the input and output ports, respectively. The component model also includes an initialization function and a processing function, which are called automatically at initialization time and when data arrive, respectively. A component container is a process running on the CPU that handles data communication between processors, dynamically loads and initializes local components, and isolates components from the underlying operating system version. The component flow diagram defines the data flow and temporal relationships between components and thus realizes the specific algorithm logic. As shown in Fig. 2, the component flow diagram of an application configures the data input/output relationships and data distribution rules among components, as well as the resources assigned to each component. Each output port can choose broadcast, balanced, or assigned distribution. Each component can be set to run one or more instances; with multiple instances, the instance count is adjusted adaptively and dynamically according to the component's runtime behavior, realizing data parallelism and load balancing among the instances of the same component.
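As a minimal sketch of this component model, assuming a C++ runtime, a component might expose its initialization and processing hooks as follows. The type and member names (IComponent, Packet) are illustrative, not the framework's actual API:

```cpp
// Illustrative sketch of the component abstraction described above.
#include <cstddef>
#include <vector>

struct Packet {
    const void* data;   // payload arriving on an input port
    std::size_t bytes;
};

class IComponent {
public:
    virtual ~IComponent() = default;
    // Called once by the component container at load time.
    virtual void initialize() = 0;
    // Called automatically whenever data arrive on an input port;
    // results are written to the numbered output ports.
    virtual void process(int inPort, const Packet& in,
                         std::vector<Packet>& outPorts) = 0;
};
```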

Fig. 1. Component diagram.

Fig. 2. Component flow diagram of an application.

2.2 Task Scheduling Strategy for Multi-core, Multi-processor Systems

The composition of the multi-core, multi-processor task scheduling framework is shown in Fig. 3.

Fig. 3. Multi-core, multi-processor task scheduling framework.

The framework consists of three parts: the component flow management software, the component container software, and the components themselves. In Fig. 3, items with the same fill color belong to the same component flow task; the system supports multiple tasks running simultaneously. Running a component flow requires a component flow driver for overall control and management, which parses the component flow, applies for resources, and controls the components. The components of a component flow on one computing node are controlled by a single component container, which loads components, splits tasks, distributes data, and invokes components without adding traffic or latency. In this component-based parallel computing framework, multiple cores are used in three ways: different cores run different serial component instances, different cores run multiple instances of the same serial component, and a single component parallelizes internally across cores. For these cases, the CPU establishes a thread pool for multi-core parallel processing, while the GPU uses multiple threads and multiple streams to overlap CPU-GPU data transfers and increase processing efficiency through parallelism.
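The following sketch illustrates the GPU side of this strategy, assuming a CUDA runtime: each CPU worker thread owns its own CUDA stream, so work issued by different component instances can overlap on the device. The kernel and names are placeholders, not the framework's code:

```cpp
// Sketch: one CUDA stream per CPU worker thread.
#include <cuda_runtime.h>
#include <thread>
#include <vector>

__global__ void scale(float* d, int n, float k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= k;
}

void worker(int id, float* dev, int n) {
    cudaStream_t s;
    cudaStreamCreate(&s);                           // one stream per thread
    scale<<<(n + 255) / 256, 256, 0, s>>>(dev + id * n, n, 2.0f);
    cudaStreamSynchronize(s);                       // wait on this stream only
    cudaStreamDestroy(s);
}

int main() {
    const int nThreads = 4, n = 1 << 20;
    float* dev;
    cudaMalloc(&dev, sizeof(float) * n * nThreads); // one slice per thread
    std::vector<std::thread> pool;
    for (int t = 0; t < nThreads; ++t) pool.emplace_back(worker, t, dev, n);
    for (auto& th : pool) th.join();
    cudaFree(dev);
}
```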

Depending on the instance counts of two adjacent components and the data distribution strategy of the upstream component's output port, several scenarios arise: 1-to-1, 1-to-N broadcast, 1-to-N balanced, N-to-1, M-to-N balanced, N-to-N balanced, and so on. Some of these data distribution scenarios are shown in Fig. 4. Two communicating components can be located in one of three ways: loaded by the same process, running on different cores of the same node, or running on different processors. Correspondingly, there are three communication modes: in-process communication, inter-core communication, and network communication, preferred in that order.
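A minimal sketch of this priority rule, with assumed names (CommMode, selectMode), could look like this:

```cpp
// Sketch of the communication-mode choice described above.
enum class CommMode { InProcess, InterCore, Network };

CommMode selectMode(bool sameProcess, bool sameNode) {
    if (sameProcess) return CommMode::InProcess;  // shared address space
    if (sameNode)    return CommMode::InterCore;  // e.g. shared memory / IPC
    return CommMode::Network;                     // cross-node transfer
}
```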

Fig. 4. Partial data distribution strategies.

3 Component Flow Framework

The component flow framework and its deployment are shown in Fig. 5; the framework comprises a hardware platform, a distributed computing platform, and an application layer.

Fig. 5. The component flow framework.

Fig. 6. System composition.

The hardware platform includes the heterogeneous hardware layer and the operating system layer above it. The former consists of CPU and GPU processors; the latter runs on the CPU and can be Windows or Linux. The distributed computing platform includes three parts. The virtualization layer shields components from the hardware platform through the component model, making the processor hardware uniform and simple to target, and automatically providing dynamic component reconfiguration and multi-core parallelism. The resource management layer is responsible for monitoring, scheduling, and managing CPU and GPU resources; it abstracts the CPU and GPU processors into a unified resource pool and provides automatic deployment, automatic startup, dynamic monitoring, and dynamic optimization of resources. The task management layer is responsible for task scheduling and management: it parses the component flow graph configuration, applies to the resource management layer for computing resources, calls processing functions for real-time parallel computing, and balances load among multiple instances of the same component. The application layer consists of the user components developed by users and the component flow diagrams used in actual scenarios.

The system composition is shown in Fig. 6. The computing cluster consists of multiple computing nodes and provides visual monitoring of resource status. CFSM is the system management module: it aggregates the resource information of all computing nodes, manages components, provides component upload, download, and delete functions, and stores component flow operation records. CFNA is the node agent module: it manages the component containers on its node, collects the node's resource information, and reports it to CFSM. CFDriver is the component flow driver, with four functions: (1) parse the component flow and apply to CFSM for computing resources; (2) deploy the components of the flow to the allocated computing nodes and start the CFContainers; (3) build the data transfer network between the CFContainers and start the component flow computation; (4) monitor the running status of the component flow. CFContainer is the component container: it loads and initializes components, receives data and calls the component processing functions, and periodically uploads each component's status to CFDriver. CFClient is the system client, with three functions: (1) provide a cluster status monitoring interface; (2) provide component management, allowing users to upload, download, or delete components; (3) provide component flow operation monitoring, allowing users to view real-time or historical component flow records.
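As an illustration only, a CFContainer dispatch loop might look like the sketch below, reusing the IComponent and Packet types from the earlier component sketch; the Message type and the queue are stand-ins for the real data transfer network:

```cpp
// Hypothetical sketch of a CFContainer dispatch loop.
#include <memory>
#include <queue>
#include <vector>
// (IComponent and Packet as defined in the earlier component sketch)

struct Message { int componentId; int port; Packet packet; };

void containerLoop(std::vector<std::unique_ptr<IComponent>>& components,
                   std::queue<Message>& inbox) {
    for (auto& c : components) c->initialize();     // load and initialize
    while (!inbox.empty()) {
        Message m = inbox.front(); inbox.pop();     // data arrives
        std::vector<Packet> outs;
        components[m.componentId]->process(m.port, m.packet, outs);
        // forward 'outs' per the flow graph; periodically report
        // component status to CFDriver (omitted in this sketch)
    }
}
```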

4 Results Analysis

Based on the above research, the component-flow framework is applied to radar signal processing, as shown in Fig. 7; the flow includes a display and control component, an amplification component, an IQ component, a sampling component, and others (Table 1).

Fig. 7. Flow diagram of radar signal processing.

Fig. 8. Performance of multi-channel data processing modes.

Table 1. Performance results of each sub-algorithm (4096-point segmented FFT).

The performance test results of each sub-algorithm in Fig. 7 are shown in Table 1. The performance metric is the elapsed time of each sub-algorithm from start to finish, averaged over 10000 iterations (unit: ms). This paper tests the performance of four modes: single-card single-thread, single-card multi-thread, single-card single-thread multi-stream, and single-card multi-thread multi-stream, as shown in Fig. 8. For convenience, each data channel takes an input signal of the same length (16 pulses). The performance metric is the time from the start to the end of processing all channel data, including interface initialization, transfer of the input signal to device memory, signal processing, and transfer of the results back to host memory. The "input + process + output" code is looped 10000 times and the average performance is recorded. As a baseline, the single-channel loop test takes 2.093 ms.
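A sketch of this timing methodology follows, assuming CUDA event timers (the paper does not specify its timing mechanism); processChannel() is a placeholder for one "input + process + output" iteration:

```cpp
// Sketch: averaging per-iteration time over 10000 loops with CUDA events.
#include <cuda_runtime.h>
#include <cstdio>

void processChannel() { /* H2D copy, signal processing, D2H copy (stub) */ }

int main() {
    const int iters = 10000;
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0);
    for (int i = 0; i < iters; ++i) processChannel();
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);                 // wait until timing is complete
    float ms = 0.f;
    cudaEventElapsedTime(&ms, t0, t1);
    std::printf("average per iteration: %.3f ms\n", ms / iters);
    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
}
```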

As Fig. 8 shows, the modes that use streams outperform the modes that do not, indicating that the GPU's underlying hardware execution model is decisive for data processing performance. Single-card single-thread and single-card multi-thread modes perform almost identically, because without explicit streams all API calls use the default null stream, in which all CUDA operations execute sequentially. Explicit streams are faster than the default null stream, which is likely due to pinned (non-pageable) host memory enabling the asynchronous data transfers that streams allow. Single-thread multi-stream performs almost the same as multi-thread multi-stream: each "input + processing + output" step in the multi-stream test is issued asynchronously, so a single thread does not noticeably limit how quickly CUDA operations are delivered. The latter is nonetheless slightly faster, because multiple CPU threads can compute and issue CUDA commands to their streams more efficiently.
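The following sketch shows the pinned-memory pattern this explanation refers to: cudaMemcpyAsync only overlaps transfers with computation when the host buffer is page-locked via cudaMallocHost. The buffer size and kernel are illustrative:

```cpp
// Sketch: pinned host memory + asynchronous transfers on one stream.
#include <cuda_runtime.h>

__global__ void addOne(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *host, *dev;
    cudaMallocHost(&host, sizeof(float) * n);   // pinned (page-locked) buffer
    cudaMalloc(&dev, sizeof(float) * n);
    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaMemcpyAsync(dev, host, sizeof(float) * n,
                    cudaMemcpyHostToDevice, s);  // truly asynchronous
    addOne<<<(n + 255) / 256, 256, 0, s>>>(dev, n);
    cudaMemcpyAsync(host, dev, sizeof(float) * n,
                    cudaMemcpyDeviceToHost, s);
    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
    cudaFree(dev);
    cudaFreeHost(host);
}
```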

5 Conclusion

To improve computing speed and efficiency, this research targets CPU+GPU heterogeneous computing clusters. This paper studies the component flow model, uses multi-core, multi-processor scheduling to achieve dynamic task scheduling, and builds a heterogeneous computing framework for real-time radar multi-signal simulation. Through components and component flows, the algorithms are abstracted away from the specific hardware environment and operating system, which accommodates both CPU and GPU processor types and makes the system scalable and reconfigurable. The results show that the component-flow-based CPU+GPU heterogeneous framework makes full use of heterogeneous multiprocessor computing resources, improves simulation efficiency, and offers good portability and reusability.