This chapter introduces the Huawei Ascend AI processor and the Huawei Atlas AI computing solution, focusing on the hardware and software architecture of the Ascend AI processor and on Huawei's full-stack, all-scenario AI solution.

6.1 The Hardware Architecture of Ascend AI Processor

6.1.1 The Logic Architecture of Ascend AI Processor Hardware

The hardware logic architecture of the Ascend AI processor is mainly composed of four modules: the control CPU, the AI computing engine (including the AI Core and the AI CPU), the multi-level system-on-chip cache (buffer), and the Digital Vision Pre-Processing (DVPP) module, as shown in Fig. 6.1. The following sections focus on the AI Core of the AI computing engine, namely the Da Vinci Architecture.

Fig. 6.1 The logic architecture of Ascend AI processor hardware

6.1.2 Da Vinci Architecture

Da Vinci Architecture, which is both the AI computing engine and the core of the Ascend AI processor, is specially developed to enhance AI computing power.

Da Vinci Architecture is mainly composed of three parts: computing unit, memory system and control unit.

  1.

    The computing unit includes three basic computing resources: Cube Unit, Vector Unit and Scalar Unit.

  2.

    The memory system includes AI Core’s on-chip memory unit and the corresponding data path.

  3.

    The control unit provides command control for the whole calculation process; it acts as the command center of the AI Core and is responsible for the overall operation of the AI Core.

Da Vinci Architecture is shown in Fig. 6.2.

Fig. 6.2 Da Vinci Architecture

  1.

    Computing Unit

    There are three basic computing units in Da Vinci Architecture: Cube Unit, Vector Unit and Scalar Unit, which respectively correspond to the three common computing modes, namely, cube, vector and scalar, as shown in Fig. 6.3.

    Cube Unit: The Cube Unit, together with its accumulators, performs matrix-related operations. In one beat it completes the multiplication of a 16 × 16 matrix by a 16 × 16 matrix in FP16 (4096 multiply-accumulate operations); if the input data is of INT8 type, one beat completes the multiplication of a 16 × 32 matrix by a 32 × 16 matrix (8192 multiply-accumulate operations). A short numeric check of these operation counts is given after this list.

    Vector Unit: It implements calculations between a vector and a scalar as well as between two vectors. Its functions cover basic and customized calculation types and support data types such as FP16, FP32, INT32 and INT8.

    Scalar Unit: It is equivalent to a micro CPU that controls the operation of the entire AI Core. It handles loop control and branch judgment for the whole program, computes data addresses and related parameters for the matrix and vector units, and performs basic arithmetic operations.

  2.

    Memory System

    Memory Unit and the corresponding data path constitute Memory System of Da Vinci Architecture, as shown in Fig. 6.4.

    (a)

      The memory unit is composed of the storage control unit, buffers and registers.

      • Storage Control Unit: Through the bus interface it provides direct access to the lower-level caches outside the AI Core, and it can also access memory directly via Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM, DDR for short) or High Bandwidth Memory (HBM). It also contains a memory migration unit, which manages the reading and writing of internal AI Core data between the different buffers and acts as the transmission controller of the AI Core's internal data path, completing a series of format conversion operations such as padding, Img2Col, transposition and extraction.

      • Input Buffer: It temporarily holds data that is used frequently, which avoids reading the data from outside the AI Core via the bus interface every time. This reduces the frequency of data accesses on the bus and the risk of bus congestion, thereby minimizing power dissipation and improving performance.

      • Output Buffer: It stores the intermediate results of each calculation layer of the neural network so that the data can be fetched conveniently when entering the next layer. Reading the data through the bus would instead offer low bandwidth and high latency, so the output buffer greatly improves calculation efficiency.

      • Register: All kinds of register resources in AI Core are mainly used by scalar units.

    (b)

      Data path refers to the flow path of data in AI Core when AI Core completes a computing task.

      The data path of the Da Vinci Architecture is characterized by multiple inputs and a single output. This is mainly because the computing process of a neural network involves numerous and varied input data, so parallel inputs improve the efficiency of data inflow. By contrast, processing the multiple inputs produces only the output feature matrix, whose data type is relatively uniform, so a single-output data path saves chip hardware resources.

  3.

    Control Unit

    The control unit is made up of the system control module, command cache, scalar command processing queue, command transmission module, command execution queue and event synchronization module, as shown in Fig. 6.5.

    (a)

      System Control Module: It controls the execution process of a task block (the minimum granularity of computing tasks in the AI Core). After a task block is executed, the system control module performs interrupt handling and reports the state. If an error occurs during execution, the error status is reported to the Task Scheduler.

    (b)

      Command Cache: In the process of command execution, subsequent commands can be prefetched in advance, and multiple commands can be read into the cache at one time to improve the efficiency of command execution.

    (c)

      Scalar Command Processing Queue: After the command is decoded, it will be imported into scalar queue to realize address decoding and computing control. The commands include matrix computing command, vector computing command and memory conversion command.

    (d)

      Command Transmission Module: It reads the command addresses and decoded parameters configured in the scalar command queue and sends them to the corresponding command execution queue according to their command types, while scalar commands remain in the scalar command processing queue for subsequent execution.

    (e)

      Command Execution Queue: It is composed of matrix queue, vector queue and memory conversion queue. Different commands enter different queues, and the commands in the queue are executed in the order of entry.

    (f)

      Event Synchronization Module: It continuously monitors the execution state of each command pipeline and analyzes the dependencies between different pipelines, so as to resolve data dependence and synchronization between the command pipelines.
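To make the Cube Unit's per-beat throughput concrete, the following minimal Python sketch checks the operation counts quoted above. The shapes and counts come from the text; NumPy is used only to verify the result shapes, and the helper function is purely illustrative.

  import numpy as np

  def mac_count(m, k, n):
      # Multiply-accumulate operations in an (m x k) @ (k x n) matrix product.
      return m * k * n

  # FP16 case: one beat completes a 16 x 16 by 16 x 16 multiplication.
  a = np.zeros((16, 16), dtype=np.float16)
  b = np.zeros((16, 16), dtype=np.float16)
  print((a @ b).shape, mac_count(16, 16, 16))      # (16, 16) 4096

  # INT8 case: one beat completes a 16 x 32 by 32 x 16 multiplication.
  a8 = np.zeros((16, 32), dtype=np.int8)
  b8 = np.zeros((32, 16), dtype=np.int8)
  print((a8.astype(np.int32) @ b8.astype(np.int32)).shape, mac_count(16, 32, 16))   # (16, 16) 8192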

Fig. 6.3 Da Vinci Architecture—computing unit

Fig. 6.4 Da Vinci Architecture—memory system

Fig. 6.5 Da Vinci Architecture—control unit

6.2 The Software Architecture of Ascend AI Processor

6.2.1 The Logic Architecture of Ascend AI Processor Software

The software stack of the Ascend AI processor is mainly divided into four layers and one auxiliary tool chain. The four layers are the L3 application enabling layer, the L2 execution framework layer, the L1 chip enabling layer and the L0 computing resource layer. The tool chain provides auxiliary capabilities such as engineering management, compilation and debugging, Matrix, logging and profiling. The main components of the software stack depend on each other functionally, carrying the data flow, calculation flow and control flow, as shown in Fig. 6.6.

Fig. 6.6 The logic architecture of Ascend AI processor software

  1.

    L3 Application Enabling Layer

    The L3 application enabling layer is an application-level encapsulation. It provides different processing algorithms for specific applications and offers computing and processing engines for various fields. It can also directly use the framework scheduling capability provided by the L2 execution framework layer and generate the corresponding neural network through the general framework to realize specific engine functions.

    L3 application enabling layer includes computer vision engine, language engine, general business execution engine, etc.

    (a)

      Computer Vision Engine: It encapsulates video and image processing algorithms and is dedicated to algorithms and applications in the field of computer vision.

    (b)

      Language Engine: It encapsulates basic processing algorithms for voice, text and other data, so that language processing can be carried out according to the specific application scenario.

    (c)

      General Business Execution Engine: It provides general neural network inference capability.

  2.

    L2 Executive Framework Layer

    The L2 executive framework layer encapsulates the framework calling capability and the offline model generation capability. After the L3 application enabling layer develops an application algorithm and encapsulates it into an engine, a suitable deep learning framework (such as Caffe or TensorFlow) is called according to the characteristics of the algorithm to obtain a neural network with the corresponding functions, and the offline model (OM) is then generated by the Framework Manager.

    L2 executive framework layer contains the framework manager and the process choreographer.

    (a)

      The Framework Manager contains the Offline Model Generator (OMG), the Offline Model Executor (OME) and the offline model inference interface, which together support model generation, loading, unloading and inference.

      The online framework generally uses mainstream open-source deep learning frameworks (such as Caffe and TensorFlow) and accelerates computation on the Ascend AI processor through offline model transformation and loading.

      The offline framework, by contrast, refers to the offline generation and execution capability that the L2 executive framework layer provides for the Ascend AI processor: the resulting offline model can run detached from the open-source deep learning framework (such as Caffe or TensorFlow) while retaining the same capability (mainly inference).

      • Offline Model Generator is responsible for transforming the models trained by Caffe or TensorFlow into offline models supported by Ascend AI processor.

      • Offline Model Executor is responsible for loading and unloading the off-line model, converting the successfully loaded model file into an executable instruction sequence on Ascend AI processor, and completing the program compilation before execution.

    (b)

      Matrix: It provides developers with a development platform for deep learning computing, including computing resources, an operating framework and related supporting tools, and it allows developers to conveniently and efficiently write AI applications that run on specific hardware devices. It is responsible for the generation, loading and scheduling of models.

      After L2 executive framework layer transforms the original model of neural network into an offline model that can run on Ascend AI processor, Offline Model Executor transmits the offline model to L1 chip enabling layer for task allocation.

  3.

    L1 Chip Enabling Layer

    L1 chip enabling layer is the bridge between offline model and Ascend AI processor. After receiving the offline model generated by L2 execution framework layer, L1 chip enabling layer will provide acceleration function for offline model calculation through Acceleration Library based on different computing tasks.

    L1 chip enabling layer is the layer closest to the underlying computing resources, responsible for dispatching operator-level tasks to hardware. It is mainly composed of Digital Vision Pre-Processing (DVPP), Tensor Boost Engine (TBE), Runtime, Driver and Task Scheduler.

    In the L1 chip enabling layer, the chip's TBE is the core. It supports online and offline model acceleration, providing both a standard operator acceleration library and the capability to customize operators. The standard operator acceleration library contains operators that deliver good performance after optimization. During execution the operators interact with Runtime, which sits above the operator acceleration library; Runtime in turn communicates with the L2 executive framework layer and exposes the interface for calling the standard operator acceleration library, so that a given network model can find the optimized, executable and accelerated operators that best implement its functions. If an operator required by the L2 executive framework layer is missing from the standard operator acceleration library of the L1 chip enabling layer, a new custom operator can be written with TBE to meet the need. TBE therefore provides the L2 executive framework layer with fully functional operators through its standard operator library and its operator customization capability.

    Below TBE sits the Task Scheduler. After a specific computing kernel function has been generated for an operator, the Task Scheduler processes and distributes it to the AI CPU or AI Core according to the task type, activating the hardware through the driver. The Task Scheduler itself runs on a dedicated CPU core.

    The Digital Vision Pre-Processing (DVPP) module is a multi-functional encapsulation for the image and video domain. For common image or video pre-processing scenarios, it provides the upper layers with various data pre-processing capabilities by using the underlying dedicated hardware.

  4.

    L0 Computing Resource Layer

    L0 computing resource layer is the hardware computing power foundation for Ascend AI processor, which provides computing resources and performs specific computing tasks.

    After L1 chip enabling layer completes the distribution of tasks corresponding to operators, the execution of specific computing tasks is started by L0 computing resource layer.

    L0 computing resource layer is composed of operating system, AI CPU, AI Core and DVPP dedicated hardware modules.

    AI Core is the computing core of Ascend AI processor, which mainly completes the matrix correlation calculation of neural network, while AI CPU performs the general calculation of control operator, scalar and vector. If the input data needs to be pre-processed, DVPP dedicated hardware modules will be activated to pre-process the image and video data, providing data formats to AI Core to meet the computing requirements in specific scenarios.

    The AI Core is mainly responsible for large-scale computing tasks, the AI CPU for complex computing and execution control, and the DVPP hardware for data pre-processing. The role of the operating system is to make the three cooperate closely to form a complete hardware system, which guarantees the execution of deep neural network computing on the Ascend AI processor.

  5.

    Tool Chain

    The tool chain, designed for the convenience of programmers, is a set of tools supporting the Ascend AI processor. It supports the development, debugging, network porting, optimization and analysis of custom operators. In addition, a set of desktop programming services is provided on top of the programmer-oriented programming interface, which lowers the barrier to entry for developing deep neural network applications.

    It comprises engineering management, compilation and testing, Matrix, offline model conversion, comparison, logging, profiling, custom operator tools and so on. The tool chain therefore provides multi-level, multi-functional and convenient services for developing and deploying applications on this platform. A short end-to-end sketch of how the four layers cooperate on the offline path follows.
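To summarize the offline path through the four layers described above, the short, self-contained Python sketch below models the flow from offline model generation (L2) through task dispatch (L1) to execution on the computing resources (L0). All class and method names are illustrative placeholders, not the actual interfaces of the Ascend software stack.

  class OfflineModelGenerator:                 # L2: Framework Manager / OMG
      def generate(self, original_model):
          # parse -> (optional) quantization -> compilation -> serialization (see Sect. 6.2.3)
          return {"source": original_model, "format": "om"}

  class Runtime:                               # L1: forwards tasks to the Task Scheduler
      def dispatch(self, op):
          return f"{op} executed on AI Core"   # L0: AI Core / AI CPU / DVPP hardware

  class OfflineModelExecutor:                  # L2: OME loads the model and drives execution
      def __init__(self, runtime):
          self.runtime = runtime
      def infer(self, offline_model, data):
          operators = ["conv2d", "relu", "fc"] # toy operator sequence of the loaded model
          return [self.runtime.dispatch(op) for op in operators]

  om = OfflineModelGenerator().generate("resnet50 trained in Caffe/TensorFlow")
  print(OfflineModelExecutor(Runtime()).infer(om, data=None))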

6.2.2 The Neural Network Software Flow of Ascend AI Processor

The neural network software flow of the Ascend AI processor is a bridge between the deep learning framework and the Ascend AI processor: it provides a path for a neural network to be transformed from the original model into an intermediate computation-graph representation and then into a standalone offline model.

The neural network software flow is mainly used to generate, load and execute the offline model of a neural network application. It gathers functional modules such as Matrix, the DVPP module, the Tensor Boost Engine, Framework, Runtime and the Task Scheduler into a complete functional cluster.

The neural network software flow of Ascend AI processor is shown in Fig. 6.7.

Fig. 6.7 The neural network software flow of Ascend AI processor

  1.

    Matrix: It is responsible for deploying and running the neural network on the Ascend AI processor, coordinating the whole effective process of the neural network and governing the loading and execution of the offline model.

  2.

    DVPP Module: It processes and adapts the data before input so that it meets the required computing format.

  3.

    Tensor Boost Engine: As an arsenal of neural network operators, it provides a steady stream of powerful computing operators for neural network models.

  4.

    Framework: It builds the original neural network model into a form supported by the Ascend AI processor and integrates the model with the processor, guiding the neural network to run at its full performance.

  5.

    Runtime: It provides various resource management channels for task distribution and allocation of neural network.

  6.

    Task Scheduler: As the task driver of hardware execution, it provides specific target tasks for the Ascend AI processor. Runtime and the Task Scheduler interact to form a dam system between the neural network task flow and the hardware resources, monitoring in real time and effectively distributing the different types of execution tasks.

The neural network software flow as a whole provides the Ascend AI processor with a fully functional execution process that combines hardware and software, supporting the development of related AI applications. The functional modules related to the neural network are introduced separately below.

6.2.3 Introduction to the Functional Modules of Ascend AI Processor Software Flow

  1.

    Tensor Boost Engine

    In the construction of neural networks, operators constitute network structures with different application functions. As an arsenal of operators, the Tensor Boost Engine (TBE) provides operator development capabilities for neural networks based on the Ascend AI processor: operators written in the TBE language are used to construct the various neural network models. TBE also provides the capability to wrap and call operators. It contains an optimized standard operator library for neural networks, which developers can use directly to achieve high-performance neural network computing. In addition, TBE provides operator fusion, opening a unique path for neural network optimization.

    TBE provides the capability to develop custom operators based on the Tensor Virtual Machine (TVM). Users can develop the corresponding neural network operators through the TBE language and the custom operator programming interface. TBE includes a Domain-Specific Language (DSL) Module, a Schedule Module, an Intermediate Representation (IR) Module, a Compiler Transfer Module and a CodeGen Module. The structure of TBE is shown in Fig. 6.8.

    TBE operator development is divided into writing the computational logic and developing the schedule. The DSL Module provides the programming interface for the operator's computational logic, allowing both the calculation process and the scheduling process of an operator to be described directly in the domain-specific language (a TVM-based sketch of this split is given at the end of this subsection). The calculation process describes the calculation methods and steps of the operator, while the scheduling process describes the plan for data segmentation and data flow. Each operator calculation is processed according to a fixed data shape, so the data shape must be segmented in advance, because the operators executed on the different computing units of the Ascend AI processor, such as the Cube Unit and the Vector Unit, and the operators executed on the AI CPU have different requirements for the input data shape.

    After the basic implementation process of the operator is defined, the Tiling sub-module of the Schedule Module is invoked to segment the data of the operator according to the scheduling description and to specify the data movement, so as to ensure optimal execution on the hardware. Besides data shape segmentation, the operator fusion and optimization capability of TBE is provided by the Fusion sub-module of the Schedule Module.

    After the operator has been written, an intermediate representation needs to be generated for further optimization; the IR Module generates it in an IR format similar to that of TVM. The intermediate representation is then compiled and optimized for various application scenarios, using optimization methods such as double buffering, pipeline synchronization, memory allocation management, command mapping and tiling adapted to the Cube Unit.

    After the operator has been processed by the Compiler Transfer Module, the CodeGen Module generates a temporary file of C-like code. The compiler turns this temporary file into an operator implementation file, which can be loaded and executed directly by the Offline Model Executor.

    To sum up, a complete user-defined operator completes the whole development process through sub modules of TBE. After the operator prototype is formed by the operator calculation logic and scheduling description provided by the domain specific language module, the scheduling module performs data segmentation and operator fusion, entering the intermediate representation module to generate the intermediate representation of the operator. The compiler transfer module uses the intermediate representation to optimize the memory allocation. Finally, the code generation module generates C-like code for the compiler to compile directly. In the process of operator definition, TBE not only completes the compilation of operators, but also completes related optimization, which boosts the performance of operators.

    The three application scenarios of TBE are shown in Fig. 6.9.

    (a)

      Generally, a neural network model implemented with the standard operators of a deep learning framework has already been trained on a GPU or another type of neural network processor. If such a model is to run on the Ascend AI processor, it is expected to reach maximum performance without changes to the original code. For this purpose TBE provides a complete operator acceleration library whose operator functions correspond one-to-one to the common standard operators of neural networks, and the software stack provides a programming interface for calling these operators, accelerating the upper-level deep learning frameworks and applications without requiring any low-level adaptation code for the Ascend AI processor.

    (b)

      If new operators appear when a neural network model is constructed, the standard operator library provided by TBE will not meet the development requirements. In that case custom operators must be developed in the TBE language, much as CUDA C++ is used on a GPU; more versatile operators can thus be realized and various network models can be programmed flexibly. The completed operators are handed to the compiler for compilation, and the final execution exploits the chip's acceleration capabilities on the AI Core or AI CPU.

    (c)

      In appropriate scenarios, the operator fusion capability provided by TBE can improve operator performance by letting neural network operators perform multi-level cache fusion across the buffers of different levels, and the Ascend AI processor can significantly improve resource utilization when executing the fused operators.

    To sum up, because of the capabilities of operator development, standard operator calling and operator fusion optimization provided by TBE, Ascend AI processor can meet the needs of functional diversification in the actual neural network application. Moreover, the way of network construction will be more flexible and the fusion optimization capability will lead to a better performance.
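    Since TBE's DSL is built on TVM, the split between computational logic and scheduling can be illustrated with plain TVM. The sketch below assumes TVM is installed and uses TVM's public te API; it is not the actual TBE interface, only the underlying idea of describing a computation, scheduling (tiling) it, and lowering it to an intermediate representation.

    import tvm
    from tvm import te

    n = te.var("n")
    A = te.placeholder((n,), name="A", dtype="float16")
    B = te.placeholder((n,), name="B", dtype="float16")

    # 1) Computational logic: an element-wise add written in the domain-specific language.
    C = te.compute((n,), lambda i: A[i] + B[i], name="C")

    # 2) Scheduling: decide how the computation is segmented and mapped to the hardware.
    s = te.create_schedule(C.op)
    outer, inner = s[C].split(C.op.axis[0], factor=16)   # tiling, cf. the Tiling sub-module

    # 3) Lower to an intermediate representation for further compilation and optimization.
    print(tvm.lower(s, [A, B, C], simple_mode=True))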

  2.

    Matrix

    (a)

      A Brief Introduction to the Function of Matrix.

      The Ascend AI processor divides network execution into levels and treats the execution of a specific function as a basic execution unit, the compute engine. Each compute engine completes a basic operation on the data in the arranged process, such as image classification. Compute engines are customized by the developer to perform the required specific functions.

      Through the unified call of Matrix, the whole deep neural network application generally includes four engines: Data Engine, Pre-processing Engine, Model Inference Engine and Post-processing Engine, as shown in Fig. 6.10.

      • Data Engine mainly prepares the data set required by the neural network (such as MNIST data set) and processes the corresponding data (such as image filtering, etc.) as the data source of the subsequent compute engine.

      • Generally, the input media data needs to go through format pre-processing to meet the computing requirements of Ascend AI processor. Pre-processing Engine is mainly used to pre-process the media data, complete the encoding and decoding of images and videos, format conversion and other operations, and each function module of digital vision pre-processing needs to be called through Matrix uniformly.

      • Model Inference Engine is used in the neural network inference of data stream. It mainly uses the loaded model and the input data stream to complete the forward algorithm of neural network.

      • After the Model Inference Engine outputs its results, the Post-processing Engine performs subsequent processing on the output data, such as framing and labeling recognized images.

        Figure 6.10 shows a typical compute engine flow chart (a toy model of such a flow graph is sketched at the end of this subsection). Each specific data processing node in the flow chart is a compute engine. As data flows through the engines along the arranged path, the related processing and calculations are carried out in turn, and the required results are finally output; the final output of the whole flow chart is the output of the neural network calculation. The connection between two adjacent compute engine nodes is established through the configuration file of the flow chart, and the actual data between nodes flows according to the connection mode of the specific network model. After the node attributes have been configured, the whole compute engine flow chart is set running by feeding data into its starting node.

        Matrix runs above L1 chip enabling layer and below L3 application enabling layer, providing a unified standardized intermediate interface for a variety of operating systems (Linux, Android, etc.), which is responsible for the establishment, destruction and recycling of the whole compute engine flow chart.

        When establishing the compute engine flow chart, Matrix builds the flow chart according to the compute engine configuration file. Prior to execution, Matrix provides the input data: if input data such as video or images does not meet the processing requirements, the corresponding programming interface can be used to call the digital vision pre-processing module for data pre-processing; if the data already meets the requirements, the Offline Model Executor is called directly through the interface for inference. During execution, Matrix performs multi-node scheduling and multi-process management: it is responsible for running the computing processes on the device side, supervising them and collecting the relevant execution information. After the model has been executed, Matrix provides the application on the host with the means of obtaining the output results.

    (b)

      Application Scenarios of Matrix.

      Since the Ascend AI processor targets different business needs, different hardware platforms can be built for different purposes. Depending on how the specific hardware collaborates with the host side, Matrix is applied differently in typical scenarios such as the accelerator card and the Atlas 200 DK.

      • Accelerator Form of Application Scenario

        PCIe Accelerator based on Ascend AI processor is mainly oriented to data center and edge server scenarios, as shown in Fig. 6.11.

        The PCIe accelerator card supports a variety of data precisions and offers improved performance over other similar accelerators, providing greater computing power for neural network computing. In the accelerator scenario, a host must be connected to the accelerator; the host can be any server or personal computer that supports PCIe plug-in cards, and it performs the corresponding processing by calling on the neural network computing power of the accelerator.

        The function of Matrix in accelerator scenario is realized by three sub-processes: Matrix Agent, Matrix Daemon and Matrix service.

        Matrix Agent usually runs on the host. It controls and manages the Data Engine and the Post-processing Engine, handles the data interaction with the host application, controls the application, and communicates with the processing process on the device side.

        Matrix Daemon runs on the device side. It establishes the processes on the device according to the configuration file, and it is responsible for starting and managing the process orchestration on the device as well as tearing down the computing processes and recycling resources after the calculation is completed.

        Matrix Service runs on the device side and controls the Pre-processing Engine and the Model Inference Engine there. It can direct the Pre-processing Engine to call the programming interface of the Digital Vision Pre-Processing module to pre-process video and image data, and it can call the model manager programming interface of the Offline Model Executor to load the offline model and perform inference.

        Inference with the offline model of a neural network is carried out through Matrix, and the computing process is shown in Fig. 6.12.

        The inference process of the offline model can be divided into the following three steps.

        Step 1: Create the compute engine flow chart. Through Matrix, the execution process of the neural network is arranged using the different compute engines.

        Step 2: Execute the compute engine flow chart. According to the defined compute engine flow chart, the neural network function is calculated and implemented.

        After the offline model is loaded, the application notifies Matrix Agent on the host side to input the application data, and the application sends the data directly to the Data Engine for the corresponding processing. If the incoming media data does not meet the computing requirements of the Ascend AI processor, the Pre-processing Engine starts immediately and calls the interface of the DVPP module to pre-process the media data, e.g. encoding, decoding and scaling. After pre-processing, the data is returned to the Pre-processing Engine, which passes it on to the Model Inference Engine. The Model Inference Engine then calls the processing interface of the model manager to complete the inference calculation, combining the data with the loaded offline model. Once the output results are obtained, the Model Inference Engine calls the data-sending interface of Matrix to return the inference results to the Post-processing Engine, which completes the post-processing of the data and finally returns the post-processed data to the application program through Matrix. This completes the execution of the compute engine flow chart.

        Step 3: Destroy the compute engine flow chart. After all calculations are completed, the system resources occupied by the compute engines are released.

        After all the engine data has been processed and returned, the application notifies Matrix Agent to release the computing hardware resources of the Data Engine and the Post-processing Engine, while Matrix Agent notifies Matrix Service to release the resources of the Pre-processing Engine and the Model Inference Engine. Once all resources have been released, the compute engine flow chart is destroyed and Matrix Agent notifies the application, which can then run the next neural network.

      • Atlas 200 DK Application Scenario

        Atlas 200 DK Application Scenario refers to Atlas 200 Developer Kit (Atlas 200 DK) scenario based on Ascend AI processor, as shown in Fig. 6.13.

        The Atlas 200 DK exposes the core functions of the Ascend AI processor through the peripheral interfaces of the developer board, which facilitates direct external control and development of the chip and makes the neural network processing capability of the Ascend AI processor easy and intuitive to exploit. The Atlas 200 DK, based on the Ascend AI processor, can therefore be used in a wide range of artificial intelligence fields and will be a backbone of mobile terminal hardware in the future.

        For Atlas 200 DK Application Scenario, the host control function is also on the developer board, and its logical architecture is shown in Fig. 6.14.

        As the functional interface of the Ascend AI processor, Matrix handles the data interaction between the compute engine flow chart and the application program. According to the configuration file, Matrix establishes the compute engine flow chart and is responsible for scheduling, controlling and managing the process; after the calculation it destroys the flow chart and recycles the resources. During pre-processing, Matrix calls the interface of the Pre-processing Engine to carry out media pre-processing, and during inference it calls the programming interface of the model manager to load the offline model and perform inference. In the Atlas 200 DK scenario, Matrix coordinates the execution of the whole compute engine flow chart without interacting with other devices.
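      The compute-engine flow graph of Fig. 6.10 can be pictured with the small self-contained Python model below. The engine names follow the text, but the connection list, the functions and their behaviour are purely illustrative; they are not the actual Matrix configuration file or programming interface.

      # Connections of the flow chart, in the spirit of the compute engine configuration file.
      CONNECTIONS = [
          ("DataEngine", "PreprocessEngine"),
          ("PreprocessEngine", "InferenceEngine"),
          ("InferenceEngine", "PostprocessEngine"),
      ]

      ENGINES = {
          "DataEngine":        lambda d: d,                    # supplies the data source
          "PreprocessEngine":  lambda d: f"decoded({d})",      # DVPP-style decode / resize
          "InferenceEngine":   lambda d: f"inference({d})",    # forward pass of the offline model
          "PostprocessEngine": lambda d: f"labels({d})",       # framing / labelling of the results
      }

      def run_flow(data):
          data = ENGINES[CONNECTIONS[0][0]](data)              # pour data into the starting node
          for _, dst in CONNECTIONS:                           # data flows along the arranged path
              data = ENGINES[dst](data)
          return data

      print(run_flow("image.jpg"))                             # labels(inference(decoded(image.jpg)))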

  3.

    Task Scheduler

    Together with Runtime, Task Scheduler (TS) forms the dam system between hardware and software. During execution, Task Scheduler drives the hardware tasks, providing specific target tasks for Ascend AI processor, completing the task scheduling process with Runtime, and sending the output data back to Runtime which acts as a channel for task delivery and data return.

    (a)

      Introduction to the Function of Task Scheduler

      The Task Scheduler runs on the task scheduling CPU on the device side and is responsible for further dispatching the specific tasks distributed by Runtime to the AI CPU. It can also assign tasks to the AI Core for execution through the hardware Block Scheduler (BS) and return the results of task execution to Runtime afterwards. Generally, the main task types handled by the Task Scheduler include AI Core tasks, AI CPU tasks, memory replication, event recording, event waiting, maintenance and profiling.

      Memory replication is mainly carried out asynchronously. Event recording records the information about an event's occurrence; if tasks are waiting on the event, they can stop waiting and continue to execute once the event recording is completed, which eliminates the blocking of the execution flow that the recording would otherwise cause. Event waiting means that if the awaited event has already occurred, the waiting task completes directly; if it has not yet occurred, the waiting task is placed in a waiting list and the processing of all subsequent tasks in that execution flow is suspended until the event occurs, at which point the waiting task is executed.

      After a task is completed, maintenance performs the corresponding upkeep according to the task parameters and recovers the computing resources. During execution it is also possible to record and analyze the performance of the calculation, which requires profiling to control the start and pause of the profiling operation.

      The functional framework of Task Scheduler is shown in Fig. 6.15. Task Scheduler is usually located on the device side, with its function completed by task scheduling CPU. Task scheduling CPU is composed of Interface, Engine, Logic Processing, AI CPU Scheduler, Block Scheduler, SysCtrl, Profile and Log.

      The task scheduling CPU communicates and interacts with Runtime and the Driver through the scheduling interface, through which tasks are passed on to the task scheduling engine. As the main body of task scheduling, the engine is responsible for task organization, task dependencies and task scheduling control, and it manages the execution process of the whole task scheduling CPU. According to their specific types, the engine divides the tasks into three categories, computation, storage and control, and distributes them to the different scheduling logic processing modules, which manage and schedule the specific kernel function tasks, memory tasks and event dependencies between execution flows.

      The logic processing modules are Kernel Execute, DMA Execute and Event Execute. Kernel Execute schedules computing tasks, implementing the scheduling logic of tasks on the AI CPU and AI Core and scheduling the specific kernel functions. DMA Execute implements the scheduling logic of storage tasks and dispatches tasks such as memory replication. Event Execute is responsible for the scheduling logic of synchronization control tasks and for the logical processing of event dependencies between execution flows. After the scheduling logic of the different task types has been processed, the tasks are handed over directly to the corresponding control units for hardware execution.

      For the task execution of AI CPU, AI CPU scheduler in the task scheduling CPU conducts state management and task scheduling of AI CPU by software. For the task execution of AI Core, the task scheduling CPU distributes the processed tasks to AI Core through a separate block scheduler hardware, and the specific calculation is carried out by AI Core. The calculated results are also returned to the task scheduling CPU by the block scheduler.

      During the operation of the task scheduling CPU, SysCtrl configures the system and initializes the chip functions, while Profile and Log monitor the whole execution process and record the key execution parameters and specific execution details. At the end of the execution, or when an error is reported, performance profiling or error location can then be carried out, providing a basis for the subsequent evaluation of execution accuracy and efficiency.

    (b)

      The Scheduling Process of Task Scheduler.

      During the execution of a neural network's offline model, the Task Scheduler receives specific tasks from the Offline Model Executor. The dependencies among these tasks must first be resolved; the tasks are then scheduled and finally distributed to the AI Core or AI CPU according to their type, where the specific hardware completes the calculation or execution. During task scheduling, a task is composed of multiple commands (CMDs), and the Task Scheduler and Runtime interact to schedule the whole stream of task commands in an orderly way: Runtime runs on the host CPU, the command queue is located in the device memory, and the Task Scheduler issues the specific commands.

      The detailed flow of task scheduler’s scheduling process is shown in Fig. 6.16.

      First, Runtime calls the driver's dvCommandOcuppy interface to access the command queue, which queries the available memory space in the queue according to the tail information of the commands and returns the address of the available space to Runtime. After receiving the address, Runtime fills the prepared command into that command queue space and calls the driver's dvCommandSend interface to update the current tail and credit information of the queue. When the queue receives the new command, a doorbell interrupt is generated and the Task Scheduler is notified that a new command has been added to the command queue in device memory. The Task Scheduler then accesses the device memory, transfers the command into the scheduler's cache, and updates the head information of the command queue in the device's DDR memory. Finally, the Task Scheduler sends the command in its cache to the AI CPU or AI Core for execution. (A toy ring-buffer model of this head/tail handshake is sketched at the end of this subsection.)

      Similar to the software stack architecture of most accelerators, Runtime, the Driver and the Task Scheduler of the Ascend AI processor cooperate closely to complete tasks in an orderly way and distribute them to the corresponding hardware resources for execution. This scheduling process provides deep neural network computing with a tight and orderly delivery of tasks, which ensures the continuity and efficiency of task execution.
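      The head/tail handshake on the command queue can be modelled with the tiny ring buffer below. The driver interface names mentioned in the text (dvCommandOcuppy, dvCommandSend) appear only in the comments; the class itself is an illustrative sketch, not the real driver or Task Scheduler code.

      class CommandQueue:
          def __init__(self, depth=8):
              self.slots = [None] * depth
              self.depth = depth
              self.head = 0            # advanced by the Task Scheduler after fetching commands
              self.tail = 0            # advanced by Runtime after filling new commands

          def occupy(self):            # cf. dvCommandOcuppy: find free space behind the tail
              free = self.depth - (self.tail - self.head)
              return self.tail % self.depth if free > 0 else None

          def send(self, cmd):         # cf. dvCommandSend: write the command and advance the tail
              slot = self.occupy()
              if slot is None:
                  raise RuntimeError("command queue full")
              self.slots[slot] = cmd
              self.tail += 1           # a doorbell interrupt would now notify the Task Scheduler

          def fetch(self):             # Task Scheduler side: drain new commands, advance the head
              cmds = []
              while self.head < self.tail:
                  cmds.append(self.slots[self.head % self.depth])
                  self.head += 1
              return cmds              # then dispatched to the AI CPU or AI Core for execution

      q = CommandQueue()
      q.send("matrix multiplication kernel")
      q.send("asynchronous memory copy")
      print(q.fetch())                 # ['matrix multiplication kernel', 'asynchronous memory copy']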

  4.

    Runtime

    The context of Runtime in the software stack is shown in Fig. 6.17. The upper layer of Runtime is TBE standard operator library and Offline Model Executor. TBE standard operator library provides neural network operators for Ascend AI processor, and Offline Model Executor is specially used to load and execute offline model. The lower layer of Runtime is driver, which interacts with Ascend AI processor.

    Runtime provides various call interfaces, such as memory interface, device interface, execution flow interface, event interface and execution control interface. Different interfaces are controlled by Runtime Engine to complete different functions, as shown in Fig. 6.18.

    The memory interface provides allocation, release and copying of High Bandwidth Memory (HBM) or Double Data Rate (DDR) memory on the device, including device-to-host, host-to-device and device-to-device data copies. These memory copies can be synchronous or asynchronous: synchronous copying means that the copy is completed before the next operation is performed, while asynchronous copying means that other operations can be performed at the same time.

    The device interface provides queries of the number and properties of the underlying devices, as well as selection and reset operations. After the offline model calls the device interface and selects a particular device, all tasks in the model are executed on that device; if a task needs to be sent to another device during execution, the device interface must be called again to select it.

    The execution flow interface provides the creation, release, priority definition, callback function setting, event dependency definition and synchronization of execution flow. These functions are related to the task execution within the execution flow, and the tasks within a single execution flow must be executed in sequence.

    If multiple execution flows need to be synchronized, the event interface is called to create and release synchronization events, record them and define dependencies on them, so as to ensure that the execution flows complete in step and that the final results of the model are output. Besides marking dependencies between execution flows, the event interface can also be used as a time marker to record the execution sequence. (A toy model of this stream-and-event synchronization is sketched at the end of this subsection.)

    In the process of execution, the execution control interface is also used. Runtime Engine completes the loading of kernel functions and the distribution of memory asynchronous replication tasks through the execution control interface and Mailbox.
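    The interplay of execution flows (streams) and events described above can be modelled with the self-contained sketch below. The names and the round-robin executor are illustrative only and do not correspond to the actual Runtime interfaces; the point is that tasks inside one flow run in submission order, while an event gates a task in another flow.

    class Event:
        def __init__(self):
            self.done = False

    class Stream:
        def __init__(self, name):
            self.name, self.queue = name, []
        def launch(self, task):
            self.queue.append(task)                # tasks in one stream run in submission order

    def execute(streams):
        # Toy round-robin executor: a task returning False means "dependency not met, retry later".
        pending = {s.name: list(s.queue) for s in streams}
        while any(pending.values()):
            for queue in pending.values():
                if queue and queue[0]() is not False:
                    queue.pop(0)

    ev = Event()
    copy_stream, compute_stream = Stream("copy"), Stream("compute")
    copy_stream.launch(lambda: print("asynchronous memcpy host -> device"))
    copy_stream.launch(lambda: setattr(ev, "done", True))      # record the event after the copy
    compute_stream.launch(lambda: None if ev.done else False)  # wait on the event before computing
    compute_stream.launch(lambda: print("kernel launched on AI Core"))
    execute([copy_stream, compute_stream])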

  5.

    Framework

    (a)

      The Functional Outline of Framework

      Framework works with the Tensor Boost Engine to generate an executable offline model for the neural network. Before the neural network is executed, Framework is tightly bound to the Ascend AI processor to generate a high-performance offline model matched to the hardware, and it connects Matrix and Runtime so that the offline model and the Ascend AI processor are deeply fused. During the execution of the neural network, Framework combines Matrix, Runtime, the Task Scheduler and the underlying hardware resources, integrating the offline model, the data and the Da Vinci Architecture to optimize the execution process and obtain the application output of the neural network.

      Framework consists of three parts: Offline Model Generator (OMG), Offline Model Executor (OME) and AI Model Manager, as shown in Fig. 6.19.

      Developers use the Offline Model Generator to generate an offline model, which is saved with an “.om” extension. Matrix in the software stack then calls the AI Model Manager in Framework, starts the Offline Model Executor, loads the offline model onto the Ascend AI processor, and finally completes the execution of the offline model through the whole software stack. From the creation of the offline model, through its loading onto the Ascend AI processor hardware, to the final execution of its function, Framework always plays an administrative role.

    (b)

      The Offline Model Generated by Offline Model Generator

      Taking a convolutional neural network (CNN) as an example, the corresponding network model is constructed in a deep learning framework and trained on the original data. The Offline Model Generator is then used to perform operator scheduling optimization, weight data rearrangement and compression, memory optimization and so on, and finally generates the optimized offline model. The Offline Model Generator is mainly used to generate offline models that can be executed efficiently on the Ascend AI processor.

      The working principle of the Offline Model Generator is shown in Fig. 6.20. After receiving the original model, the Offline Model Generator processes the convolutional neural network model in four steps: model analysis, quantization, compilation and serialization.

      • Model Analysis: In the process of analysis, Offline Model Generator supports the analysis of the original network model under different frameworks, extracts the network structure and weight parameters of the original model, and then the network structure is redefined by the unified intermediate graph (IR graph) through the graph representation. IR graph is composed of computing nodes and data nodes. The computing nodes are composed of TBE operators with different functions, while the data nodes receive different tensor data to provide all kinds of input data for the whole network. IR graph is composed of calculation graph and weight, covering all the information of the original model. IR graph builds a bridge between different deep learning frameworks and the software stack of Ascend AI, which makes the neural network model constructed by the external framework easily transformed into the offline model supported by Ascend AI processor.

      • Quantization: Quantization refers to low-bit quantization of high-precision data, so as to save network memory space, reduce transmission delay and improve operation efficiency. The quantization process is shown in Fig. 6.21.

        The IR graph is generated when the analysis is completed. If the model needs to be quantized, this can be done with the automatic quantization tool on the basis of the structure and weights of the IR graph. Within an operator, the weights and biases can be quantized; in the offline model generation process the quantized weights and biases are saved in the offline model, and in the inference calculation they are used to compute on the input data. A calibration set is used to train the quantization parameters during quantization to ensure accuracy. If quantization is not needed, the offline model is compiled directly.

        Quantization comes in two forms, offset quantization and non-offset quantization, which produce up to two parameters: a scale and an offset. When the quantization mode for the data is specified as non-offset, only the scale of the quantized data is computed; when it is specified as offset quantization, both the scale and the offset of the output data are computed. Weight quantization always uses the non-offset form because of the high accuracy required of the weights: for example, according to the quantization algorithm, INT8 quantization of a weight file outputs INT8 weights and a scale. In offset quantization, the bias data of FP32 type is quantized into INT32 output according to the scales of the weights and the data. (A small numeric sketch of both modes is given at the end of this subsection.)

        When there are higher requirements on model size and performance, quantization can be chosen. During offline model generation, quantization converts high-precision data into low-bit data, making the final offline model lightweight, which saves network memory space, reduces transmission delay and improves operation efficiency. Because the model storage size is dominated by operators with parameters, the Offline Model Generator focuses its quantization on such operators, for example convolution, fully connected and depthwise convolution operators.

      • Compilation: After model quantization, the model needs to be compiled. The compilation is divided into two parts: operator compilation and model compilation. The operator compilation provides the specific implementation of the operator, and the model compilation aggregates the operator models to generate the offline model structure.

        • Operator Compilation: Operator compilation generates the operators, mainly producing their specific offline structures. Operator generation is divided into three processes: input tensor description, weight data conversion and output tensor description. In the input tensor description, the input dimensions, memory size and other information of each operator are calculated, and the input data form of the operator is defined in the Offline Model Generator. In the weight data conversion, the weight parameters used by the operator undergo data format conversion (such as FP32 to FP16), shape conversion (such as fractal rearrangement), data compression and so on. In the output tensor description, the output dimensions, memory size and other information of the operator are calculated.

          The process of operator generation is shown in Fig. 6.22. During operator generation, the shape of the output data needs to be analyzed, determined and described through the interface of the TBE operator acceleration library, and data format conversion can also be performed through the same interface.

          The Offline Model Generator receives the IR graph generated for the neural network, describes each node in the IR graph, and analyzes the input and output of each operator one by one. It analyzes the source of the current operator's input data to obtain the types of the operators in the upper layer that are directly connected to it, and enters the operator library through the interface of the TBE operator acceleration library to find the output data description of the source operator. The output data information of the source operator is then returned to the Offline Model Generator as the specific input tensor description of the current operator, so the description of the current operator's input data is obtained from the output information of its source operator.

          If a node in the IR graph is not an operator but a data node, no input tensor description is needed. If the operator has weight data, as convolution and fully connected operators do, the weight data must be described and processed. If the input weight data is of FP32 type, the Offline Model Generator converts it to FP16 to meet the data type requirements of the AI Core. After the type conversion, the Offline Model Generator calls the ccTransFilter interface to rearrange the weight data so that the weight input shape meets the format requirements of the AI Core. After obtaining the weights in the fixed format, the Offline Model Generator calls the ccCompressWeight interface provided by TBE to compress and optimize the weights, reducing the weight memory footprint and making the model lighter. Once the weight data conversion is completed, the weight data that meets the calculation requirements is returned to the Offline Model Generator.

          After the weight data conversion, the Offline Model Generator also needs to describe the output data information of the operator and determine the output tensor form. For high-level complex operators such as convolution and pooling, the Offline Model Generator can obtain the operator's output tensor information directly from its input tensor information and weight information through the calculation interface provided by the TBE operator acceleration library. For a low-level simple operator, such as an addition operator, the output tensor information is determined directly by the operator's input tensor information and is then stored in the Offline Model Generator. Following this procedure, the Offline Model Generator traverses all operators in the IR graph of the network and executes the operator generation steps in a loop, describing the input and output tensors and weight data of every operator. This completes the offline structure representation of the operators and provides the operator model for the next step, model generation.

        • Model Compilation: After the operator generation is completed in the compilation process, Offline Model Generator also needs to generate the model to obtain the offline structure of the model. Offline Model Generator acquires IR Graph, conducts concurrent scheduling analysis on the operator, splits multiple IR Graph nodes into execution flows, obtaining multiple execution flows composed of operators and data input, which can be regarded as the execution sequence of operators. For nodes without interdependence, they are directly allocated to different execution flows. If the nodes in different execution flows have dependencies, the synchronization between multiple execution flows is carried out through rtEvent synchronization interface. In the case of surplus computing resources in AI Core, multi-execution flow splitting can provide multi-stream scheduling for AI Core, so as to improve the computing performance of network model. However, if there are many parallel processing tasks in AI Core, it will aggravate the degree of resource preemption and worsen the execution performance. By default, single execution flow is adopted to process the network, which can prevent the risk of blocking due to the concurrent execution of multiple tasks.

          At the same time, based on the specific execution relationship of the execution sequence of multiple operators, Offline Model Generator can independently perform operator fusion optimization and memory multiplexing optimization. According to the input and output memory information of the operator, the computational memory multiplexing is carried out, and the related multiplexing information is written into the model and operator description to generate an efficient offline model. The optimization operations can reallocate the computing resources during the execution of multiple operators to minimize the memory occupation. At the same time, frequent memory allocation and release can also be avoided during operation, so that the execution of multiple operators with minimum memory usage and the lowest data migration frequency is implemented, with a better performance and a lower demand for hardware resources.

      • Serialization: The compiled offline model is stored in memory and needs to be serialized. The serialization process mainly provides signature and encryption functions for the model files to further encapsulate and protect the integrity of the offline model. After the serialization process is completed, the offline model can be output from memory to an external file that can be called and executed by a remote Ascend AI processor chip. A minimal sketch of such a sign-and-serialize step is given below.
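      The sketch below shows a generic sign-and-serialize step of the kind described above, using an HMAC-SHA256 signature over a pickled model description. It is only an illustration under simplifying assumptions: the real offline model format, key management and encryption scheme used by Offline Model Generator are not reproduced here.

        import hashlib
        import hmac
        import pickle

        SECRET_KEY = b"example-key"   # hypothetical key, for illustration only

        def serialize_offline_model(model, path):
            payload = pickle.dumps(model)                     # in-memory model -> byte stream
            signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
            with open(path, "wb") as f:                       # the signature travels with the payload
                f.write(len(signature).to_bytes(4, "big"))
                f.write(signature)
                f.write(payload)

        def load_offline_model(path):
            with open(path, "rb") as f:
                sig_len = int.from_bytes(f.read(4), "big")
                signature, payload = f.read(sig_len), f.read()
            expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
            if not hmac.compare_digest(signature, expected):  # integrity check on loading
                raise ValueError("offline model file has been tampered with")
            return pickle.loads(payload)

        serialize_offline_model({"ops": ["conv1", "relu1"]}, "example_model.bin")
        print(load_offline_model("example_model.bin"))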

  6. 6.

    Digital Vision Pre-processing

    Digital Vision Pre-processing (DVPP) module, as the codec and image conversion module in Ascend AI software stack, plays the auxiliary function of pre-processing for neural network. When the video or image data from the system memory and network enter into the computing resources of Ascend AI processor for calculation, DVPP module needs to be called for format conversion prior to the subsequent neural network processing if the data does not meet the input format, resolution and other requirements specified by Da Vinci Architecture.

    1. (a)

      The Function Architecture of DVPP

      There are six modules in DVPP, i.e., Video Decoding (VDEC), Video Encoding (VENC), JPEG Decoding (JPEGD), JPEG Encoding (JPEGE), PNG Decoding (PNGD) and Visual Pre-processing Core (VPC).

      • VDEC provides the video decoding function of H.264/H.265, which can decode the input video stream and output the image. It is often used in the pre-processing of scenes such as video recognition.

      • VENC provides the output video encoding function. For the output data of VPC or the original input data in YUV format, VENC can encode it into an H.264/H.265 video stream, which is convenient for direct playback and display of the video.

      • JPEGD can decode images in JPEG format, converting the original input JPEG image into YUV data and pre-processing the input data for neural network inference.

      • After JPEG image processing is completed, it is necessary to use JPEGE to restore the processed data to JPEG format. The JPEGE module is mostly used for post-processing of neural network inference output data.

      • When the input picture format is PNG, PNGD needs to be called for decoding, as the PNGD module can output the PNG picture as RGB data for inference and calculation by the Ascend AI processor.

      • VPC provides other image and video processing functions, such as format conversion (e.g., YUV/RGB to YUV420), size scaling and cropping. A minimal sketch of how an input format maps to these modules is given after this list.
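      As a simple illustration of the module division above, the mapping below shows which DVPP module would be scheduled for a given input format; the module names follow the text, while the dispatch function itself is purely illustrative and is not a real DVPP interface.

        DVPP_DECODERS = {
            "h264": "VDEC",      # H.264/H.265 video streams
            "h265": "VDEC",
            "jpeg": "JPEGD",     # JPEG images -> YUV
            "png": "PNGD",       # PNG images -> RGB
        }

        def select_dvpp_modules(data_format, needs_resize_or_crop=False):
            modules = [DVPP_DECODERS.get(data_format.lower())]
            if needs_resize_or_crop:
                modules.append("VPC")   # format conversion, scaling, cropping
            return [m for m in modules if m]

        print(select_dvpp_modules("jpeg", needs_resize_or_crop=True))   # ['JPEGD', 'VPC']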

      The execution flow of the digital vision processing modules is shown in Fig. 6.23, which needs to be completed by the cooperation of Matrix, DVPP, DVPP driver and DVPP dedicated hardware.

      • At the top level of the framework is Matrix, which is responsible for scheduling the function modules in DVPP to process and manage the data flow.

      • DVPP is located in the middle and upper layer of the functional architecture, which provides Matrix with programming interfaces to call the video graphics processing module. Through these interfaces, the relevant parameters of the encoding and decoding module and the visual pre-processing module can be configured.

      • DVPP driver is located in the middle and lower layers of the functional architecture and is the module closest to the DVPP hardware. It is mainly responsible for device management, engine management and the driving of the engine module group. The driver allocates the corresponding DVPP hardware engine according to the tasks issued by DVPP, and it also reads and writes the registers in the hardware module to complete other hardware initialization work.

      • The bottom layer is the real hardware computing resource, DVPP module group, which is a special accelerator independent of other modules in Ascend AI processor. It is specially designed for the encoding, decoding and pre-processing tasks corresponding to images and videos.

    2. (b)

      The Pre-processing Mechanism of DVPP

      When the input data enters the data engine, once the engine finds that the data format does not meet the processing requirements of the subsequent AI Core, DVPP can be started for data pre-processing.

      The following takes image pre-processing as an example to describe the whole pre-processing procedure.

      • First, Matrix moves the data from Memory to Buffer of DVPP for caching.

      • According to the specific data format, the pre-processing engine completes the parameter configuration and data transmission through the programming interface provided by DVPP.

      • After the programming interface is started, DVPP transfers the configuration parameters and raw data to the driver, and the DVPP driver calls PNGD or JPEGD for initialization and task distribution.

      • PNGD or JPEGD in the DVPP dedicated hardware then performs the actual decoding of the picture, producing YUV or RGB data that meets the needs of subsequent processing.

      • After decoding, Matrix continues to call VPC through the same mechanism to further convert the image into YUV420SP format. Because YUV420SP has higher data storage efficiency and occupies less bandwidth, more data can be transmitted under the same bandwidth to meet the demand of the powerful computing throughput of AI Core. At the same time, DVPP can also complete image cropping and scaling. Figure 6.24 shows a typical crop-and-zero-padding operation to change the image size: VPC takes out the part of the original image to be processed and then performs zero-filling on this part, so as to retain the edge feature information in the calculation process of the CNN (convolutional neural network). The zero-padding operation needs four filling sizes, i.e., top, bottom, left and right, and the edge of the image is expanded in the filling area. Finally, the padded image can be calculated directly. A minimal code sketch of this crop-and-pad step is given at the end of this subsection.

      • After a series of pre-processing, the image data can be processed in the following two ways.

        • The image data can be further pre-processed by AIPP (AI Pre-processing) according to the requirements of the model (this step is optional: if the output data of DVPP already meets the image requirements, it needn't be processed by AIPP), and then the image data that meets the requirements enters AI Core under the control of AI CPU for the required neural network calculation.

        • The output image data is encoded by JPEGE and the post-processing is completed. The data is put into the buffer of DVPP, and finally Matrix takes out the data for subsequent operation. At the same time, the computing resources of DVPP are released and the cache is recovered.

In the whole pre-processing process, Matrix completes the function call of different modules. As a customized data supply module, DVPP uses heterogeneous or special processing methods to convert image data quickly, providing sufficient data sources for AI Core, thus meeting the needs of large amount of data and large bandwidth in neural network computing.
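The crop-and-pad step of Fig. 6.24 can be illustrated with a few lines of NumPy. This is a minimal sketch, assuming an image in height-width-channel layout; the four padding sizes (top, bottom, left and right) follow the description above, and the function is not part of any real DVPP API.

    import numpy as np

    def crop_and_zero_pad(image, crop_box, pads):
        """Cut out a region of interest, then expand its edges with zero padding."""
        y0, y1, x0, x1 = crop_box          # part of the original image to be processed
        top, bottom, left, right = pads    # zero-filled border sizes
        roi = image[y0:y1, x0:x1, :]
        return np.pad(roi, ((top, bottom), (left, right), (0, 0)), mode="constant")

    frame = np.random.randint(0, 255, (1080, 1920, 3), dtype=np.uint8)
    patch = crop_and_zero_pad(frame, crop_box=(100, 612, 200, 712), pads=(4, 4, 4, 4))
    print(patch.shape)   # (520, 520, 3): a 512 x 512 crop plus a 4-pixel zero border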

Fig. 6.8
figure 8

The structure of TBE

Fig. 6.9
figure 9

The three application scenarios of TBE

Fig. 6.10
figure 10

Flow chart of compute engine in deep neural network application

Fig. 6.11
figure 11

PCIe accelerator

Fig. 6.12
figure 12

The computing process of offline model inference through Matrix

Fig. 6.13
figure 13

Atlas 200 developer kit (Atlas 200 DK)

Fig. 6.14
figure 14

Logical architecture of Atlas 200 Developer Kit

Fig. 6.15
figure 15

The functional framework of Task Scheduler

Fig. 6.16
figure 16

The collaboration of Runtime and Task Scheduler

Fig. 6.17
figure 17

The context of Runtime

Fig. 6.18
figure 18

Various calling interfaces provided by Runtime

Fig. 6.19
figure 19

Offline model functional framework

Fig. 6.20
figure 20

The working principle of offline model generator

Fig. 6.21
figure 21

The quantization process

Fig. 6.22
figure 22

The process of operator generation

Fig. 6.23
figure 23

The execution flow of digital vision processing modules

Fig. 6.24
figure 24

Data flow of image pre-processing

6.2.4 The Data Flow of Ascend AI Processor

Taking the inference application of face recognition as an example, this section introduces the data flow of the Ascend AI processor (Ascend 310). First, the data is collected and processed by the camera; then inference is performed on the data; finally, the face recognition results are output, as shown in Fig. 6.25. A minimal code sketch of this flow is given after the step list below.

Fig. 6.25
figure 25

The data flow of Ascend AI processor (Ascend 310)

  1. 1.

    The data is collected and processed by Camera.

    • Step 1: The compressed video stream is passed from Camera, and the data is stored in DDR through PCIe channel.

    • Step 2: DVPP reads the compressed video stream into the cache.

    • Step 3: After pre-processing, DVPP writes the decompressed frame into DDR Memory.

  2. 2.

    Reasoning on data.

    • Step 4: Task Scheduler sends commands to Direct Memory Access (DMA) to preload AI resources from DDR to On-chip Buffer.

    • Step 5: Task Scheduler configures AI Core to execute tasks.

    • Step 6: When AI Core works, it reads the feature map and weights, and writes the result into DDR or the On-chip Buffer.

  3. 3.

    The results of face recognition are output.

    • Step 7: After AI Core completes the processing, it sends a signal to Task Scheduler. Then Task Scheduler checks the result and assigns another task if necessary, and returns to step 4.

    • Step 8: When the last AI task is completed, Task Scheduler reports the result to Host.
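To summarize the eight steps above, the sketch below models the flow as plain Python functions. The buffers and calls are stand-ins for DDR, DVPP, Task Scheduler and AI Core, not real driver interfaces, and the inference result is a hard-coded placeholder.

    def dvpp_decode(stream):
        return stream                                    # placeholder for H.264/H.265 decoding (steps 2-3)

    def preload_weights():
        return b"face-recognition-weights"               # placeholder for the offline model weights (step 4)

    def ai_core_execute(frame, on_chip):
        return "face_id_42"                              # placeholder inference result (steps 5-6)

    def report_to_host(result):
        return f"recognized: {result}"                   # placeholder for the Host report (steps 7-8)

    def run_face_recognition_pipeline(compressed_stream):
        ddr = {"video": compressed_stream}               # step 1: camera data lands in DDR via PCIe
        ddr["frame"] = dvpp_decode(ddr["video"])         # steps 2-3: DVPP decodes the stream into frames
        on_chip = {"weights": preload_weights()}         # step 4: DMA preloads AI resources to On-chip Buffer
        result = ai_core_execute(ddr["frame"], on_chip)  # steps 5-6: Task Scheduler configures AI Core
        return report_to_host(result)                    # steps 7-8: result checked and reported to Host

    print(run_face_recognition_pipeline(b"compressed-h264-stream"))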

6.3 Atlas AI Computing Solution

Based on the Huawei Ascend AI processors, Huawei Atlas AI computing solution builds a full-scenario AI infrastructure solution oriented to device, edge and cloud through a variety of product forms such as modules, cards, edge stations, servers and clusters. This section mainly introduces the corresponding products of Huawei Atlas AI computing solution, covering both inference and training. Inference products mainly include the Atlas 200 AI acceleration module, Atlas 200 DK, Atlas 300I inference card, Atlas 500 Intelligent Edge Station and Atlas 800 inference server, all of which use the Ascend 310 AI processor. Training products mainly include the Atlas 300T training card, Atlas 800 training server and Atlas 900 AI cluster, all of which use the Ascend 910 AI processor. The landscape of Atlas AI computing solution is shown in Fig. 6.26.

Fig. 6.26
figure 26

The landscape of Atlas AI computing solution

  1. 1.

    Atlas 200 AI Acceleration Module

    Atlas 200 AI Acceleration Module is an AI computing module with high performance and low power consumption. As it is only half the size of a credit card and consumes just 9.5 W, it can be deployed on cameras, UAVs (unmanned aerial vehicles), robots and other devices, supporting 16-channel real-time HD video analysis.

    Atlas 200 is integrated with Ascend 310 AI processor, which can realize image, video and other data analysis and inference. It can be widely used in intelligent monitoring, robot, UAV, video server and other scenes. The system block diagram of Atlas 200 is shown in Fig. 6.27.

    The following are the performance characteristics of Atlas 200.

    1. (a)

      Powered by the Huawei Ascend 310 AI processor, Atlas 200 can provide multiply-accumulate computing power of 16 TOPS (INT8) or 8 TOPS (FP16).

    2. (b)

      With rich interfaces, Atlas 200 supports PCIe 3.0 x4, RGMII, USB 2.0/USB 3.0, I2C, SPI, UART, etc.

    3. (c)

      Atlas 200 can achieve video access up to 16 channels of 1080p 30fps.

    4. (d)

      Atlas 200 supports various specifications of H.264 and H.265 video codec, which can meet different video processing requirements of users.

  2. 2.

    Atlas 200 DK

    Atlas 200 Developer Kit (Atlas 200 DK) is a kind of product with the core of Atlas 200 AI acceleration module.

    Atlas 200 DK can help AI application developers get familiar with the development environment quickly. Its main function is to expose the core capabilities of the Ascend 310 AI processor through the peripheral interfaces on the board, so that users can quickly and easily access the powerful processing capability of the Ascend 310 AI processor.

    Atlas 200 DK mainly includes Atlas 200 AI acceleration module, image/audio interface chip (Hi3559C) and LAN Switch. The system architecture is shown in Fig. 6.28.

    The performance characteristics of Atlas 200 DK are as follows.

    1. (a)

      Atlas 200 DK provides peak computing power of 16TOPS (INT8).

    2. (b)

      Atlas 200 DK supports two camera inputs, two ISP image processing, and HDR10 high dynamic range technical standard.

    3. (c)

      Matched with its strong computing power, Atlas 200 DK supports 1000 Mbit/s Ethernet, providing high-speed network access.

    4. (d)

      Atlas 200 DK provides a universal 40 pin extension interface (reserved) to facilitate product prototype design.

    5. (e)

      Atlas 200 DK supports a wide-range DC power input from 5 V to 28 V.

    The product specifications of Atlas 200 DK are shown in Table 6.1.

    Advantages of Atlas 200 DK: For developers, a development environment can be built with just a laptop, since the cost of a local independent environment is extremely low and its multiple functions and interfaces meet basic development requirements. For researchers, the collaborative mode of local development plus cloud training can be adopted to build the environment, because Huawei Cloud and Atlas 200 DK share the same protocol stack, so models trained on the cloud can be deployed locally without modification. For entrepreneurs, it provides code-level prototypes (demos) based on a reference architecture, so that only about 10% of the code needs to be modified to implement the required algorithm function; in addition, interaction between developers, community support and seamless migration to commercial products can be achieved.

  3. 3.

    Atlas 300I Inference Card

    Huawei Atlas 300I Inference Card is the industry's highest-density 64-channel video inference AI accelerator card. It includes two models, 3000 and 3010, namely the Huawei Atlas 300 AI accelerator card (Model 3000) and the Huawei Atlas 300 AI accelerator card (Model 3010); the two models mainly target different host architectures (such as x86 and ARM). Here only the Huawei Atlas 300 AI accelerator card (Model 3000) is introduced. It is designed and developed based on the Ascend 310 AI processor, adopting a PCIe half-height half-length (HHHL) card that integrates four Ascend 310 AI processors, and it cooperates with the main equipment (such as Huawei TaiShan servers) to achieve fast and efficient inference, such as image classification and object detection. The system architecture of the Huawei Atlas 300 AI accelerator card (Model 3000) is shown in Fig. 6.29.

    Atlas 300 AI accelerator card (Model 3000) is used in video analysis, OCR, speech recognition, precision marketing, medical image analysis and other scenes.

    A typical application of the Atlas 300 AI accelerator card (Model 3000) is the face recognition system. The system mainly uses a face detection algorithm, a face tracking algorithm, a face quality scoring algorithm and a high-speed face comparison and recognition algorithm to achieve real-time face snapshot modeling, real-time blacklist comparison alarms, face background retrieval and other functions.

    The architecture of face recognition system is shown in Fig. 6.30. The main components include front-end HD webcam or face snapshot machine, media stream memory server (optional), face intelligent analysis server, face comparison search server, central management server, client management software, etc. Atlas 300 AI accelerator card (Model 3000) is deployed in face intelligent analysis server, mainly to achieve video decoding/pre-processing, face detection, face alignment (correction) and face feature extraction and inference.

    The product specifications of Atlas 300 AI accelerator card (Model 3000) are shown in Table 6.2.

    The key features of the Atlas 300 AI accelerator card (Model 3000) are as follows: a standard PCIe 3.0 ×16 half-height half-length (single-slot) interface; a maximum power consumption of 67 W; support for power consumption monitoring and out-of-band management; and support for hardware H.264 and H.265 video compression and decompression.

  4. 4.

    Atlas 500 Intelligent Edge Station

    Atlas 500 Intelligent Edge Station is divided into two models, Model 3000 and Model 3010, which are designed for different CPU architectures. Here are some common features of Atlas 500 Intelligent Edge Station. It is Huawei's lightweight edge device for a wide range of edge application scenarios, characterized by strong computing performance, large storage capacity, flexible configuration, small size, a wide operating temperature range, strong environmental adaptability, and easy maintenance and management.

    Atlas 500 Intelligent Edge Station is a powerful edge computing product that can perform real-time processing on edge devices. A single Atlas 500 Intelligent Edge Station can provide 16 TOPS of INT8 processing power with extremely low power consumption. Atlas 500 Intelligent Edge Station integrates Wi-Fi and LTE wireless data interfaces to provide flexible network access and data transmission solutions.

    Atlas 500 Intelligent Edge Station is the pioneer in the industry in applying Thermo-Electric Cooling (TEC) semiconductor cooling technology in large-scale edge computing products, making it adaptive to harsh deployment environments. The logical architecture of Atlas 500 Intelligent Edge Station is shown in Fig. 6.31.

    Atlas 500 Intelligent Edge Station features ease of use in edge scenarios as well as 16-channel video analysis and storage capability.

    1. (a)

      The usability of edge scene mainly includes the following aspects.

      • Real-time: It provides real-time responses in data processing.

      • Low bandwidth: Only the necessary information is sent to Cloud.

      • Privacy protection: Customers can decide what information to send to the cloud and what to keep locally, and all information sent to the cloud can be encrypted.

      • Rapid deployment: It supports standard container engines and rapid deployment of third-party algorithms and applications.

    2. (b)

      The 16-channel video analysis and storage capacity mainly includes the following aspects.

      • It supports 16-channel video analysis capability (maximum 16-channel 1080P decoding, 16T INT8 computing power).

      • It supports a storage capacity of 12 TB; video from 16 channels of 1080p at a 4 Mbit/s data rate can be cached for 7 days, while video from 8 channels of 1080p at a 4 Mbit/s data rate can be cached for 30 days.

    Atlas 500 Intelligent Edge Station is mainly applied in intelligent video monitoring, analysis, data storage and other scenarios, including safe city, intelligent transportation, intelligent community, environmental monitoring, intelligent manufacturing, intelligent care, self-service retail, intelligent buildings, etc. It can be widely deployed in various edge locations and central equipment rooms to meet the needs of public security, communities, parks, shopping malls, supermarkets and other complex environments, as shown in Fig. 6.32. In these application scenarios, the typical architecture of Atlas 500 Intelligent Edge Station is as follows: the terminal is connected to IPCs (IP cameras) or other front-end devices through wireless or wired links; the edge extracts, stores and uploads valuable information; and the cloud data center pushes and manages models and supports development and application, as shown in Fig. 6.33.

    The product specifications of Atlas 500 Intelligent Edge Station are shown in Table 6.3.

  5. 5.

    Atlas 800 Inference Server

    Atlas 800 Inference Server is divided into two models: Model 3000 and Model 3010.

    1. (a)

      Atlas 800 Inference Server (Model 3000)

      Atlas 800 Inference Server (Model 3000) is a data center inference server based on the Huawei Kunpeng 920 processor. It can support eight Atlas 300 AI accelerator cards (Model 3000), providing powerful real-time inference capability, and is widely used in AI inference scenarios. The server is oriented to the Internet, distributed storage, cloud computing, big data, enterprise business and other fields, with the advantages of high-performance computing, large-capacity storage, low energy consumption, easy management, easy deployment and so on.

      The performance characteristics of Atlas 800 Inference Server (Model 3000) are as follows.

      • Atlas 800 Inference Server (Model 3000) supports the 64-bit high-performance multi-core Kunpeng 920 processor developed by Huawei for the server field. It integrates DDR4, PCIe 4.0, 25GE, 10GE, GE and other interfaces, and provides complete SoC functions.

        • It supports up to eight Atlas 300 AI accelerator cards (Model 3000), providing powerful real-time inference capability.

        • It supports up to 64 cores at a frequency of up to 3.0 GHz, and a variety of core-count and frequency variants are available.

        • It is compatible with the ARMv8-A architecture features, supporting the ARMv8.1 and ARMv8.2 extensions.

        • Each core is a Huawei self-developed 64-bit TaiShan core.

        • Each core integrates a 64 KB L1 instruction cache, a 64 KB L1 data cache and a 512 KB L2 cache.

        • It supports an L3 cache capacity of up to 45.5-46 MB.

        • It supports a superscalar, variable-length, out-of-order pipeline.

        • It supports 1-bit ECC error correction and 2-bit ECC error reporting.

        • It supports a high-speed inter-chip Hydra interface with a maximum channel rate of 30 Gbit/s.

        • It supports eight DDR controllers.

        • It supports up to eight physical Ethernet ports.

        • It supports three PCIe controllers, PCIe Gen4 (16 Gbit/s), with backward compatibility.

        • It supports IMU maintenance engine to collect CPU status.

      • A single Atlas 800 Inference Server (Model 3000) supports two processors and a maximum of 128 cores, which maximizes the concurrent execution of multi-threaded applications.

      • Atlas 800 Inference Server (Model 3000) supports up to 32 DDR4 ECC DIMMs at 2933 MHz; RDIMMs are supported, with a maximum memory capacity of 4096 GB.

      The logical architecture of Atlas 800 Inference Server (Model 3000) is shown in Fig. 6.34, and its features are as follows.

      • It supports two Huawei self-developed Kunpeng 920 processors, and each processor supports 16 DDR4 DIMMs.

      • CPU1 and CPU2 are interconnected by two Hydra buses, and the transmission rate can reach the maximum of 30 Gbit/s.

      • It supports two kinds of flexible Ethernet interface cards, including 4 × GE and 4 × 25GE, through the high-speed Serdes interface of CPU.

      • Raid control card is connected with CPU1 through the PCIe bus, and connected with the hard disk backplane through SAS signal cable. It can support a variety of local storage specifications through different hard disk backplanes.

      • The Baseboard Management Controller (BMC) uses Hi1710, a management chip developed by Huawei. It provides management interfaces such as Video Graphics Array (VGA), the management network port and the debugging serial port.

      Atlas 800 Inference Server (Model 3000) is an efficient inference platform based on Kunpeng processor. The product specifications are shown in Table 6.4.

    2. (b)

      Atlas 800 Inference Server (Model 3010)

      Atlas 800 Inference Server (Model 3010) is an inference platform based on Intel processors, which is widely used in AI inference scenarios. It supports up to seven Atlas 300 or Nvidia T4 AI accelerator cards and up to 448 channels of real-time HD video analysis.

      Atlas 800 Inference Server (Model 3010) has many advantages such as low power consumption, great extendibility, high reliability, easy management and deployment, etc.

      The logical architecture of Atlas 800 Inference Server (Model 3010) is shown in Fig. 6.35.

      The features of the Atlas 800 Inference Server (Model 3010) are as follows.

      • It supports one or two Intel Xeon Scalable processors.

      • It supports 24 memory DIMMs.

      • The processors are interconnected by two UltraPath Interconnect (UPI) buses, and the transmission rate can reach the maximum of 10.4 GT/s.

      • The processor is connected with three PCIe Riser cards through the PCIe bus, and supports different specifications of PCIe slots through different PCIe Riser cards.

      • Raid Controller card is connected with CPU1 through the PCIe bus, and connected with the hard disk backplane through the SAS signal cable. It can support a variety of local storage specifications through different hard disk backplanes.

      • The LBG-2 Platform Controller Hub (PCH) is used to support the following two types of interfaces.

        • It supports two on-board 10GE optical interfaces or two on-board 10GE electrical interfaces through X557 (PHY).

        • It supports two on-board GE interfaces.

      • The Hi1710 management chip is used to provide management interfaces such as VGA, the management network port and the debugging serial port.

      Atlas 800 Inference Server (Model 3010) is a flexible inference platform based on Intel processor, and its product specifications are shown in Table 6.5.

Fig. 6.27
figure 27

The system block diagram of Atlas 200

Fig. 6.28
figure 28

The system architecture of Atlas 200 DK

Table 6.1 The product specifications of Atlas 200 DK
Fig. 6.29
figure 29

The system Architecture of Huawei Atlas 300 AI accelerator card (Model 3000)

Fig. 6.30
figure 30

The architecture of face recognition system

Table 6.2 The product specifications of Atlas 300 AI Accelerator Card (Model 3000)
Fig. 6.31
figure 31

The logical architecture of Atlas 500 Intelligent Edge Station

Fig. 6.32
figure 32

Application scenarios of Atlas 500 Intelligent Edge Station

Fig. 6.33
figure 33

Typical architecture of Atlas 500 Intelligent Edge Station

Table 6.3 The product specifications of Atlas 500 Intelligent Edge Station
Fig. 6.34
figure 34

The logical architecture of Atlas 800 Inference Server (Model 3000)

Table 6.4 The product specifications of Atlas 800 Inference Server (Model 3000)
Fig. 6.35
figure 35

The logical architecture of Atlas 800 Inference Server (Model 3010)

Table 6.5 The product specifications of Atlas 800 Inference Server (Model 3010)

6.3.1 Atlas for AI Training Acceleration

  1. 1.

    Atlas 300T Training Card: The Most Powerful AI Training Card

    Huawei Atlas 300T Training Card, namely the Huawei Atlas 300 Accelerator (Model 9000), is designed and developed based on the Ascend 910 AI processor. A single card provides up to 256 TFLOPS of FP16 AI computing power for data center training scenarios, making it the most powerful AI accelerator card in the industry. It can be widely used in various general-purpose servers in data centers, providing customers with AI solutions of super performance, high efficiency and low TCO.

    Based on Ascend 910 AI processor, Huawei Atlas 300 Accelerator (Model 9000) has many features as follows.

    1. (a)

      It supports a PCIe 4.0 ×16 full-height, 3/4-length standard interface (dual-slot).

    2. (b)

      The maximum power consumption is 350 W.

    3. (c)

      It supports power consumption monitoring and out-of-band management.

    4. (d)

      It supports hardware H.264/H.265 video compression and decompression.

    5. (e)

      It supports the training framework of Huawei MindSpore and TensorFlow.

    6. (f)

      It supports Linux OS on x86 platform.

    7. (g)

      It supports Linux OS on ARM platform.

    The product specifications of Huawei Atlas 300 Accelerator (Model 9000) are shown in Table 6.6.

    The computing power of a single Atlas 300 (Model 9000) card is twice that of the mainstream training card, and the gradient synchronization delay is reduced by 70%. Figure 6.36 compares the training speed of the mainstream training card + TensorFlow combination with that of Huawei Ascend 910 + MindSpore, using ResNet-50 V1.5 on the ImageNet 2012 dataset with the optimal batch size for each. It can be seen that the training speed of Huawei Ascend 910 + MindSpore is much faster than that of the other.

  2. 2.

    Atlas 800 Training Server: The Most Powerful AI Training Server

    Atlas 800 Training Server (Model 9000) is mainly used in AI training scenarios, providing super performance and building a highly efficient, low-power AI computing platform for training. It supports multiple Atlas 300 accelerator cards or acceleration modules, adapting to various video and image analysis scenarios, and is mainly used in video analysis, deep learning training and other training scenarios.

    Atlas 800 Training Server (Model 9000) is based on Ascend 910 processor. Its computing power density is increased by 2.5 times, the hardware decoding capacity is increased by 25 times, and the energy efficiency ratio is increased by 1.8 times.

    Atlas 800 Training Server (Model 9000) has the strongest computing power density in the industry, providing up to 2 PFLOPS@FP16 of computing power in a 4U form factor.

    Atlas 800 Training Server (Model 9000) has flexible configuration and adapts to multiple loads. It supports the flexible configuration of SAS/SATA/NVMe/M.2 SSD combination. It supports on-board network card and flexible IO card, providing a variety of network interfaces.

    The product specifications of Atlas 800 Training Server (Model 9000) are shown in Table 6.7.

  3. 3.

    Atlas 900 AI Cluster: The Fastest AI Training Cluster in the World

    Atlas 900 AI Cluster represents the peak of computing power in the world. It is composed of thousands of Ascend 910 AI processors. Through the Huawei cluster communication library and job scheduling platform, it integrates three high-speed interfaces, i.e., HCCS, PCIe 4.0 and 100G RoCE, to fully release the powerful performance of the Ascend 910 AI processor. The total computing power reaches 256-1024 PFLOPS@FP16, which is equivalent to the computing power of 500,000 PCs. According to actual measurements, Atlas 900 AI Cluster can complete training based on the ResNet-50 model in 60 s, which is 15% faster than the second place, as shown in Fig. 6.37. It allows researchers to train AI models with images and voices more quickly, so that human beings can more efficiently explore the mysteries of the universe, predict the weather, explore for oil and accelerate the commercialization of autonomous driving.

    The following are key features of Atlas 900 AI Cluster.

    1. (a)

      Leader of the computing power industry: With 256-1024 PFLOPS@FP16, thousands of Ascend 910 AI processors are interconnected, providing the industry's fastest ResNet-50@ImageNet training performance.

    2. (b)

      Best cluster network: Three kinds of high-speed interfaces, HCCS, PCIe and 100G RoCE, are integrated, with vertical integration of the communication library, topology and low-latency network, achieving linearity greater than 80%.

    3. (c)

      Super heat dissipation system: A single-cabinet 50 kW hybrid liquid cooling system supports more than 95% liquid cooling, with a PUE of less than 1.1, saving 79% of equipment room space.

    In order to allow all industries to obtain super computing power, Huawei is going to deploy Atlas 900 AI Cluster on the cloud, launch the Huawei Cloud EI cluster service, and open it to global scientific research institutions and universities at a very favorable price.

Table 6.6 The product specifications of Huawei Atlas 300 Accelerator (Model 9000)
Fig. 6.36
figure 36

The speed comparison between Huawei Ascend 910 + MindSpore and the other

Table 6.7 The product specifications of Atlas 800 Training Server (Model 9000)
Fig. 6.37
figure 37

Speed comparison of Atlas 900 AI Cluster and other training clusters

6.3.2 Atlas Device-Edge-Cloud Collaboration

Huawei Atlas AI computing solution has three advantages over general industry solutions, i.e., unified development, unified operation and maintenance, and security upgrade. In the industry, different development architectures are generally used on the edge side and the center side, so models cannot flow freely between them and require secondary development. Huawei Atlas, based on the unified development architecture of Da Vinci Architecture and the Compute Architecture for Neural Networks (CANN), allows one-time development for device, edge and cloud. In addition, the industry generally provides no operation and maintenance management tool but only open APIs, so customers need to develop such tools by themselves; by contrast, FusionDirector of Huawei Atlas can manage up to 50,000 nodes, achieving unified management of central and edge devices, and both model push and device upgrade can be performed remotely. Finally, the industry generally has no encryption and decryption engine and does not encrypt the model, while Huawei Atlas protects both the transmission channel and the model with encryption for double protection. Atlas Device-Edge-Cloud collaboration, consistent training on the central side and remote update of models are shown in Fig. 6.38.

Fig. 6.38
figure 38

Atlas Device-Edge-Cloud Collaboration

6.4 Industrial Implementation of Atlas

This section mainly introduces the industry application scenarios of Atlas AI computing solutions, such as applications in power, finance, manufacturing, transportation, supercomputing and other fields.

6.4.1 Electricity: One-Stop ICT Smart Grid Solution

With the increasing dependence of modern society on electricity, the traditional extensive and inefficient way of using power cannot meet current demand, and a more efficient and reasonable power supply is needed. How to achieve a reliable, economical, efficient and green power grid is the biggest challenge for the power industry.

Relying on its leading ICT technology, Huawei, together with its partners, has launched a full range of intelligent business solutions covering all aspects including power generation, transmission, transformation, distribution and utilization. The traditional power system is deeply integrated with cloud computing, big data, Internet of things and mobile technology to achieve the comprehensive perception, interconnection and business intelligence of various power terminals.

For example, pioneering intelligent unmanned patrol inspection (see Fig. 6.39) replaces traditional manual patrol inspection, which boosts operation efficiency by five times and reduces system cost by 30%. The front-end camera of the intelligent unmanned patrol inspection system is equipped with the Huawei Atlas 200 acceleration module, which can quickly analyze problems and send back alarms. At the same time, the remote monitoring and management platform is equipped with the Atlas 300 AI training card or Atlas 800 AI server for model training, which enables remote upgrade of the model.

Fig. 6.39
figure 39

Intelligent unmanned patrol inspection

6.4.2 Intelligent Finance: Holistic Digital Transformation

Financial technology and digital financial services have become an integral part of residents' overall lifestyle, covering not only payments but also investment, deposits and loans.

One of Huawei Atlas AI computing solutions for the financial industry is the smart banking outlets, which adopts advanced access solutions, security and all-in-one technology to help customers build a new generation of smart banking outlets.

Huawei Atlas AI computing solution uses AI to transform finance, helping banking outlets carry out intelligent transformation. Through accurate identification of VIP customers, the conversion rate of potential customers is increased by 60%; through intelligent face-scanning authentication, the business processing time is reduced by 70%; and through the analysis of customer queuing time, customer complaints are reduced by 50%, as shown in Fig. 6.40.

Fig. 6.40
figure 40

Intelligent finance—intelligent transformation of bank branches

6.4.3 Intelligent Manufacturing: Digital Integration of Machines and Thoughts

In the era of industry 4.0, the deep integration of the new generation of information technology and manufacturing industry has brought far-reaching industrial changes. Mass customization, global collaborative design, intelligent factories based on Cyber Physical System (CPS) and Internet of Vehicles are reshaping the industrial value chain, forming new production modes, industrial forms, business models and economic growth points. Based on cloud computing, big data, Internet of Things (IoT) and other technologies, Huawei joins hands with global partners to help customers in the manufacturing industry reshape the value chain of manufacturing industry, innovate business models and achieve new value creation.

Huawei Atlas AI computing solution helps upgrade the intelligence of production line, using machine vision technology for intelligent detection instead of traditional manual detection. The “unstable results, low production efficiency, discontinuous process and high labor cost” of manual detection is transformed into “no missed detection, high production efficiency, cloud-edge collaboration and labor saving” of intelligent detection, as shown in Fig. 6.41.

Fig. 6.41
figure 41

Cloud-edge collaboration, intelligent detection

6.4.4 Intelligent Transportation: Easier Travel and Improved Logistics

With the acceleration of globalization and urbanization, the demand for transportation is growing day by day, which greatly drives the construction demand of a green, safe, efficient and smooth modern integrated transportation system. Adhering to the concept of “Easier Travel and Improved Logistics”, Huawei is committed to providing customers with innovative solutions such as digital railway, digital urban rail transit and smart airport. Through new ICT technologies such as cloud computing, big data, Internet of Things, agile network, BYOD, eLTE and GSM-R, Huawei improves the informatization level of the industry and helps industry customers raise their transportation service level, ensuring easier travel, improved logistics and smoother, safer traffic. Huawei Atlas AI computing solution helps to upgrade the national expressway network, so that traffic efficiency is increased by five times through cooperative vehicle-infrastructure systems, as shown in Fig. 6.42.

Fig. 6.42
figure 42

Cooperative vehicle infrastructure, improved traffic efficiency

6.4.5 Super Computer: Building State-Level AI Platform

Pengcheng Cloud Brain II (Pengcheng Yunnao II) is built mainly on Atlas 900, the fastest training cluster in the world, with the strongest computing power (E-level AI computing power), the best cluster network (HCCL collective communication supporting a 100 TB non-blocking parameter-plane network) and ultimate energy efficiency (AI cluster PUE < 1.1). Atlas thus helps Pengcheng Laboratory, the basic innovation platform undertaking this national mission, to build Pengcheng Cloud Brain II, as shown in Fig. 6.43.

Fig. 6.43
figure 43

Pengcheng laboratory

6.5 Chapter Summary

This chapter focuses on the Huawei Ascend AI processor and the Atlas AI computing solution. First, the hardware architecture and software architecture of the Ascend AI processor are introduced; then the related inference products and training products of the Atlas AI computing solution are presented; finally, the industry application scenarios of Atlas are introduced.

6.6 Exercises

  1. 1.

    As two processors for AI computing, what is the difference between CPU and GPU?

  2. 2.

    Da Vinci Architecture is specially developed for boosting AI computing power. It is not only the engine of Ascend AI computing, but also the core of Ascend AI processor. What are the three main components of Da Vinci Architecture?

  3. 3.

    What are the three kinds of basic computing resources contained in the computing unit of Da Vinci Architecture?

  4. 4.

    The software stack of Ascend AI processor is mainly divided into four levels and one auxiliary tool chain. What are the four levels? What auxiliary capabilities does the tool chain provide?

  5. 5.

    The neural network software flow of Ascend AI processor is a bridge between deep learning framework and Ascend AI processor. It provides a shortcut for neural network to transform from original model, to intermediate computing graph representation, and then to independent offline model. The neural network software flow of Ascend AI processor mainly completes the generation, loading and execution of the neural network application offline model. What are the main function modules contained in it?

  6. 6.

    Ascend AI processors are divided into two types, i.e., Ascend 310 and Ascend 910, which are both based on Da Vinci Architecture. But they are different in precision, power consumption and process. What are the differences in their application fields?

  7. 7.

    The corresponding products of Atlas AI computing solutions mainly include inference and training. What are the reasoning and training products?

  8. 8.

    Illustrate the application scenario of Atlas AI computing solution.