7.1 Hardware Accelerators
The Graphics Processing Unit (GPU) has become one of the most important types of hardware accelerators. It is designed to render 3D graphics and videos and remains core to the gaming industry. Besides creating stunning visual effects, programmers also exploit the GPU's strength in parallel processing to perform computing-heavy tasks in many fields, such as health data analytics, physical simulation, and artificial intelligence.
Recall from Chapter 2 the architecture of a typical CPU. The architecture of a GPU core is somewhat similar. Figure 7-1 shows one core of an Nvidia GTX 1080, which contains 20 such cores. Compared with a CPU core, it contains many more multithreading units, including more single-instruction-multiple-data (SIMD) function units; a GPU also has more cores overall. Another characteristic of the GPU is its small cache. This GPU contains only two levels of caches (the figure shows only the level 1 cache; the level 2 cache is shared by all 20 cores), each smaller than a typical CPU's. Besides, the GPU focuses on throughput, and thus its bandwidth between cores and memory is much larger than that of a CPU.
Another type of accelerator that has gained much attention in recent years is the Tensor Processing Unit (TPU). In 2016, Google announced the TPU, its application-specific integrated circuit, and revealed that TPUs had been deployed in Google's data centers to accelerate neural network computation.
The design of the TPU is based on the fact that matrix multiplication plays a dominant role in neural network-related computing, and it uses this operation as a primitive. In the CPU, the basic calculation unit is the scalar, so we can, for example, add two integers using a single instruction within a cycle. The GPU, on the other hand, widely utilizes multithreading, so the user can add two vectors in a cycle. The TPU goes further and finishes a matrix operation in one cycle.
A TPU v2 core consists of a Matrix Multiply Unit (MXU) and a Vector Processing Unit (VPU). The former specializes in matrix multiplication, and the latter takes care of all other types of tasks, such as activations. The MXU utilizes a systolic array architecture to enable the single-clock matrix operation. As a result, it can execute up to 128K operations in one cycle. Google reports that the TPU delivered 15-30x higher performance and 30-80x higher performance per watt than contemporary CPUs and GPUs. Besides, exploiting neural networks' tolerance to errors, the TPU also performs quantization to compress calculation by converting continuous float numbers to discrete ones.
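To make the idea of quantization concrete, the following is a minimal OCaml sketch of an affine quantization scheme that maps a float range onto 8-bit integers. This is an illustration only, not the TPU's actual algorithm, and the function names are our own.

(* Minimal sketch of affine quantization: map floats in [fmin, fmax]
   onto the integers 0..255 and back. Illustration only. *)
let quantize ~fmin ~fmax x =
  let scale = (fmax -. fmin) /. 255. in
  let q = int_of_float (Float.round ((x -. fmin) /. scale)) in
  max 0 (min 255 q)

let dequantize ~fmin ~fmax q =
  let scale = (fmax -. fmin) /. 255. in
  fmin +. (float_of_int q *. scale)

Passing a weight through quantize and then dequantize recovers it only approximately; the small accepted error is what buys the smaller, faster integer arithmetic.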
7.1.1 Utilizing Accelerators
There is no doubt that numerical computing relies heavily on hardware accelerators: TensorFlow, PyTorch, Julia, MATLAB, etc. all support multiple types of devices, including at least the CPU and GPU. In general, there are two methods to achieve this.
The first, and most widely used, is direct support of the hardware. Take the GPU as an example. When programming a GPU, Nvidia CUDA is a widely used choice. CUDA is a parallel computing platform and programming model for computing on Nvidia GPUs. In TensorFlow, a computation graph is first expressed in Python on the frontend and is then built up accordingly by the C++ backend. This graph is further optimized and partitioned onto multiple devices, which can be CPU, GPU, or TPU devices. Each device invokes the corresponding executor to run the assigned computation on its subgraph. For TPU execution, TensorFlow incorporates a compiler and software stack that translates API calls from TensorFlow computation graphs into TPU instructions. In Julia, support for Nvidia GPUs is provided by its CUDA.jl package. Built on the CUDA toolkit, it enables both interfacing with the CUDA API directly and writing CUDA kernels. NumPy does not support GPUs, but in the vast Python world, there are many GPU-friendly alternatives, such as Numba, CuPy, etc.
Compared with CUDA, the Open Computing Language (OpenCL) serves as an open standard for cross-platform parallel programming and is not limited to Nvidia GPUs. Therefore, some numerical libraries and software also support it to work on non-Nvidia GPUs. The StreamExecutor that TensorFlow utilizes to process computation tasks on a GPU device is actually a unified wrapper around the CUDA and OpenCL runtimes.
Recently, given the growing number of deep learning frameworks and the equally growing number of hardware accelerator platforms, a new approach is to utilize intermediate representations. For example, deep learning (DL) compilers have developed rapidly. A DL compiler takes the model definition described in a deep learning framework and generates an efficient implementation specific to certain target hardware. TVM [10] is one popular DL compiler that works with a wide range of frameworks and hardware devices. A closely related idea is an open neural network standard that can be converted to and from various frameworks and can also be compiled and executed on various hardware. One such example is the Open Neural Network Exchange (ONNX) format. In summary, there is a growing trend of separating out the definition of computation and leaving optimization, code generation, etc. to low-level compilers in pursuit of the best computation performance. We can think of DL compilers and open standards as the neck of an hourglass that bridges the gap between the two types of ecosystems. In the rest of this chapter, following the latter approach, we propose owl_symbolic, which converts Owl computation to ONNX so that it can further be executed on various hardware accelerators.
7.2 Design
Except for the requirement to be executed on accelerators, the development of the owl_symbolic library is motivated by several other factors. For one thing, scientific computation can be considered as consisting of two broad categories: numerical computation and symbolic computation. Owl has achieved a solid foundation in the former but has yet to support the latter, which is heavily utilized in many fields.
Besides, tasks such as visualizing a computation also require some form of intermediate representation (IR). Owl already provides a computation graph layer that separates the definition and execution of computation to improve performance, as introduced in Chapter 6, but it is not an IR layer for performing the different tasks mentioned earlier. Toward this end, we began to develop an intermediate symbolic representation of computations and to facilitate various tasks based on it.
One thing to note: do not mistake our symbolic representation for classic symbolic computation (or a computer algebra system), which manipulates mathematical expressions symbolically, much like traditional manual computation. Pursuing that kind of symbolic computation with Owl is indeed one of our core motivations, and the symbolic representation layer we provide here is the first step toward that target. More discussion will follow in future versions of the development with the support of symbolic math in Owl.
The owl_symbolic library is divided into two parts: the core symbolic representation that constructs a symbolic graph and various engines that perform different tasks based on the graph. The architecture of this system is shown in Figure 7-2.
The core abstraction is an independent symbolic representation layer. Based on this layer, we have built various engines that can be translated to and from this symbolic representation. Currently, we support three engines: the ONNX binary format, the computation graph in Owl, and the LaTeX string. The CAS engine is still an ongoing research project, and we envision that, once finished, it can be used to preprocess a symbolic representation into a simplified canonical form before it is processed by other engines.
7.2.1 Core Abstraction
The core part is designed to be minimal, containing only necessary information. It already covers many common computation types, such as math operations, tensor manipulations, and neural network-specific operations such as convolution, pooling, etc. Each symbol in the symbolic graph performs a certain operation. Input to a symbolic graph can be constants such as integers, float numbers, complex numbers, and tensors. The input can also be variables with certain shapes; an empty shape indicates a scalar value. Users can then provide values to the variables after the symbolic graph is constructed.
Symbol
The symbolic representation is defined mainly as an array of symbols. Each symbol is a graph node that has an attribute of type Owl_symbolic_symbol.t, which means we can traverse the whole graph by starting from one symbol. Besides the symbols, the name field is the graph name, and node_names contains the names of all the nodes in this graph.
type symbol = Owl_symbolic_symbol.t Owl_graph.node

type t =
  { mutable sym_nodes : symbol array
  ; mutable name : string
  ; mutable node_names : string array
  }
Let’s look at Owl_symbolic_symbol.t. It defines all the operations contained in the symbolic representation:
type t =
  | NOOP
  | Int of Int.t
  | Complex of Complex.t
  | Float of Float.t
  | Tensor of Tensor.t
  | Variable of Variable.t
  | RandomUniform of RandomUniform.t
  | Sin of Sin.t
  | Cos of Cos.t
  | Exp of Exp.t
  | ReduceSum of ReduceSum.t
  | Reshape of Reshape.t
  | Conv of Conv.t
  ....
In total, about 150 operations are included in our symbolic representation. Each operation is implemented as a module. These modules share common attributes such as names, input operation names, and output shapes, and each module may additionally contain zero or more attributes of its own. For example, the Sin operation module is implemented as
module Sin = struct
  type t =
    { mutable name : string
    ; mutable input : string array
    ; mutable out_shape : int array option array
    }

  let op_type = "Sin"

  let create ?name x_name =
    let input = [| x_name |] in
    let name = Owl_symbolic_utils.node_name ?name op_type in
    { name; input; out_shape = [| None |] }
end
The module provides properties such as op_type and functions such as create, which returns objects of type Sin.t. The name, input, and out_shape fields are common attributes across the operation modules.
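For operations that do carry attributes of their own, the module gains extra fields. The following sketch shows what a ReduceSum module could look like, modeled on the Sin module above; the exact field names in owl_symbolic may differ, and the axes/keepdims attributes here mirror their ONNX counterparts.

(* A sketch of an operation module with its own attributes;
   field names are illustrative. *)
module ReduceSum = struct
  type t =
    { mutable name : string
    ; mutable input : string array
    ; mutable out_shape : int array option array
    ; mutable axes : int array
    ; mutable keepdims : bool
    }

  let op_type = "ReduceSum"

  let create ?(keepdims = true) ?name x_name axes =
    let input = [| x_name |] in
    let name = Owl_symbolic_utils.node_name ?name op_type in
    { name; input; out_shape = [| None |]; axes; keepdims }
end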
In implementing the supported operations, we follow the categorization used in ONNX. The operations can generally be divided into the following groups:

Generators: Operations that generate data, taking no input, for example, Int, Float, Tensor, Variable, etc.

Logical: Logical operations such as Xor.

Math: Mathematical operations. This group makes up a large part of the total operations supported.

Neural network: Neural network–related operations such as convolution and pooling.

Object detection: Also used in neural networks, but these operations are closely related to object detection applications, including RoiAlign and NonMaxSuppression.

Reduction: Reduction (or folding) math operations such as sum reduction.

RNN: Recurrent neural network–related operations such as LSTM.

Tensor: Normal tensor operations, like the ones included in the Ndarray module, such as concat, reshape, etc.

Sequence: Operations that treat multiple tensors as one single object called a sequence, with corresponding functions on the sequence type, such as SequenceInsert, SequenceLength, etc.
Based on these operation modules, we provide several functions on the Owl_symbolic_symbol.t type:

name: Get the name of the operation

op_type: Get the operation type string

input: Get the input node names of an operation

set_input: Update the input node names

output: Get the output node names

set_output: Update the output node names
There are also some functions that only apply to certain types of operations. The generator operations all need to specify the type of data they support; therefore, we use the dtype function to check their data types. Another example concerns outputs. Most operations have only one output, and therefore an operation's name is also its output name. However, for operations such as MaxPool that contain multiple outputs, we need another function: output.
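These functions are implemented by dispatching on the variant type. A sketch of how name and op_type might look follows; the real implementation enumerates all of the roughly 150 cases, so the catch-all branches here stand in for the omitted ones.

(* Sketch: per-operation accessors dispatch on the variant type.
   Only a few cases are shown. *)
let name = function
  | Sin x -> Sin.(x.name)
  | Cos x -> Cos.(x.name)
  | Exp x -> Exp.(x.name)
  | _ -> failwith "name: case not shown in this sketch"

let op_type = function
  | Sin _ -> Sin.op_type
  | Cos _ -> Cos.op_type
  | Exp _ -> Exp.op_type
  | _ -> failwith "op_type: case not shown in this sketch"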
Type Checking
The types supported by owl_symbolic are listed as follows:
type number_type =
  | SNT_Noop
  | SNT_Float
  | SNT_Double
  | SNT_Complex32
  | SNT_Complex64
  | SNT_Bool
  | SNT_String
  | SNT_Int8
  | SNT_Int16
  | SNT_Int32
  | SNT_Int64
  | SNT_Uint8
  | SNT_Uint16
  | SNT_Uint32
  | SNT_Uint64
  | SNT_Float16
  | SNT_SEQ of number_type
This list covers most number and non-number types. Besides, the SNT_SEQ type composes with these basic types to indicate a list of float numbers, boolean values, strings, etc.
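For instance, a sequence of single-precision floats or of booleans is expressed by composing the constructors:

(* Composing the sequence constructor with basic types. *)
let float_seq = SNT_SEQ SNT_Float
let bool_seq = SNT_SEQ SNT_Bool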
Operators
All these operations are invisible to users. What users actually use are the operators. To build a graph, we first need to build the required attributes into an operation and then put it into a graph node. This is what an operator does. Take the sin operator as an example:
let sin ?name x =
  let xn = Owl_symbolic_graph.name x in
  let s = Owl_symbolic_ops_math.Sin.create ?name xn in
  make_node (Owl_symbolic_symbol.Sin s) [| x |]
Here, the sin operator takes its parent node x as input, gets its name as an input property, and creates a symbol node with the function make_node. This function takes an operation and an array of parent symbols and creates one symbol as its return value. It mainly creates a child node using the given operation as the node attribute, updates the child's input and output shapes, and then connects the child with its parents before returning the child node. The connection goes in both directions:
connect_ancestors parents [| child |];
let uniq_parents = Owl_utils_array.unique parents in
Array.iter (fun parent -> connect_descendants [| parent |] [| child |]) uniq_parents
Therefore, the users can use the operators to build a graph representation. Here is an example:
open Owl_symbolic
open Op
open Infix

let x = variable "x_0"

let y =
  exp ((sin x ** float 2.) + (cos x ** float 2.))
  + (float 10. * (x ** float 2.))
  + exp (pi () * complex 0. 1.)
Here, we start with the variable operator, which creates a placeholder for data to come later. You can specify the shape of the variable with the ~shape parameter; if not specified, it defaults to a scalar. You can also choose to initialize this variable with a tensor, so that even if you don't feed any data to the variable, the default tensor value will be used. A tensor in owl_symbolic is defined as
type tensor =
  { mutable dtype : number_type
  ; mutable shape : int array
  ; mutable str_val : string array option
  ; mutable flt_val : float array option
  ; mutable int_val : int array option
  ; mutable raw_val : bytes option
  }
A tensor has a specific data type, and it contains its value as a string array, float array, integer array, or raw bytes; only one of these fields can be used. If initialized with a tensor, a variable takes the same data type and shape as those of the tensor.
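As a usage sketch, the Type.make_tensor helper that appears later in this chapter builds such a record; here a 2 x 3 float tensor populates only the flt_val field. (We assume the helper's signature from the later examples.)

(* A 2 x 3 float tensor: dtype and shape are set, and of the four
   value fields, only flt_val is used. *)
let t = Type.make_tensor ~flt_val:[| 1.; 2.; 3.; 4.; 5.; 6. |] [| 2; 3 |]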
Naming
Currently, we adopt a global naming scheme: an incremental index number is appended to each node's type. For example, if a graph contains an Add symbol, a Div symbol, and then another Add symbol, the nodes will be named add_0, div_1, and add_1, respectively. One exception is the variable, which a user has to name explicitly when creating it. Of course, users can also optionally name any node in the graph, but the system checks that the name of each node is unique. The symbolic graph contains the node_names field that includes the names of all nodes in the graph.
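A minimal sketch of how such a global naming scheme can be implemented is shown below; the actual Owl_symbolic_utils.node_name may differ in detail.

(* Sketch of incremental global naming: "Sin" becomes sin_0, sin_1,
   ... unless the user supplies an explicit name. *)
let counter = ref 0

let node_name ?name op_type =
  match name with
  | Some n -> n
  | None ->
    let n = Printf.sprintf "%s_%d" (String.lowercase_ascii op_type) !counter in
    incr counter;
    n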
Shape Inferencing
One task the symbolic core needs to perform is shape checking and shape inferencing. Shape inference is performed in the make_node function and therefore happens every time a user uses an operation to construct a symbolic node and connect it with previous nodes. It is assumed that the parents of the current node are already known.
let (in_shapes : int array option array array) =
  Array.map
    (fun sym_node -> Owl_graph.attr sym_node |> Owl_symbolic_symbol.out_shape)
    parents
in
let (shape : int array option array) =
  Owl_symbolic_shape.infer_shape in_shapes sym
...
As the code shows, for each node, we first find the output shapes of its parents. The in_shapes value is of type int array option array array. You can understand it this way: int array is a shape array; int array option means this shape could be None; int array option array is the whole output of one parent, since one parent may have multiple outputs; finally, int array option array array collects the outputs of all parents. The main function Owl_symbolic_shape.infer_shape then infers the output shape of the current node and saves it to the out_shape property of that symbol.
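For example, a node with two parents, where the first parent has one known output of shape [|2; 3|] and the second parent's single output is unknown, would see the following value:

(* Illustrative value of the in_shapes type described above. *)
let in_shapes : int array option array array =
  [| [| Some [| 2; 3 |] |]   (* parent 0: one output of shape 2 x 3 *)
   ; [| None |]              (* parent 1: one output, shape unknown *)
  |]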
The infer_shape function itself checks the symbol type and then matches it with a specific implementation. For example, a large number of operations simply take one parent and keep its output shape:
let infer_shape input_shapes sym =
  match sym with
  | Sin _ -> infer_shape_01 input_shapes
  | Exp _ -> infer_shape_01 input_shapes
  | Log _ -> infer_shape_01 input_shapes
  ....

let infer_shape_01 input_shapes =
  match input_shapes.(0).(0) with
  | Some s -> [| Some Array.(copy s) |]
  | None -> [| None |]
The pattern infer_shape_01 covers these operations: it simply takes the input shape and returns the same shape.
There are two possible reasons for an input shape to be None. First, each node is initialized with a None output shape. Second, in certain cases, the output shape depends on the runtime content of the input nodes, not just their shapes and the attributes of the current node; in that case, the output shape is set to None. Once the input shapes contain None, all shape inference results downstream will also be None, which means the output shapes cannot be decided at compile time.
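A hypothetical rule for an elementwise binary operation illustrates how None propagates; this is a sketch in the style of infer_shape_01, not the library's actual code.

(* Sketch: an elementwise binary op whose output shape is known only
   when both parent shapes are known and equal. *)
let infer_shape_02 input_shapes =
  match input_shapes.(0).(0), input_shapes.(1).(0) with
  | Some s0, Some s1 when s0 = s1 -> [| Some (Array.copy s0) |]
  | _ -> [| None |]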
Multiple Outputs
Most of the operators are straightforward to implement, but some of them return multiple symbols. In that case, an operation returns not a single node but a tuple or, when the number of outputs is uncertain, an array of nodes. For example, the MaxPool operation returns two outputs: one is the normal max-pooling result, and the other is the corresponding tensor that contains the indices of the values selected during pooling. Another example is the Split operation, which splits a tensor into a list of tensors along the specified axis and returns an array of symbols.
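In practice, this means the operator's result is destructured at the call site, as in the neural network example later in this chapter (here x stands for a previously constructed symbol):

(* Keep the pooled values; discard the indices output. *)
let pooled, _indices = maxpool x ~padding:VALID ~strides:[| 2; 2 |] [| 2; 2 |]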
7.2.2 Engines
Based on this simple core abstraction, we use different engines to provide functionality: converting to and from other computation expression formats, printing to human-readable formats, graph optimization, etc. As we have said, the core part is kept minimal; if an engine requires information beyond what the core provides, each symbol has an attr property as an extension point. All engines must follow this signature:
type t

val of_symbolic : Owl_symbolic_graph.t -> t
val to_symbolic : t -> Owl_symbolic_graph.t
val save : t -> string -> unit
val load : string -> t
It means that each engine has its own core type t, be it a string or another graph format, and it needs to convert t to and from the core symbolic graph type, as well as save/load a value of type t to and from a file. An engine can also contain extra functions besides these four. Now that we have explained the design of owl_symbolic, let's look at the details of some engines in the next few sections.
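To make the signature concrete, the following is a deliberately trivial engine sketch whose type t is a plain string; the serialization logic is elided with placeholders, and only the file I/O is real.

(* A minimal engine sketch obeying the four-function signature. *)
module Echo_engine = struct
  type t = string

  let of_symbolic (_g : Owl_symbolic_graph.t) : t =
    "TODO: serialize the symbolic graph"

  let to_symbolic (_s : t) : Owl_symbolic_graph.t =
    failwith "not implemented in this sketch"

  let save (s : t) filename =
    let oc = open_out filename in
    output_string oc s;
    close_out oc

  let load filename : t =
    let ic = open_in filename in
    let n = in_channel_length ic in
    let s = really_input_string ic n in
    close_in ic;
    s
end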
7.3 ONNX Engine
The ONNX engine is the current focus of development in owl_symbolic. ONNX, the Open Neural Network Exchange format, is widely adopted: a neural network model defined in ONNX can, via suitable converters, run on different frameworks and thus different hardware accelerators. The main target of ONNX is to promote the interchangeability of neural network and machine learning models, but it is worth noting that the standard also covers a lot of basic operations in scientific computation, such as power, logarithm, trigonometric functions, etc. Therefore, the ONNX engine serves as a good starting point given its coverage of operations.
Taking a symbolic graph as input, how does the ONNX engine produce the ONNX model? We use ocaml-protoc, a protobuf compiler for OCaml, as the tool. The ONNX specification is defined in an onnx.proto file, and ocaml-protoc compiles this protobuf file into OCaml types along with serialization functions for a variety of encodings. For example, the top-level message type in onnx.proto is ModelProto, defined as follows:
message ModelProto {
  optional int64 ir_version = 1;
  repeated OperatorSetIdProto opset_import = 8;
  optional string producer_name = 2;
  optional string producer_version = 3;
  optional string domain = 4;
  optional int64 model_version = 5;
  optional string doc_string = 6;
  optional GraphProto graph = 7;
  repeated StringStringEntryProto metadata_props = 14;
};
And the generated OCaml types and serialization function are
open Owl_symbolic_specs.PT

type model_proto =
  { ir_version : int64 option
  ; opset_import : operator_set_id_proto list
  ; producer_name : string option
  ; producer_version : string option
  ; domain : string option
  ; model_version : int64 option
  ; doc_string : string option
  ; graph : graph_proto option
  ; metadata_props : string_string_entry_proto list
  }

val encode_model_proto : Onnx_types.model_proto -> Pbrt.Encoder.t -> unit
Besides metadata such as the model version, IR version, etc., a model is mainly a graph, which includes input/output information and an array of nodes. A node specifies the operator type, input and output node names, and its own attributes, such as the axis attribute in reduction operations.
Therefore, all we need to do is build up a model_proto data structure gradually, from attributes to nodes, then the graph, and finally the model. It can then be serialized using encode_model_proto to generate a protobuf-format file, and that is the ONNX model we want.
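A sketch of this last step, assuming the Pbrt runtime that ocaml-protoc generates code against, could look as follows; the helper name save_model is ours.

(* Serialize a model_proto value into a binary .onnx file using the
   encoder generated by ocaml-protoc. *)
let save_model (m : model_proto) filename =
  let encoder = Pbrt.Encoder.create () in
  encode_model_proto m encoder;
  let oc = open_out_bin filename in
  output_bytes oc (Pbrt.Encoder.to_bytes encoder);
  close_out oc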
Besides building up the model, another task to be performed in the engine is type checking and type inferencing. For example, the sine function can only accept an input of float or double number type and generates output of the same type as its input. Each type of operator has its own rules for type checking and inferencing. Starting from the input nodes, which must contain specific type information, this chain of inferencing can verify that the whole computation meets the type constraints of each node and then yield the final output types of the whole graph. The reason that type checking is performed at the engine side instead of in the core is that each engine may have different type constraints and type inferencing rules for the operators.
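As an illustration of such a rule, a per-operator typing function for Sin might look like the following sketch, using the number_type defined earlier; the function name is ours, and the real engine's rules are richer.

(* Sketch of a typing rule: Sin accepts float or double and returns
   the same type as its input. *)
let infer_type_sin (input_type : number_type) =
  match input_type with
  | SNT_Float | SNT_Double -> input_type
  | _ -> failwith "Sin: input type must be float or double"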
7.3.1 Example 1: Basic Operations
Let’s look at several examples of using the ONNX engine, starting with a simple one:
open Owl_symbolic
open Op
open Infix

let x = variable "X"
let y = variable "Y"

let z =
  exp ((sin x ** float 2.) + (cos x ** float 2.))
  + (float 10. * (y ** float 2.))

let g = SymGraph.make_graph [| z |] "sym_graph"
let m = ONNX_Engine.of_symbolic g
let _ = ONNX_Engine.save m "test.onnx"
After including the necessary library components, the code first creates a symbolic representation z using symbolic operators such as sin, pow, and float. The x and y are variables that accept user inputs. The representation is then used to create a symbolic graph; this step mainly checks for duplicate node names. Then the of_symbolic function in the ONNX engine takes the symbolic graph as input and generates a model_proto data structure, which can be further saved as a model named test.onnx.
To use this ONNX model, we could use any framework that supports ONNX. Here, we use the Python-based ONNX Runtime as an example. We prepare a simple Python script as follows:
import numpy as np
import math
import onnxruntime as rt

sess = rt.InferenceSession("test.onnx")
input_name_x = sess.get_inputs()[0].name
input_name_y = sess.get_inputs()[1].name
x = np.asarray(math.pi, dtype="float32")
y = np.asarray(3., dtype="float32")
pred_onx = sess.run(None, {input_name_x: x, input_name_y: y})[0]
print(pred_onx)
This script is very simple: it loads the ONNX model we have just created, gets the two input variables, and assigns two values to them in the sess.run command. All the user needs to know in advance is that there are two input variables in this ONNX model. Note that we can define not only scalar inputs but also tensor variables in owl_symbolic and then assign a NumPy array to them when evaluating.
7.3.2 Example 2: Variable Initialization
We can initialize the variables with tensor values so that these default values are used even if no data are passed in. Here is one example:
open Owl_symbolic
open Op

let _ =
  let flt_val = [| 1.; 2.; 3.; 4.; 5.; 6. |] in
  let t = Type.make_tensor ~flt_val [| 2; 3 |] in
  let x = variable ~init:t "X" in
  let y = sin x in
  let g = SymGraph.make_graph [| y |] "sym_graph" in
  let z = ONNX_Engine.of_symbolic g in
  ONNX_Engine.save z "test.onnx"
This computation simply takes an input variable x and then applies the sin operation. Let’s look at the Python side:
import numpy as np
import onnxruntime as rt

sess = rt.InferenceSession("test.onnx")
pred_onx = sess.run(None, input_feed={})
print(pred_onx[0])
The expected output is
[[ 0.84147096  0.9092974   0.14112   ]
 [-0.7568025  -0.9589243  -0.2794155 ]]
Note how the initializer works without the user providing any input in the input feed dictionary. Of course, users can still provide their own data to this computation, but the mechanism is a bit different: in onnxruntime, sess.get_inputs() now gives an empty set. Instead, you should use get_overridable_initializers():
input_x = sess.get_overridable_initializers()[0]
input_name_x = input_x.name
input_shape_x = input_x.shape
x = np.ones(input_shape_x, dtype="float32")
pred_onx = sess.run(None, {input_name_x: x})
7.3.3 Example 3: Neural Network
The main purpose of the ONNX standard is to express neural network models, and we have already covered most of the common operations required to construct neural networks. However, constructing a neural network model directly from existing owl_symbolic operations requires handling a lot of details, such as input shapes or creating extra nodes. For example, if we want to build a neural network with operators directly, we need to write something like
let dnn =
  let x = variable ~shape:[| 100; 3; 32; 32 |] "X" in
  let t_conv0 =
    conv
      ~padding:Type.SAME_UPPER
      x
      (random_uniform ~low:(-0.138) ~high:0.138 [| 32; 3; 3; 3 |])
  in
  let t_zero0 =
    let flt_val = Array.make 32 0. in
    let t = Type.make_tensor ~flt_val [| 32 |] in
    tensor t
  in
  let t_relu0 = relu (t_conv0 + t_zero0) in
  let t_maxpool0, _ =
    maxpool t_relu0 ~padding:VALID ~strides:[| 2; 2 |] [| 2; 2 |]
  in
  let t_reshape0 = reshape [| 100; 8192 |] t_maxpool0 in
  let t_rand0 = random_uniform ~low:(-0.0011) ~high:0.0011 [| 8192; 512 |] in
  ....
Apparently, that's too much information for users to handle. To make things easier, we create a neural network layer based on the existing symbolic operations. This lightweight layer takes only 180 LoC, and yet it provides an Owl-like clean syntax for users to construct neural networks. For example, we can construct an MNIST DNN model:
open Owl_symbolic_neural_graph

let nn =
  input [| 100; 3; 32; 32 |]
  |> normalisation
  |> conv2d [| 32; 3; 3; 3 |] [| 1; 1 |]
  |> activation Relu
  |> max_pool2d [| 2; 2 |] [| 2; 2 |] ~padding:VALID
  |> fully_connected 512
  |> activation Relu
  |> fully_connected 10
  |> activation (Softmax 1)
  |> get_network

let _ =
  let onnx_graph = Owl_symbolic_engine_onnx.of_symbolic nn in
  Owl_symbolic_engine_onnx.save onnx_graph "test.onnx"
Besides this simple DNN, we have also created complex architectures such as ResNet, InceptionV3, SqueezeNet, etc. They are all adapted from existing Owl DNN models with only minor changes. The execution of the generated ONNX model is similar:
import numpy as np
import onnxruntime as rt

sess = rt.InferenceSession("test.onnx")
input_name_x = sess.get_inputs()[0].name
input_shape_x = sess.get_inputs()[0].shape
input_x = np.ones(input_shape_x, dtype="float32")
pred_onx = sess.run(None, {input_name_x: input_x})[0]
For simplicity, we generate a dummy input for the execution/inference phase of this model. Of course, the weight data in our model is currently not trained; training a model should be completed on a framework such as TensorFlow. Combining trained weight data into the ONNX model remains future work.
Furthermore, by using tools such as js_of_ocaml, we can convert both examples into JavaScript; executing them creates the ONNX models, which in turn can be executed in the browser using ONNX.js, which utilizes WebGL. In summary, using ONNX as the intermediate format for exchanging computation across platforms enables numerous promising directions.
7.4 LaTeX Engine
The LaTeX engine takes a symbolic representation as input and produces LaTeX strings, which can then be visualized using different tools. Its design is simple, mainly matching the symbol type and projecting it to the correct implementation. Again, let's look at an example that builds up a symbolic representation of the calculation \( \exp\left(\sin(x_0)^2 + \cos(x_0)^2\right) + 10 \times x_0^2 + \exp(\pi i) \):
open Owl_symbolic
open Op
open Infix

let make_expr0 () =
  let x = variable "x_0" in
  let y =
    exp ((sin x ** float 2.) + (cos x ** float 2.))
    + (float 10. * (x ** float 2.))
    + exp (pi () * complex 0. 1.)
  in
  SymGraph.make_graph [| y |] "sym_graph"
This expression can be converted into a corresponding LaTeX string:
# let () = make_expr0 () |> LaTeX_Engine.of_symbolic |> print_endline
\exp(\sin(x_0) ^ 2 + \cos(x_0) ^ 2) + 10 \times x_0 ^ 2 + \exp(\pi \times 1.00i)
Simply putting it in raw string form is not very helpful for visualization. We have therefore built a web UI in this engine that utilizes KaTeX, which renders LaTeX strings directly in a browser. In the following, we use the html function provided by the engine to show this string on the web UI:
# let () =
    let exprs = [ make_expr0 () ] in
    LaTeX_Engine.html ~dot:true ~exprs "example.html"
The generated example.html web page is a standalone page that contains all the required scripts. Once opened in a browser, it looks like Figure 7-3.
For each expression, the web UI contains its rendered LaTeX form and corresponding computation graph.
7.5 Owl Engine
An Owl engine enables converting an Owl computation graph to or from a symbolic representation. A symbolic graph can thus benefit from Owl's concise syntax and powerful features such as algorithmic differentiation.
The conversion between Owl CGraph and the symbolic representation is straightforward, since both are graph structures. We only need to focus on making the operation projection between these two systems correct.
let cnode_attr = Owl_graph.attr node in
match cnode_attr.op with
| Sin -> Owl_symbolic_operator.sin ~name sym_inputs.(0)
| Sub -> Owl_symbolic_operator.sub ~name sym_inputs.(0) sym_inputs.(1)
| SubScalar -> Owl_symbolic_operator.sub ~name sym_inputs.(0) sym_inputs.(1)
| Conv2d (padding, strides) ->
  let pad =
    if padding = SAME
    then Owl_symbolic_types.SAME_UPPER
    else Owl_symbolic_types.VALID
  in
  Owl_symbolic_operator.conv ~name ~padding:pad ~strides sym_inputs.(0) sym_inputs.(1)
The basic idea is simple: find the type of the symbol and its input nodes in the CGraph, and then project them to the symbolic representation. For most of the math operators, such as sin, the projection is one to one, but that is not always the case. For some operations such as subtraction, we have Sub, SubScalar, ScalarSub, etc. depending on the type of input, but they can all be projected to the sub operator in the symbolic representation. And for the convolution operation, we need to first convert the parameters into a suitable form before the projection.
Let’s look at an example of using the Owl engine:
open Owl_symbolic
module G = Owl_computation_cpu_engine.Make (Owl_algodiff_primal_ops.S)
module AD = Owl_algodiff_generic.Make (G)
module OWL_Engine = Owl_symbolic_engine_owl.Make (G)

let make_graph () =
  let x = G.ones [| 2; 3 |] |> AD.pack_arr in
  let y = G.var_elt "y" |> AD.pack_elt in
  let z = AD.Maths.(sin x + y) in
  let input = [| AD.unpack_elt y |> G.elt_to_node |] in
  let output = [| AD.unpack_arr z |> G.arr_to_node |] in
  G.make_graph ~input ~output "graph"

let g = make_graph () |> OWL_Engine.to_symbolic
Here, we build a simple computation graph with the algorithmic differentiation module in Owl. Then we perform the conversion by calling OWL_Engine.to_symbolic.
We can also chain multiple engines together. For example, we can use the Owl engine to convert the computation defined in Owl to a symbolic graph, which can then be converted to an ONNX model and executed on multiple frameworks. Here is such an example: a simple computation graph created by make_graph () is processed by two chained engines and generates an ONNX model.
let _ =
  let k = make_graph () |> OWL_Engine.to_symbolic |> ONNX_Engine.of_symbolic in
  ONNX_Engine.save k "test.onnx"
And this test.onnx file can further be processed with Python code as introduced in the previous section.
7.6 Summary
In this chapter, we briefly discussed supporting hardware accelerators in Owl. To improve computation performance, it is necessary to utilize the power of hardware accelerators such as the GPU and TPU. There is a growing trend of separating the definition of computation from its execution. To this end, we built a symbolic representation based on Owl to facilitate exporting computations to other frameworks that support multiple hardware accelerators. This representation can be executed by multiple backend engines; currently, it supports ONNX, LaTeX, and Owl itself as engines. This chapter introduced the design of this symbolic representation and used several examples to demonstrate how computations in Owl can be executed on other frameworks or visualized.