Perhaps one of the most important questions about a numerical library is, "How do we make it run faster?" A scientific computing program can never run too fast or use too little memory. That is also a primary concern in the design of Owl. In this chapter, we discuss the optimization of the Ndarray module, the core module that underlies almost all computation in Owl. We first introduce this module and how it interfaces to C code. Next, we briefly introduce the basic principles of optimizing numerical code, including some common techniques. Then we use code examples from Owl to demonstrate how these techniques are applied. We finish this chapter by touching on the topic of automatic performance tuning in numerical libraries.

2.1 N-Dimensional Array in Owl

Owl numerical functions are organized in multiple modules, such as linear algebra, signal processing, statistics, optimization, etc. In real-world applications though, it is quite rare for users to only deal with scalar values in data analytics. In linear algebra, matrices are a common data object. In a neural network, the input images are often represented by three-dimensional arrays, each element being in the range of [0, 255]. Indeed, a solid implementation of n-dimensional array, or ndarray, lies at the heart of any industry-level numerical library, such as NumPy, Julia, MATLAB, etc.

The Ndarray module provides this core data structure in Owl. It builds on the Bigarray module provided by OCaml. Compared with normal OCaml arrays, a Bigarray is not limited in size, supports up to 16 dimensions, and offers more space-efficient storage. The Ndarray type is defined as follows:

type ('a, 'b) t = ('a, 'b, c_layout) Genarray.t

Here, the GADT type parameter 'b denotes the element kind, such as single-precision (32-bit) floating-point numbers, 8-bit integers, etc. Type 'a represents the type of OCaml values that can be written into the Bigarray or read back from it. Bigarrays can contain elements of various kinds, and Ndarray mainly supports four of them:

type ('a, 'b) kind =
  | Float32   : (float, float32_elt) kind
  | Float64   : (float, float64_elt) kind
  | Complex32 : (Complex.t, complex32_elt) kind
  | Complex64 : (Complex.t, complex64_elt) kind

Namely, Owl mainly supports single-precision float (S), double precision (D), single-precision complex (C), and double-precision complex (Z). Supporting complex data types is essential to applications such as signal processing using Fourier transform.

Besides the ('a, 'b) part, the definition also includes a c_layout parameter. Bigarray supports two different memory layouts. In Owl, we stick with the C-style layout, which means indices start at 0 and data is stored in row-major order. Initially, Owl aimed to support both layouts, but it soon turned out that doing so would open a can of worms without much benefit.

As an example, to create an ndarray of shape 2x3 whose elements are all 1s of type single-precision float, we can use

Dense.Ndarray.S.ones [|2;3|]

In this module naming, Dense indicates that the data is densely stored rather than using a sparse structure. Owl also supports sparse ndarray types, and their API is quite similar to that of the dense type. They also cover the four data kinds stated earlier. For example, calling Sparse.Ndarray.S.zeros [|2;3|] creates a sparsely stored zero ndarray. There are two popular formats for storing sparse matrices: the Compressed Sparse Column (CSC) format and the Compressed Sparse Row (CSR) format. Owl uses the Compressed Sparse Row format. Compared to the dense ndarray, the definition of the sparse ndarray contains some extra information:

type ('a, 'b) t =
  { mutable s : int array
  ; mutable h : (int array, int) Hashtbl.t
  ; mutable d : ('a, 'b, c_layout) Array1.t
  }
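As a generic illustration of the CSR idea (and not Owl's internal layout, which is the hash-table-based record shown above), a small sparse matrix can be stored as three flat arrays:

/* CSR storage of the 3x4 matrix
 *   [ 0 5 0 0 ]
 *   [ 0 0 0 8 ]
 *   [ 3 0 0 0 ]
 * Nonzero values and their column indices are stored row by row;
 * row_ptr[i] is the offset of row i's first nonzero, so row i occupies
 * the range [row_ptr[i], row_ptr[i+1]) in values and col_index. */
double values[]    = {5.0, 8.0, 3.0};
int    col_index[] = {1, 3, 0};
int    row_ptr[]   = {0, 1, 2, 3};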

In this chapter, we focus on the optimization of dense ndarray structures and will not further discuss details in the sparse ndarray.

The second part of the name is Ndarray, which is of course the topic of this chapter, but it should be noted that Owl also provides a matrix type. Implemented on top of Ndarray, the Matrix module supports operations that work solely on matrices, such as row_num, and it can interoperate with ndarrays when the dimension is two. For example, we can write

let x = Dense.Ndarray.S.ones [|2;3|];;
let a = Dense.Matrix.S.row_num x;;

Besides the core implementation of data structure, the importance of the Ndarray module lies in the various operations it supports. They can be categorized into multiple types:

  • Creation functions that generate dense data structures, such as empty, create, zeros, and ones. These functions return ndarrays of the specified shape and content.

  • Mathematical operations. Many of them can be categorized into three generic operation patterns: map, fold, and scan. For example, the map function transforms one ndarray to another by applying the provided function to every element in the original ndarray, and operations such as sin and log belong to this pattern.

  • Slicing operations that extract part of an ndarray or a matrix according to a slice definition, an index list that indicates which indices should be accessed and in what order for each dimension of the value being sliced. Indexing and slicing are perhaps the most important ndarray functions in any numerical library.

  • Other calculation or manipulation functions: comparison, tile or repeat an ndarray, serialization, iterate over all the elements, etc.

It is always a good idea to push the performance of these operations forward, especially the mathematical functions. Contrary to many people's impression of functional languages, OCaml is actually quite fast.Footnote 1 That being said, it would be unreasonable to claim that OCaml alone can satisfy all the performance requirements of numerical computing. More often than not, we still need to rely on the power of C or FORTRAN, as many other numerical computing libraries such as NumPy do.

Therefore, in Owl we interface most ndarray operations to C implementations. In the next several sections, we first explain how OCaml code interfaces to C and then introduce in detail the principles and techniques that are commonly applied to optimize the performance of computation.

2.2 Interface OCaml to C

To ensure performance, we implement the core computation in the Ndarray module mostly in the C language and then interface it to OCaml. In this section, we briefly introduce how this is done. The corresponding code is mostly located in the src/owl/core/ directory of the source tree. Let's use the sine function in the Ndarray module as an example. This OCaml function is a wrapper of another OCaml function, _owl_sin:

let _owl_sin : type a b. (a, b) kind -> int ->
  (a, b) owl_arr -> (a, b) owl_arr -> unit =
  fun k l x y ->
    match k with
    | Float32   -> owl_float32_sin l x y
    | Float64   -> owl_float64_sin l x y
    | Complex32 -> owl_complex32_sin l x y
    | Complex64 -> owl_complex64_sin l x y

This function serves as an entry point to the implementations of the different sine functions. As we have introduced, Owl provides four data types: float, double, complex float, and complex double. This function takes a data kind as input and applies the sine function to l elements of the input array x, storing the output in array y. Each of the called functions, for example owl_float32_sin, is declared as an external function that interfaces to a C function outside the OCaml world, here one called "float32_sin":

external owl_float32_sin : int
  -> ('a, 'b) owl_arr
  -> ('a, 'b) owl_arr -> unit = "float32_sin"

We do not implement the C function float32_sin directly, since we observe that the implementations of four different types of sine functions are mostly the same. Therefore, we utilize C macros. Here is the template (FUN4) that can be used to implement a series of mathematical functions:

#ifdef FUN4

CAMLprim value FUN4(value vN, value vX, value vY) {
  CAMLparam3(vN, vX, vY);
  int N = Long_val(vN);
  INIT;
  struct caml_ba_array *X = Caml_ba_array_val(vX);
  NUMBER *X_data = (NUMBER *) X->data;
  struct caml_ba_array *Y = Caml_ba_array_val(vY);
  NUMBER *Y_data = (NUMBER *) Y->data;
  NUMBER *start_x, *stop_x;
  NUMBER *start_y;
  caml_release_runtime_system();  /* Allow other threads */
  start_x = X_data;
  stop_x = start_x + N;
  start_y = Y_data;
  while (start_x != stop_x) {
    NUMBER x = *start_x;
    *start_y = (MAPFN(x));
    start_x += 1;
    start_y += 1;
  };
  caml_acquire_runtime_system();  /* Disallow other threads */
  CAMLreturn(Val_unit);
}

#endif /* FUN4 */

This C function must satisfy certain specifications. The returned value of float32_sin must be of type CAMLprim value, and its parameters are of type value. Several macros can be used to convert these value types into native C types; they are included in the <caml/mlvalues.h> header file. For example, we use Long_val to convert a parameter into an integer and Caml_ba_array_val to convert an OCaml Bigarray value into an array of numbers. The computation itself is then straightforward: apply the function MAPFN to every element of array x in a loop, saving the output in array y. Since the OCaml function _owl_sin returns unit, this C function must return the macro Val_unit.

Notice that we haven’t specified several macros in this template yet: the number type and the function to be applied on the array. For that, we use

#define FUN4 float32_sin
#define NUMBER float
#define MAPFN(X) (sinf(X))
#include OWL_NDARRAY_MATHS_MAP

Here, we define a function named float32_sin that uses the aforementioned template, where the number type is float and the function applied to each element of the input array is sinf from the standard C header <math.h>. We can easily change the macros to obtain the sine function at other precisions or other mathematical functions such as exp, log, etc.
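Following the same pattern, the double-precision variant can plausibly be produced by redefining the macros, roughly as follows (a sketch mirroring the float32 case; the exact macro set in Owl's source may include a few more definitions):

#define FUN4 float64_sin
#define NUMBER double
#define MAPFN(X) (sin(X))
#include OWL_NDARRAY_MATHS_MAP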

To sum up, when we call Dense.Ndarray.S.sin x in Owl, it ends up calling the sinf function on all float numbers in x. With this design, we can also choose implementations from other libraries, such as the Intel Math Kernel Library (MKL), which can achieve superior performance on Intel-based machines. That is indeed what we do for many other modules. Instead of handcrafting all the highly sophisticated linear algebra functions and so on, we interface their implementations to libraries such as OpenBLAS (linear algebra routines), SFMT (random number generation), FFTPack (fast Fourier transform), Cephes (special mathematical functions), etc. But we try to keep the reliance on third-party libraries to a minimum while avoiding reinventing wheels.

Besides the interface mechanism we have introduced, we can also utilize the foreign function interface to invoke routines written in a different programming language from OCaml. This approach is nicely explained in Real World OCaml [38], a book recommended for anyone who wants to dig deeper into the OCaml world.

Implementing the core code in C enables multiple optimization opportunities: single-instruction-multiple-data (SIMD) vectorization, parallel processing on multicore hardware [4], loop unrolling, tiling to improve cache efficiency [48], etc. Besides, we can also build systems that automatically choose suitable parameters so that Owl fits the hardware/software platform it runs on. We discuss these topics in detail in the rest of this chapter.

2.3 Core Optimization of Computation

Ever since the creation of computers, one question has kept resurfacing: "How do we improve the performance of a piece of code?" Back in the early days, using a computer was a privilege, and users had to make every second of their limited computing time count. Nowadays, every PC gamer wants high-quality visual effects, not to mention the scientific researchers who would be overjoyed if the running time of their simulation programs could be cut in half.

In the good old days, the answer was simple: we could free-ride on Moore's Law, "the number of transistors in a dense integrated circuit doubles about every two years." Similar exponential growth appeared in quite a few related areas. With ever higher CPU clock speeds, any code automatically ran faster, not to mention the dramatic increases in memory and storage capacity. All you needed to do was wait for a while.

However, the free lunch is over, and clock speeds have stagnated for years.Footnote 2 Faced with this situation, programmers need to achieve performance gains in fundamentally different ways on modern hardware. This generally involves two aspects: utilizing the parallel computing mechanisms provided by processors and reducing access to data on slow storage media.

Figure 2-1 shows a much simplified yet representative architectural illustration of a quad-core CPU.Footnote 3 It shows some of the core components of a processor. The CPU follows a "fetch-decode-execute" cycle to process instructions, which works roughly as follows. When the program counter indicates the next instruction to be executed, the fetch/decode unit fetches the instruction from memory and decodes it into actions the computer can understand. Next, the Arithmetic Logic Unit (ALU) is invoked to perform the required computation. This process repeats until the program stops. Throughout, the execution context holds the necessary state, such as intermediate results, the program counter, registers, etc. Besides these parts, the CPU also contains several levels of cache, which finally connect to memory. This model provides a good approximation of real CPUs. In the rest of this section, we explain these different parts and how they benefit the performance of computation.

Figure 2-1 Architectural illustration of a quad-core CPU

Figure 2-2 Processor parallelism: multiple fetch and decode modules

2.3.1 CPU Support of Parallel Computing

2.3.1.1 Instruction-Level Parallelism

A program consists of a stream of instructions. A basic processor model executes one instruction per clock cycle. But instructions do not strictly depend on one another sequentially; some of them can be executed in parallel without interference. An architecture that exploits such instruction-level parallelism (ILP) is called superscalar, a design that dates back to the 1960s. It can execute more than one instruction per clock cycle by dispatching multiple instructions at the same time. To do so, the processor contains multiple fetch/decode and ALU modules, as Figure 2-2 shows. The parallelism within an instruction stream is discovered automatically and dynamically by the hardware during execution and is thus not visible to programmers.

Figure 2-3 Comparison of conventional computing and SIMD

2.3.2 Vectorization

The ALU also has the potential to process more than one piece of data per clock cycle. Single instruction, multiple data (SIMD) refers to a computing method that uses a single instruction to process multiple pieces of data. It contrasts with the conventional method that processes one piece of data, with operations such as load, add, etc., per instruction. The comparison of these two methods is shown in Figure 2-3. In this example, one instruction processes four float numbers at the same time.

SIMD is now widely supported across hardware architectures. Back in 1996, Intel's MMX instruction set was the first widely deployed SIMD extension, enabling four 16-bit numbers to be processed at the same time. It was then extended by further instruction sets, including SSE, AVX, AVX2, etc. With AVX2, users can process sixteen 16-bit numbers per cycle. Similarly, the ARM architecture provides NEON to facilitate SIMD. Next is a simple example that uses Intel AVX2 instructions to add eight integers to another eight integers:

#include <immintrin.h>

int main(int argc, char* argv[]) {
    __m256i a = _mm256_set_epi32(1, 2, 3, 4, 5, 6, 7, 8);
    __m256i b = _mm256_set_epi32(10, 20, 30, 40, 50, 60, 70, 80);
    __m256i c = _mm256_add_epi32(a, b);
    return 0;
}

To use AVX instructions, programmers use the Intel intrinsics, included in header files such as immintrin.h. Here, _mm256_set_epi32 packs eight 32-bit integers together into one vector, and _mm256_add_epi32 adds the packed 32-bit integers. If you look at the assembly code of this simple program, part of it looks like

  ...
  vpunpcklqdq   %xmm3, %xmm0, %xmm0
  vpunpcklqdq   %xmm2, %xmm1, %xmm1
  vinserti128   $0x1, %xmm1, %ymm0, %ymm0
  vmovdqa   %ymm0, 8(%rsp)
  vmovdqa   -24(%rsp), %ymm0
  vmovdqa   %ymm0, 72(%rsp)
  vmovdqa   8(%rsp), %ymm0
  vmovdqa   %ymm0, 104(%rsp)
  vmovdqa   72(%rsp), %ymm1
  vmovdqa   104(%rsp), %ymm0
  vpaddd    %ymm0, %ymm1, %ymm0
  ...

Figure 2-4 Switching context based on hardware threads to hide latency

It can be seen that instead of a normal add instruction, the compiler emits vpaddd, which adds two sets of packed doubleword integers, and uses special registers such as ymm0. In general, using SIMD can significantly improve the performance of computation.
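As a more realistic sketch, element-wise array addition can be vectorized by processing eight floats per iteration and falling back to scalar code for the tail. This is an illustration assuming AVX is available, not code from Owl:

#include <immintrin.h>

/* Add two float arrays element-wise, eight lanes at a time with AVX. */
void add_f32(const float *a, const float *b, float *c, int n) {
  int i = 0;
  for (; i + 8 <= n; i += 8) {
    __m256 va = _mm256_loadu_ps(a + i);
    __m256 vb = _mm256_loadu_ps(b + i);
    _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
  }
  for (; i < n; i++)          /* scalar tail for the remaining elements */
    c[i] = a[i] + b[i];
}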

2.3.2.1 Context Switching

As we have explained, the execution context in a processor holds the necessary state while executing instructions. But that does not mean one core has only a single execution context. For example, each core of the Intel Core i9-9900K provides two execution contexts, or "hardware threads." That opens the possibility of concurrent processing: a core can apply context switching to store the execution state of one context while running instructions from the other.

Note that unlike the previous mechanisms, context switching does not enable parallel execution within the same cycle; a core still has only its own execution units to process instructions. Instead, at each clock the core can choose to run an instruction from whichever context is ready. This is especially useful for instruction streams that contain high-latency operations such as memory reads or writes. As shown in Figure 2-4, while one instruction is executing, perhaps waiting a long while for data from memory, the core can switch to the other context and run another set of instructions. In this way, the execution latency is hidden and the overall throughput increases.

2.3.2.2 Multicore Processor

What we have introduced so far focuses on a single core. Another idea for improving computation performance is more familiar to a wider audience: the multicore processor. It integrates multiple processing units, or cores, on a single processor, enabling true parallel processing. The trend in processor development is to add more and more cores. For example, the Apple M1 contains eight cores, four high-performance and four high-efficiency; the Intel Core i9-12900HX contains 16 cores.

Similar to the previous mechanisms, the fact that a processor offers the possibility of parallel processing does not mean a piece of code magically performs better by itself. The challenge is how to properly utilize the power of multiple cores from the perspective of the OS and applications. There are various approaches for programmers to take advantage of the capabilities provided by multicore processors. For example, on Unix systems the IEEE POSIX 1003.1c standard specifies a standard programming interface, and its implementations on various platforms are called POSIX threads, or Pthreads. It manages threads, such as their creation and joining, and provides synchronization primitives such as mutexes, condition variables, locks, barriers, etc. The OCaml language is also adding native support for multicore, including parallelism via shared memory as well as concurrency, officially supported starting from the OCaml 5.0 release.

Figure 2-5 Memory hierarchy

2.3.3 Memory

Besides the processor architecture, another core component, memory, also plays a pivotal role in the performance of programs. In the rest of this section, we introduce several general principles to better utilize memory properties. For more detailed knowledge about this topic, we highly recommend the article by U. Drepper [18].

Ideally, a processing unit would only need to access one single uniform memory, as the von Neumann architecture suggests. However, such a universal memory would struggle to meet all real-world requirements at once: permanent storage, fast access, low cost, etc. Modern memory design has therefore adopted the "memory hierarchy" approach, which divides memory into several layers, each layer being a different type of memory, as shown in Figure 2-5. Broadly speaking, it consists of two categories: (1) internal memory that is directly accessible by the processor, including CPU registers, caches, and main memory; (2) external memory that is accessible to the processor through the I/O module, including flash disks, traditional hard disks, etc. Going down the hierarchy, the storage capacity increases, while the access speed and the cost per byte decrease significantly.

Registers are the closest to a computer's processor. For example, an Intel Xeon Phi processor contains 16 general-purpose registers and 32 floating-point registers, each 64 bits wide. It also contains 32 registers for AVX instructions, each 256 or 512 bits wide. Register access is the fastest in the memory hierarchy, taking only one processor cycle. The next level is the caches, at various levels. In Figure 2-1, we have seen how a processor core connects directly to them. On an Intel Core i9-9900K, the cache sizes are 64KB (L1, per core), 256KB (L2, per core), and 16MB (L3, shared). Cache access takes on the order of ten cycles, with L1 being the fastest.

2.3.3.1 Cache

Given how much faster cache access is compared to memory, utilizing the cache well is key to improving computing performance. If all of a program's data accesses could be served directly from cache, its performance would improve by orders of magnitude. Short of that ideal scenario, one principle is to utilize the cache as much as possible. Specifically, we need to exploit the locality of data access in programs, meaning that a program tends to reuse data that is "close" to what it has already used. "Close" has two meanings: first, recently used data is quite likely to be used again in the near future, which is called temporal locality; second, data at nearby addresses is also likely to be used soon, which is called spatial locality. Later, we will discuss techniques based on these principles.
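A classic illustration of spatial locality is the order in which a two-dimensional array is traversed. In the generic sketch below, the row-major loop touches consecutive addresses and is cache friendly; swapping the two loops would jump a whole row per access and defeat the cache:

#define N 1024

/* Sum a row-major 2D array in the cache-friendly order. */
double sum_row_major(double a[N][N]) {
  double s = 0.0;
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)   /* consecutive addresses */
      s += a[i][j];
  return s;
}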

As a side note, given the importance of caches, it is useful to know the cache sizes on your computer. You can of course go through the manual or specification or use commands such as lscpu. In the Owl codebase, we have employed the same approach as Eigen,Footnote 4 which is to use the cpuid instruction provided on x86 architectures. It can be used to retrieve CPU information, such as the processor type and whether features such as AVX are available. Overall, the routine query_cache_sizes first checks whether the current CPU is of x86 or x86-64 architecture. If not, cpuid might not be supported, and it only returns a conservative guess. Otherwise, it retrieves the information in a way that depends on whether the vendor is AMD or Intel. Here, the macros OWL_ARCH_x86_64 and OWL_ARCH_i386 are implemented by checking whether predefined system macros such as __x86_64__, _M_X64, __amd64, __i386, etc. are defined by the compiler. The CPUID macro is implemented in assembly using the cpuid instruction.

void query_cache_sizes(int* l1p, int* l2p, int* l3p) {
  if (OWL_ARCH_i386 || OWL_ARCH_x86_64) {
    int cpuinfo[4];
    CPUID(cpuinfo, 0x0, 0);
    if (cpu_is_amd(cpuinfo)) {
      query_cache_sizes_amd(l1p, l2p, l3p);
      return;
    }
    int highest_func = cpuinfo[1];
    if (highest_func >= 4)
      query_cache_sizes_intel(l1p, l2p, l3p);
    else {
      *l1p = 32 * 1024;
      *l2p = 256 * 1024;
      *l3p = 2048 * 1024;
    }
  } else {
    *l1p = 16 * 1024;
    *l2p = 512 * 1024;
    *l3p = 512 * 1024;
  }
}

OWL_INLINE void query_cache_sizes_intel(int* l1p, int* l2p, int* l3p) {
  int cpuinfo[4];
  int l1 = 0, l2 = 0, l3 = 0;
  int cache_id = 0;
  int cache_type = 0;
  do {
    cpuinfo[0] = cpuinfo[1] = cpuinfo[2] = cpuinfo[3] = 0;
    CPUID(cpuinfo, 0x4, cache_id);
    cache_type = (cpuinfo[0] & 0x0F) >> 0;
    if (cache_type == 1 || cache_type == 3) {
      int cache_level = (cpuinfo[0] & 0xE0) >> 5;
      int ways        = (cpuinfo[1] & 0xFFC00000) >> 22;
      int partitions  = (cpuinfo[1] & 0x003FF000) >> 12;
      int line_size   = (cpuinfo[1] & 0x00000FFF) >>  0;
      int sets        = (cpuinfo[2]);
      int cache_size  = (ways + 1) * (partitions + 1)
        * (line_size + 1) * (sets + 1);
      switch (cache_level) {
        case 1: l1 = cache_size; break;
        case 2: l2 = cache_size; break;
        case 3: l3 = cache_size; break;
        default: break;
      }
    }
    cache_id++;
  } while (cache_type > 0 && cache_id < 16);
  if (l1 == 0) l1 = 32 * 1024;
  if (l2 == 0) l2 = 256 * 1024;
  if (l3 == 0) l3 = 2048 * 1024;
  *l1p = l1; *l2p = l2; *l3p = l3;
  return;
}
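A hypothetical use of this routine is to query the sizes once at start-up and derive a blocking parameter from them; the fragment below is only an illustration, not code from Owl:

int l1, l2, l3;
query_cache_sizes(&l1, &l2, &l3);
/* keep a working block within half of the L1 cache */
int block_elems = l1 / 2 / (int) sizeof(float);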

2.3.3.2 Prefetching

To mitigate the long loading time of memory, prefetching is another popular approach. As the name suggests, the processor fetches data into cache before it is demanded, so that when it is actually used, the data can be accessed directly from the cache. Prefetching can be triggered in two ways: via certain hardware events or explicit request from the software.

Naturally, prefetching faces challenges from two aspects. The first is to know what content should be prefetched from memory. A cache is so precious that we don’t want to preload useless content, which leads to a waste of time and resources. Secondly, it is equally important to know when to fetch. For example, fetching content too early risks getting it removed from the cache before even being used.

For hardware prefetching, the processor monitors memory accesses and makes predictions about what to fetch based on certain patterns, such as a series of cache misses. The predicted memory addresses are placed in a queue, and the prefetch would look just like a normal READ request to the memory. Modern processors often have different prefetching strategies. One common strategy is to fetch the next N lines of data. Similarly, it can follow a stride pattern: if currently the program uses data at address x, then prefetch that at x+k, x+2k, x+3k, etc.

Compared with the hardware approach, software prefetching gives control to the programmer. For example, GCC provides a built-in function for this purpose:

void __builtin_prefetch (const void *addr, ...)

It takes three arguments. The first is the address of the data to be prefetched; the second is a compile-time integer indicating whether the prefetch prepares for a read from or a write to memory; and the third indicates the degree of temporal locality of the fetched data, which determines how soon it may be evicted from the cache after being accessed.
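For instance, a small sketch using all three arguments of this GCC/Clang built-in might look as follows; the distance of eight iterations is arbitrary and chosen only for illustration:

/* Sketch: prefetch eight iterations ahead while summing an array.
   The second argument 0 marks a read; the third argument 3 asks for
   high temporal locality (keep the line in cache). */
double sum_with_prefetch(const double *a, int n) {
  double s = 0.0;
  for (int i = 0; i < n; i++) {
    if (i + 8 < n) __builtin_prefetch(&a[i + 8], 0, 3);
    s += a[i];
  }
  return s;
}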

The programmer can insert __builtin_prefetch into code wherever the corresponding data is anticipated to be accessed soon. The compiler turns this built-in into data prefetch instructions where the target supports them. If the prefetch is executed at a proper moment before the access, ideally the required data will already be in the cache by the time it is used. The following code is a simple example of how it can be used in C:

for (i = 0; i < n; ++i)
  for (j = 0; j < m; ++j) {
    __builtin_prefetch(&x[i+1][j]);
    c += x[i][j];
  }

There are also other ways for software to control prefetching, such as the _mm_prefetch(char const* p, int i) intrinsic from Intel's SSE instruction set. It prefetches the line of data from memory that contains address p into the cache hierarchy; the argument i indicates the level of locality of the cached data.

However, it is still tricky to get right; sometimes, improper prefetching can even slow execution down. For one thing, it is normally difficult to know exactly how far ahead data should be fetched, especially in applications with irregular memory access, and prefetching too frequently or too early actually reduces the cache hit rate. Besides, the locality patterns of different chunks of data are also complex for programmers to manage. That is why this approach should be used with caution.

Figure 2-6 Changing from uniform memory access to non-uniform memory access

2.3.3.3 NUMA

Finally, we briefly discuss non-uniform memory access (NUMA), since it demonstrates the hardware side of improving memory access efficiency. We have mentioned the multicore design of processors. While improving parallelism, it has also brought challenges to memory performance. When an application runs on multiple cores that access a single memory uniformly, only one of them can access it at a time. In a memory-intensive application, this leads to contention for the shared memory and becomes a bottleneck. One approach to mitigate this problem is the non-uniform memory access (NUMA) design. Compared with the uniform access model we have introduced, NUMA gives each processor its own local memory, so each processor can access its own share at a fairly low cost, as shown in Figure 2-6. However, the performance of NUMA depends on the task being executed: if one processor needs to frequently access the memory attached to other processors, performance suffers.

2.4 Optimization Techniques

Computation optimization is a classic topic in computer science, and there is still a great deal of work on it in both academia and industry. Explaining even a fraction of it in detail would require a whole book. Instead, in this section we give a few examples of optimization techniques to demonstrate how the principles from the previous sections are put into practice.

2.4.1 Hardware Parallelization

Utilizing multiple cores is a straightforward way to improve computation performance, especially for complex computations on arrays with a huge number of elements. Besides the Pthreads we have introduced, Open Multi-Processing (OpenMP) is another widely used tool. OpenMP is a library that provides APIs to support shared memory multiprocessing programming in C/FORTRAN on many platforms. An OpenMP program sets up its environment in the initial sequential section and uses multiple threads in its parallel sections, which are marked by #pragma omp directives. Each thread executes the parallel section, and the threads join together after finishing. Here is an example:

#include <math.h>
#include <omp.h>

int main(int argc, char **argv) {
    float x[200000];
    #pragma omp parallel for
    for (int i = 0; i < 100000; i++) {
        x[i] = sinf(i);
    }
    return 0;
}

Here, in a big array of 200,000 elements, we compute the sine of the index for each of the first 100,000 elements. To parallelize the computation over multiple cores, we simply add one directive above the for-loop. Without specifying the number of threads, it divides the whole workload, the array, across all available cores.
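If finer control is needed, the number of threads and the scheduling policy can be set explicitly on the directive. The following is a minimal sketch, not code from Owl:

#include <math.h>
#include <omp.h>

int main(void) {
    float x[100000];
    /* Restrict this parallel loop to four threads with static scheduling. */
    #pragma omp parallel for num_threads(4) schedule(static)
    for (int i = 0; i < 100000; i++) {
        x[i] = sinf((float) i);
    }
    return 0;
}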

In Owl, we have also applied OpenMP to improve computation performance. For example, we earlier introduced the template that maps a single function over all elements of an array. We can change part of that template as follows. Here, caml_release_runtime_system releases the OCaml master lock, allowing other threads to run in parallel with the current thread:

  ...
  caml_release_runtime_system();
  start_x = X_data;
  stop_x = start_x + N;
  start_y = Y_data;
  #pragma omp parallel for schedule(static)
  for (int i = 0; i < N; i++) {
    NUMBER x = *(start_x + i);
    *(start_y + i) = (MAPFN(x));
  }
  caml_acquire_runtime_system();
  ...

We can also benefit from SIMD. For example, instead of calling the standard C math library, we can implement our own SIMD versions of the math functions. Unfortunately, this is not as simple as adding one directive, since the SIMD intrinsics do not include such complex computations. Even for a single sine function, we need to carefully implement a Taylor expansion–based algorithm using the existing intrinsics, not to mention handling the different SIMD generations (SSE, AVX2, AVX-512, etc.) and hardware vendors. In short, a performance boost from SIMD requires a significant amount of engineering work.
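To give a flavor of what such an implementation involves, the sketch below evaluates a truncated Taylor series for sine on eight packed floats. It is only accurate for small inputs; a real implementation additionally needs range reduction and accuracy guarantees, which are omitted here, and this is not Owl's actual code:

#include <immintrin.h>

/* sin(x) ~ x - x^3/3! + x^5/5! - x^7/7!, valid only for small |x|. */
__m256 vsin_approx(__m256 x) {
  __m256 x2 = _mm256_mul_ps(x, x);
  __m256 c3 = _mm256_set1_ps(-1.0f / 6.0f);
  __m256 c5 = _mm256_set1_ps( 1.0f / 120.0f);
  __m256 c7 = _mm256_set1_ps(-1.0f / 5040.0f);
  /* Horner's scheme on x^2: p = ((c7*x2 + c5)*x2 + c3)*x2 + 1 */
  __m256 p = _mm256_add_ps(_mm256_mul_ps(c7, x2), c5);
  p = _mm256_add_ps(_mm256_mul_ps(p, x2), c3);
  p = _mm256_add_ps(_mm256_mul_ps(p, x2), _mm256_set1_ps(1.0f));
  return _mm256_mul_ps(x, p);   /* x * p(x^2) */
}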

2.4.2 Cache Optimization

There are numerous cache optimization techniques, but most of them share the same theme: improve data locality (both spatial and temporal) and align code and data. If you put something into the cache, you had better make it count by reusing the cached data as much as possible. Next, we use matrix multiplication as an example. Matrix multiplication is one of the centerpieces of scientific computing. Its basic algorithm is simple:

for (i = 0; i < N; ++i)
  for (j = 0; j < N; ++j)
    for (k = 0; k < N; ++k)
      r[i][j] += mul1[i][k] * mul2[k][j];

The matrix mul1 is read row by row: (0, 0), (0, 1), ..., (0, n), (1, 0), (1, 1), ..., (1, n), .... When element (0, 0) is loaded, its neighbors are brought into the cache along with it, so (0, 1), (0, 2), etc. are then read from cache instead of memory. However, the elements of mul2 are not accessed this way: for a fixed column j, the inner loop reads (0, j), (1, j), (2, j), ..., jumping a whole row ahead on every access, so the neighboring elements loaded into the cache are wasted. One way to deal with this problem is to transpose mul2 before the multiplication:

double tmp[N][N];
for (i = 0; i < N; ++i)
  for (j = 0; j < N; ++j)
    tmp[i][j] = mul2[j][i];
for (i = 0; i < N; ++i)
  for (j = 0; j < N; ++j)
    for (k = 0; k < N; ++k)
      r[i][j] += mul1[i][k] * tmp[j][k];

Another approach utilizes the L1 cache even better. It first "cuts" a large matrix into multiple smaller square blocks, each row of which fits into the L1 cache, and iterates over these smaller matrix multiplications in outer loops. This technique is often called tiling. The algorithm is demonstrated in Figure 2-7. In a small matrix multiplication ab = c, a row of a is kept in the L1 cache (Step ➀). It is combined with successive columns of b to compute the corresponding outputs in c (Step ➁). Only after this step is the cached row evicted and a new row of a brought into the cache (Step ➂); the previous rows of a will not be used again. The algorithm is implemented as follows, where E is the number of elements in a row of the small block:

Figure 2-7 Illustration of a matrix multiplication algorithm that utilizes L1 cache

for (i = 0; i < N; i += E)
  for (j = 0; j < N; j += E)
    for (k = 0; k < N; k += E)
      for (ii = 0, rr = &r[i][j],
           am = &mul1[i][k]; ii < E;
           ++ii, rr += N, am += N)
        for (kk = 0, bm = &mul2[k][j];
             kk < E; ++kk, bm += N)
          for (jj = 0; jj < E; ++jj)
            rr[jj] += am[kk] * bm[jj];

Another technique that utilizes the cache is loop merging. Consecutive loops that sweep through the same data can be merged into one loop so that the data is reused while still in the cache, reducing memory accesses. The following code shows a simple example:

for (i = 0; i < N; i++) {
    x[i] = buf[i];
}
for (i = 0; i < N; i++) {
    y[i] = i * x[i] + bias;
}

Obviously, these two loops can be fused into one single loop:

for (i = 0; i < N; i++) {
    x[i] = buf[i];
    y[i] = i * x[i] + bias;
}

By fusing the two loops into one, the access order of the x elements changes, increasing temporal locality. The loop can be further accelerated using the parallel computing techniques we have mentioned. In some cases where loop merging cannot be applied directly, the loop alignment technique may help. For example, the two for-loops

for (i = 0; i < n; i++) {
    x[i] = y[i] + a;
}
for (i = 0; i < n; i++) {
    z[i] = x[i+1] + b;
}

can be fused after aligning the x and z arrays:

x[0] = y[0] + a;
for (i = 1; i < n; i++) {
    x[i] = y[i] + a;
    z[i-1] = x[i] + b;
}
z[n-1] = x[n] + b;

2.4.3 Other Techniques

Besides processor parallelism and cache utilization, there are still many techniques to improve code performance. We will only briefly introduce some of them in this part.

Compilers have a great impact on code performance. Compilers such as LLVM or GCC can be configured with plenty of options and flags, and choosing the most suitable ones can itself be a challenging task. Programmers can also embed inline assembly in C code to further increase execution speed. Another optimization technique, loop unrolling, is also partly about understanding how compilers work. For example, we can unroll a for-loop by a factor of eight:

for (i = 0; i < n; i++) {
    a[i] = b[i] + 1;
}

for (i = 0; i < n; i += 8) {
    a[i]   = b[i]   + 1;
    a[i+1] = b[i+1] + 1;
    a[i+2] = b[i+2] + 1;
    a[i+3] = b[i+3] + 1;
    a[i+4] = b[i+4] + 1;
    a[i+5] = b[i+5] + 1;
    a[i+6] = b[i+6] + 1;
    a[i+7] = b[i+7] + 1;
}

It allows the compiler to decrease the number of conditional branches, thus reducing potential branch mispredictions and condition evaluations.

Despite what we have explained, note that the cache does not always help. Sometimes data is put into the cache but not used again for a long while, so the cache only wastes time on writes that are never read back. In that case, it is useful to bypass the cache. Processors support nontemporal writes that go directly to memory, and SIMD also provides intrinsics for this purpose, such as the following:

Figure 2-8 Convolution algorithm illustration

#include <ammintrin.h>
void _mm_stream_sd(double *p, __m128d a);
void _mm_stream_ss(float *p, __m128 a);
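For example, a whole cache-bypassing copy can be written with the SSE streaming-store intrinsic _mm_stream_ps. The sketch below assumes dst is 16-byte aligned and n is a multiple of 4; it is an illustration, not Owl's implementation:

#include <stddef.h>
#include <immintrin.h>

/* Copy src to dst with nontemporal stores, bypassing the cache. */
void stream_copy(float *dst, const float *src, size_t n) {
  for (size_t i = 0; i < n; i += 4) {
    __m128 v = _mm_loadu_ps(src + i);
    _mm_stream_ps(dst + i, v);   /* write directly to memory */
  }
  _mm_sfence();                  /* make streaming stores globally visible */
}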

2.5 Example: Convolution

Convolution is a family of mathematical operations that is arguably the most important in deep neural networks. It makes up the backbone of most deep neural network architectures and accounts for a large share of the computation involved in their training and inference. According to the shape of the input, convolution operations can be categorized as one-, two-, or three-dimensional. They can also be categorized, according to their use in the forward or backward propagation phase, as normal convolution, backward convolution on the kernel, and backward convolution on the input. There are further variants such as transpose convolution, dilated convolution, etc., but their implementation principles are quite similar. Because of its importance, there is a lot of work on optimizing convolution [55], and it takes significant engineering effort to implement even part of it. In this section, we use the two-dimensional convolution operation Conv2D as an example to demonstrate how we apply various optimization techniques to convolution operations in Owl.

A convolution operation takes two ndarrays as input: an image I and a kernel F. In a two-dimensional convolution, both ndarrays are four-dimensional. The image ndarray has B batches; each image has size H × W and IC channels. The kernel ndarray has R rows, C columns, the same number of input channels IC, and K output channels. The convolution can then be expressed as in Eq. 2.1.

$$ \mathrm{CONV}_{b,h,w,k} = \sum_{ic=1}^{IC} \sum_{r=1}^{R} \sum_{c=1}^{C} I_{b,\,h+r,\,w+c,\,ic} \, F_{r,c,ic,k} \tag{2.1} $$

The convolution operation was first implemented in Owl by interfacing to the Eigen library, which TensorFlow also uses for its CPU convolution implementation. However, interfacing to this C++ library proved problematic and led to a lot of installation issues, which is why we turned to handcrafting the convolution operations. They come in three types: Conv, ConvBackwardKernel, and ConvBackwardInput. The Conv operation calculates the output given the input image and kernel. Similarly, ConvBackwardKernel calculates the kernel gradient given the input and output ndarrays, and ConvBackwardInput computes the input gradient from the kernel and output. The last two are mainly used in the backpropagation phase when training a DNN, but all three operations share a similar calculation algorithm.

A naive convolution algorithm implements Eq. 2.1 directly with nested for-loops, as sketched below. It is easy to see that this approach does not benefit from any parallelization and thus is not suitable for production code.
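For reference, such a naive implementation might look like the following sketch (stride 1, valid padding, NHWC layout for the image and an R × C × IC × K layout for the kernel, matching Eq. 2.1); it is not the code used in Owl:

/* Naive direct 2D convolution, following Eq. 2.1. */
void conv2d_naive(const float *I, const float *F, float *O,
                  int B, int H, int W, int IC,
                  int R, int C, int K) {
  int OH = H - R + 1, OW = W - C + 1;
  for (int b = 0; b < B; b++)
    for (int h = 0; h < OH; h++)
      for (int w = 0; w < OW; w++)
        for (int k = 0; k < K; k++) {
          float acc = 0.0f;
          for (int r = 0; r < R; r++)
            for (int c = 0; c < C; c++)
              for (int ic = 0; ic < IC; ic++)
                acc += I[((b * H + h + r) * W + (w + c)) * IC + ic]
                     * F[((r * C + c) * IC + ic) * K + k];
          O[((b * OH + h) * OW + w) * K + k] = acc;
        }
}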

The next version of the implementation uses the im2col algorithm, illustrated in Figure 2-8. In this example, we start with an input image of shape 4x4 with three input channels; each channel is denoted by a different color, and the index of each element is also shown in the figure. The kernel is of shape 2x2 and has the same three input channels as the image, each drawn in the same color as the corresponding image channel. The output channels are differentiated by different levels of transparency in the figure. According to the definition of convolution, we slide the kernel over the input image step by step, and at each position an element-wise multiplication and summation is applied. In this example, we use a stride of 1 and valid padding. In the first step, the kernel covers the elements with indices [1,2,5,6] in the first input channel, [17,18,21,22] in the second, and [33,34,37,38] in the third; the result is written into the corresponding position in the output ndarray. Moving to the second position, the input indices become [2,3,6,7,18,19,22,23,34,35,38,39], and so on. It turns out that this process can be expressed as a single matrix multiplication. The first matrix is just the flattened kernel. The second matrix is built from the input ndarray: each column is a flattened sub-block of the input with the same size as one kernel filter. This is the basic idea of the im2col algorithm. Since matrix multiplication is a highly optimized operation in linear algebra packages such as OpenBLAS, this algorithm can be executed efficiently. Here is its implementation:

CAMLprim value FUN_NATIVE (spatial) (
  value vInput_ptr, value vKernel_ptr, value vOutput_ptr,
  value vBatches, value vInput_cols, value vInput_rows, value vIn_channel,
  value vKernel_cols, value vKernel_rows,
  value vOutput_cols, value vOutput_rows, value vOut_channel,
  value vRow_stride,  value vCol_stride,
  value vPadding, value vRow_in_stride, value vCol_in_stride
) {
  struct caml_ba_array *IN = Caml_ba_array_val(vInput_ptr);
  struct caml_ba_array *KE = Caml_ba_array_val(vKernel_ptr);
  struct caml_ba_array *OU = Caml_ba_array_val(vOutput_ptr);
  TYPE *input_ptr  = (TYPE *) IN->data;
  TYPE *kernel_ptr = (TYPE *) KE->data;
  TYPE *output_ptr = (TYPE *) OU->data;
  int batches       = Long_val(vBatches);
  int input_cols    = Long_val(vInput_cols);
  int input_rows    = Long_val(vInput_rows);
  int in_channel    = Long_val(vIn_channel);
  int kernel_cols   = Long_val(vKernel_cols);
  int kernel_rows   = Long_val(vKernel_rows);
  int output_cols   = Long_val(vOutput_cols);
  int output_rows   = Long_val(vOutput_rows);
  int out_channel   = Long_val(vOut_channel);
  int row_stride    = Long_val(vRow_stride);
  int col_stride    = Long_val(vCol_stride);
  int padding       = Long_val(vPadding);
  int row_in_stride = Long_val(vRow_in_stride);
  int col_in_stride = Long_val(vCol_in_stride);
  const int input_cri  = in_channel  * input_rows  * input_cols;
  const int input_ri   = in_channel  * input_rows;
  const int output_cri = out_channel * output_rows * output_cols;
  const int output_cr  = output_rows * output_cols;
  const int output_crb = output_rows * output_cols * batches;
  const int kernel_cri = kernel_cols * kernel_rows * in_channel;
  const int kernel_cr  = kernel_cols * kernel_rows;
  const int kernel_ri  = kernel_rows * in_channel;
  memset(output_ptr, 0, batches * output_cri * sizeof(TYPE));
  INIT;
  int pr = 0, pc = 0;
  if (padding != 1) {
    pr = (row_stride * ( output_rows - 1) + kernel_rows - input_rows) / 2;
    pc = (col_stride * ( output_cols - 1) + kernel_cols - input_cols) / 2;
    if (pr < 0) pr = 0;
    if (pc < 0) pc = 0;
  }
  TYPE *inpt2d = (TYPE *) calloc(mat_size, sizeof(TYPE));
  if (inpt2d == NULL) exit(1);
  ...

The code starts by locating the pointers to the input data (image and kernel) and reading the various metadata: input channels, row/column numbers, output channels, stride, padding size, etc. It also allocates memory for the output and for an intermediate buffer. The code then implements the algorithm we have just described: three nested for-loops fill the intermediate input matrix inpt2d, which is then multiplied with the kernel matrix using the GEMM routine provided by OpenBLAS.

    ...
    for (int i = 0; i < output_crb; ++i) {
      int bt = i / output_cr;
      int cr = i % output_cr;
      int c = cr / output_rows;
      int r = cr % output_rows;
      const int cstart = c * col_stride - pc;
      const int rstart = r * row_stride - pr;
      const int cend = cstart + kernel_cols;
      const int rend = rstart + kernel_rows;
      const int input_idx_base = bt * input_cri;
      int cnt = 0;
      for (int a = cstart; a < cend; ++a) {
        for (int b = rstart; b < rend; ++b) {
          for (int h = 0; h < in_channel; ++h) {
            if (a < input_cols && a >= 0 &&
                b < input_rows && b >= 0) {
              int input_idx =
                input_idx_base + a * input_ri + b * in_channel + h;
              inpt2d[i * kernel_cri + cnt] = input_ptr[input_idx];
            }
            ++cnt;
          }
        }
      }
    }
    GEMM(CblasRowMajor, CblasNoTrans, CblasNoTrans,
      output_crb, out_channel, kernel_cri, ALPHA,
      inpt2d, kernel_cri, kernel_ptr, out_channel,
      BETA, output_ptr, out_channel);
    free(inpt2d);
    return Val_unit;
  }

However, this algorithm requires generating a large temporary intermediate matrix. Depending on the input image size, this matrix can take gigabytes of memory in applications such as FST. If you look closely at the intermediate matrix, you will find that it contains a lot of redundant information. Algorithms such as memory-efficient convolution [11] aim to reduce the size of this intermediate matrix, but still fail with large input or kernel sizes.

To reduce the memory usage, we apply the method proposed in [25], which cuts the matrices into small blocks that fit into the L1/L2 cache of the CPU, achieving high performance while keeping memory usage bounded regardless of the input size. The multiplication of two matrices can be divided into multiplications of small blocks. The approach still follows the matrix multiplication formulation, but instead of generating the whole intermediate matrix, it materializes only one small block of the input and kernel matrices at a time, so the memory usage is limited no matter how large the input and kernel are. Next, we show the code:

  int mat_size = kernel_cri * output_crb;
  if (mat_size / kernel_cri == output_crb && mat_size < IM2COL_THRESHOLD) {
    // if generated input matrix is small enough, use im2col implementation
  }
  int mc = output_crb;
  int kc = kernel_cri;
  int nc = out_channel;
  compute_block_sizes(&kc, &nc, &mc, sizeof(TYPE));

A suitable implementation is chosen depending on the input size. Here, we use the size of the would-be intermediate matrix to decide whether we need the memory-efficient implementation. If it is sufficiently small, we use the previous im2col implementation, which remains straightforward and fast for small inputs. Otherwise, we compute suitable small block sizes as in [25].

#if defined(AVX_PSIZE) && defined(_WIN32)
  int fast_flag = (in_channel % AVX_PSIZE == 0);
  TYPE *temp_mk = _aligned_malloc(mc * kc * sizeof(TYPE), ALIGN_SIZE);
  if (temp_mk == NULL) exit(1);
#elif defined(AVX_PSIZE)
  int fast_flag = (in_channel % AVX_PSIZE == 0);
  TYPE *temp_mk = NULL;
  if (posix_memalign((void**) &temp_mk, ALIGN_SIZE, mc * kc * sizeof(TYPE)))
    exit(1);
#else
  TYPE *temp_mk = (TYPE *) calloc(mc * kc, sizeof(TYPE));
  if (temp_mk == NULL) exit(1);
#endif
TYPE *temp_kn = (TYPE *) calloc(nc * kc, sizeof(TYPE));
if (temp_kn == NULL) exit(1);
TYPE *temp_mn = (TYPE *) calloc(mc * nc, sizeof(TYPE));
if (temp_mn == NULL) exit(1);

To further improve performance, we use SIMD intrinsics when filling the temporary matrix from the input ndarray. Depending on whether the number of input channels is divisible by the SIMD width AVX_PSIZE (e.g., 8 floats for AVX), we provide two implementations for filling the temporary blocks. We then allocate space for the small blocks so that they fit into the cache, as shown above.

  ...
  for (int m = 0; m < output_crb; m += mc) {
    int actual_mc = fminf(m + mc, output_crb) - m;
    for (int k = 0; k < kernel_cri; k += kc) {
      memset(temp_mk, 0, mc * kc * sizeof(TYPE));
      int actual_kc = fminf(k + kc, kernel_cri) - k;
      #ifdef AVX_PSIZE
      int kc_strip = (actual_kc / AVX_PSIZE) * AVX_PSIZE;
      #endif
      // iterate along each row of the generated input matrix.
      ...
      int idx_kn_base = k * out_channel;
      for (int n = 0; n < out_channel; n += nc) {
        int actual_nc = fminf(n + nc, out_channel) - n;
        idx_kn_base += n;
        // fill in the kernel matrix
        int cnk = 0;
        for (int ik = 0; ik < actual_kc; ik++) {
          for (int jn = 0; jn < actual_nc; jn++) {
            int index_kn = idx_kn_base + ik * out_channel + jn;
            temp_kn[cnk++] = kernel_ptr[index_kn];
          }
        }
        GEMM(CblasRowMajor, CblasNoTrans, CblasNoTrans,
          actual_mc, actual_nc, actual_kc, ALPHA,
          temp_mk, actual_kc, temp_kn, actual_nc,
          BETA, temp_mn, actual_nc);
        int cmn = 0;
        for (int ix = 0; ix < actual_mc; ix++) {
          for (int iy = 0; iy < actual_nc; iy++) {
            int index_mn = (ix + m) * out_channel + (iy + n);
            output_ptr[index_mn] += temp_mn[cmn++];
          }
        }
      }
    }
  }
  free(temp_mk);
  free(temp_kn);
  free(temp_mn);
  return Val_unit;
}

The code then follows a pattern similar to the previous method, filling in the input and kernel matrices and multiplying them to produce the output, except that both steps need finer control to keep the matrices small enough to fit into the cache. Specifically, here is the code that builds the input matrix:

  int cmk = 0;
  for (int im = 0; im < actual_mc; im += 1) {
    int b  = (m + im) / output_cr;
    int cr = (m + im) - b * output_cr;
    int c = cr / output_rows;
    int r = cr - c * output_rows;
    const int cstart = c * col_stride - pc;
    const int rstart = r * row_stride - pr;
    const int idx_base = b * input_cri;
    // fill in the sub input matrix
#ifdef AVX_PSIZE
    if (fast_flag) {
      ACX_FUN_LOAD (load_sub_matrix_fast, spatial) (
        input_ptr, temp_mk, &cmk, kc_strip, k, kernel_ri, input_ri,
        in_channel, idx_base, cstart, rstart, input_cols, input_rows, 0);
    }
    else {
      ACX_FUN_LOAD (load_sub_matrix, spatial) (
        input_ptr, temp_mk, &cmk, kc_strip, actual_kc,
        k, kernel_ri, input_ri, in_channel, idx_base,
        cstart, rstart, input_cols, input_rows, kernel_rows, 0);
    }
#else
    for (int ik = 0; ik < actual_kc; ik += 1) {
      int kc  = (k + ik) / kernel_ri;
      int kri = (k + ik) - kc * kernel_ri;
      int kr  = kri / in_channel;
      int ki  = kri - kr * in_channel;
      int input_col = kc + cstart;
      int input_row = kr + rstart;
      if (input_col < input_cols && input_col >= 0 &&
        input_row < input_rows && input_row >= 0) {
        int input_index = idx_base + input_col * input_ri
          + input_row * in_channel + ki;
        temp_mk[cmk] = input_ptr[input_index];
      }
      cmk++;
    }
#endif
  }

To maximize the benefit of caching, we make memory accesses as consecutive as possible. If the number of input channels is divisible by the SIMD width, the input matrix can always be loaded consecutively, one SIMD vector at a time, with the AVX intrinsics; otherwise, we build the temporary matrix blocks with AVX intrinsics on only part of the matrix and handle the edge cases separately. When loading data from the input ndarrays into these matrix blocks, we also use AVX intrinsics such as _mm256_load_ps to improve performance.

#define FUN_NATIVE(dim) stub_float32_ndarray_conv ## _ ## dim ## _ ## native
#define FUN_BYTE(dim)   stub_float32_ndarray_conv ## _ ## dim ## _ ## bytecode
#define TYPE float
#define INIT
#define ALPHA 1.
#define BETA 0.
#define GEMM cblas_sgemm
#ifdef OWL_AVX
  #define AVX_PSIZE 8
  #define AVX_TYPE __m256
  #define ACX_FUN_LOAD(prefix, dim) prefix ## _ ## float32 ## _ ## dim
  #define AVX_STOREA _mm256_store_ps
  #define AVX_STOREU _mm256_storeu_ps
  #define AVX_LOADA  _mm256_load_ps
  #define AVX_LOADU  _mm256_loadu_ps
  #define AVX_ADD    _mm256_add_ps
#endif
#include "owl_ndarray_conv_impl.h"
#undef GEMM
#undef BETA
#undef ALPHA
#undef INIT
#undef TYPE
#undef FUN_BYTE
#undef FUN_NATIVE
#ifdef OWL_AVX
  #undef AVX_PSIZE
  #undef AVX_TYPE
  #undef ACX_FUN_LOAD
  #undef AVX_STOREA
  #undef AVX_STOREU
  #undef AVX_LOADA
  #undef AVX_LOADU
  #undef AVX_ADD
#endif

Finally, we reduce the implementation complexity by applying templates, abstracting away elements such as function names, the SIMD functions to be used, GEMM routines, etc. These implementations can then be easily extended to the three-dimensional and one-dimensional cases. Besides, the transpose and dilated convolutions are only variations of the normal convolution, and the code needs only slight changes to support them. Above this C implementation level, mutable convolution operations are also provided, further improving performance by reusing existing memory.

2.6 Automated Empirical Optimization of Software

We have introduced various optimization techniques so far, and together they form a very large optimization space. For example, we can apply both multicore and SIMD parallelism while also using certain cache techniques. There are many different methods to consider, each with numerous parameters to tune, and sometimes these techniques even conflict with each other, for example by competing for processor resources. Worse still, given the heterogeneity of hardware, an "optimal" configuration on one machine may lead to degraded performance on another. It is thus of utmost importance for a numerical library to provide good performance on all the devices it supports. For example, ATLAS and the Intel Math Kernel Library both provide optimized mathematical routines for science and engineering computation and are widely used in popular high-level platforms such as MATLAB and TensorFlow. One of the reasons such libraries can deliver good performance is that they have adopted the paradigm of Automated Empirical Optimization of Software, or AEOS: the library chooses the best method and parameters to use on a given platform for a required operation. A highly optimized routine may run much faster than a naively coded one, but optimized code is usually platform and hardware specific, so a routine optimized for one machine could perform badly on another. Though Owl currently does not plan to improve the low-level libraries it depends on, one ideal tuning point for an initial application of the AEOS paradigm in Owl is the set of OpenMP parameters it uses.

Figure 2-9 Parallel execution of the sin operation on an ndarray using OpenMP: the OpenMP version takes less execution time than the non-OpenMP version

Figure 2-10 Comparison of the behavior of abs and sin when using OpenMP: execution time (ms) versus input array size for each operation

Currently, many computers contain shared-memory multiprocessors. OpenMP is used in key operations in libraries such as Eigen and MKL, and Owl also utilizes it in many mathematical operations to boost their performance by threading the calculation. For example, Figure 2-9 shows that, when we apply the sine function to an ndarray in Owl on a MacBook with a 4-core CPU, the OpenMP version takes only about a third of the execution time of the non-OpenMP version.
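As a rough illustration of how such an element-wise operation can be threaded, the sketch below (not Owl's actual stub code; the function name is made up) maps sinf over a float array and lets an OpenMP pragma split the loop across threads. Since every element is independent, no locking is needed; compile with -fopenmp.

#include <math.h>

/* Apply sin element-wise, splitting the iteration space across threads.
   Illustrative only: Owl's real C stubs operate on Bigarray data and
   follow their own naming and threshold conventions. */
void sin_map_omp(const float *x, float *y, long n) {
  #pragma omp parallel for schedule(static)
  for (long i = 0; i < n; i++)
    y[i] = sinf(x[i]);
}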

However, this performance improvement does not come for free. The overhead of using OpenMP comes from the time spent scheduling chunks of work to each thread, managing locks on critical sections, the startup cost of creating threads, etc. Therefore, when the input ndarray is small enough, these overheads might outweigh the benefit of threading.

What is a suitable input size for using OpenMP then? This question would be easy to answer if there were one single suitable input size threshold for every operation, but that is not the case. In a small experiment, we compare the performance of two operations, abs (absolute value) and sin, in three cases: running them without OpenMP, with two-thread OpenMP, and with four-thread OpenMP.

The result in Figure 2-10 shows that, with growing input size, the OpenMP version of the sine operation starts to outperform the non-OpenMP version at a size of less than 1,000, whereas for the abs operation that crossover point is at about 1,000,000. The complexity of math operations varies greatly, and the difference is even starker when we compare their performance on different machines. Note that both axes use a log scale, which is why a small deviation at small input array sizes looks large in the figure.

This issue becomes more complex in real applications such as DNNs, where users need to deal with operations of vastly different complexity and input sizes. Thus, one fixed threshold shared by several operations is not an ideal solution. Considering these factors, we need a fine-grained method to decide a suitable OpenMP threshold for each operation.

Toward this end, we implement the AEOS module in Owl. The idea is to add a tuning phase before compiling and installing Owl, so that each operation learns a suitable threshold parameter for deciding whether OpenMP should be used, depending on the input size. The key idea of the parameter tuning is simple. We implement two versions of each operation, one using OpenMP and the other not, and measure their execution time for various input sizes. Each measurement is repeated multiple times, and, to reduce the effect of outliers, only the values between the first and third quartiles are used. After removing outliers, regression is performed to find a suitable input size threshold. According to our initial experiments, linear regression is sufficient to estimate the OpenMP thresholds here. Since this tuning phase is executed before compiling Owl, the AEOS module is independent of Owl, and all the necessary implementation is coded separately to ensure that future changes to Owl do not affect the AEOS module itself.
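The fitting step can be sketched as follows, under the assumption that, for the sizes of interest, the execution time of both variants grows roughly linearly with input size; the function names here are illustrative, the inter-quartile outlier filtering is omitted, and Owl's actual AEOS code differs in detail.

#include <stddef.h>

/* Ordinary least-squares fit of y = slope * x + intercept. */
static void fit_line(const double *x, const double *y, size_t n,
                     double *slope, double *intercept) {
  double sx = 0, sy = 0, sxx = 0, sxy = 0;
  for (size_t i = 0; i < n; i++) {
    sx += x[i]; sy += y[i]; sxx += x[i] * x[i]; sxy += x[i] * y[i];
  }
  *slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
  *intercept = (sy - *slope * sx) / n;
}

/* Estimate the input size above which the OpenMP variant becomes faster,
   given measured times of the serial and OpenMP variants at each size. */
double estimate_threshold(const double *sizes, const double *t_serial,
                          const double *t_omp, size_t n) {
  double a1, b1, a2, b2;
  fit_line(sizes, t_serial, n, &a1, &b1);  /* serial: t = a1 * size + b1 */
  fit_line(sizes, t_omp,    n, &a2, &b2);  /* OpenMP: t = a2 * size + b2 */
  return (b2 - b1) / (a1 - a2);            /* crossover point of the two lines */
}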

The tuned parameters then need to be passed to Owl. When the OpenMP switch is turned on, the AEOS module generates a C header file which contains the definition of macros, each of which defines a threshold for one operation. When this header file is not generated, predefined default macro values are used instead. After that, Owl is compiled with this header file and uses these tuned parameters in its math operations. The tuning phase only needs to be performed once on each machine during installation.
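For instance, the generated header might look roughly like the following; the macro names and values are purely illustrative and are not the ones Owl actually emits. Inside a C stub, the threshold then simply selects between the OpenMP path and the plain path.

#include <math.h>

/* Hypothetical generated thresholds (illustrative names and values only). */
#define OWL_OMP_THRESHOLD_SIN  1000
#define OWL_OMP_THRESHOLD_ABS  1000000

/* Illustrative stub: take the OpenMP path only above the tuned threshold. */
void sin_stub(const float *x, float *y, long n) {
  if (n >= OWL_OMP_THRESHOLD_SIN) {
    #pragma omp parallel for
    for (long i = 0; i < n; i++) y[i] = sinf(x[i]);
  } else {
    for (long i = 0; i < n; i++) y[i] = sinf(x[i]);
  }
}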

The design of the AEOS module focuses on keeping tuning simple, effective, and flexible. Each operation is implemented as a single OCaml module, so that support for new operations can be easily added. The interface of such a module is shown as follows. We expect that tuning does not have to be only about OpenMP parameters and that different regression methods could be used in the future. For example, the Theil-Sen estimator can be plugged in for parameter estimation if necessary. In each module, arbitrary tuning procedures can be plugged in as long as the interface is satisfied.

module Sin : sig
  (** Tuner type definition. *)
  type t = {
    mutable name  : string;
    mutable param : string;
    mutable value : int;
    mutable input : int array array;
    mutable y     : float array;
  }

  val make : unit -> t
  (** Create the tuner. *)

  val tune : t -> unit
  (** Tuning process. *)

  val save_data : t -> unit
  (** Save tuned data to a csv file for later analysis. *)

  val to_string : t -> string
  (** Convert the tuned parameter(s) to a string to be written to file. *)
end

The AEOS module is implemented in a way that interferes little with the main Owl library. The code can be viewed in this pull request and has been merged into the main branch of Owl. To try this feature, you only need to switch the ENABLE_OPENMP flag from 0 to 1 in the dune file.

To evaluate the performance of the tuned OpenMP thresholds, we need a metric to compare them. We propose the following metric for comparing two thresholds. We generate a series of ndarrays whose sizes grow by a fixed step until they reach a given maximum, for example, the 1,000,000 used in the following experiment. Note that only input sizes that fall between the two thresholds are chosen to be used. We then calculate the performance improvement ratio of the OpenMP version of the function over the non-OpenMP version on these chosen ndarrays. The ratios are summed and then averaged over the total number of ndarrays. Hereafter, we use this averaged ratio as the performance metric.
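Expressed as code, the metric is simply the mean of the per-size execution-time ratios; the helper below is illustrative and not part of Owl, and the orientation of the ratio (which variant goes in the numerator) is a matter of convention.

#include <stddef.h>

/* Averaged execution-time ratio between two variants of an operation over
   the chosen ndarray sizes; a value below 1 (i.e., below 100 percent) means
   the first variant is faster on average. */
double avg_time_ratio(const double *t_a, const double *t_b, size_t n) {
  double sum = 0.0;
  for (size_t i = 0; i < n; i++)
    sum += t_a[i] / t_b[i];
  return sum / (double) n;
}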

Table 2-1 Tuned Parameters Using the AEOS Module
Figure 2-11 Improvement of the square root operation after applying parameter tuning: each bar is the ratio between the execution time under the tuned threshold and that under a fixed threshold

Table 2-1 presents the tuned threshold values of five operations on a MacBook with a 1.1GHz Intel Core m3 CPU and on a Raspberry Pi 3B. We can see that they vary across operations and across machines, depending on their computational complexity. For example, on the MacBook the tuned threshold for square root is "max_int", which means that for this relatively simple calculation OpenMP should not be used at all, whereas that is not the case on the Raspberry Pi. Also note that the less powerful Raspberry Pi tends to get lower thresholds.

We also compare the tuned thresholds with a series of regular thresholds. Specifically, for each operation, we choose ten different thresholds at a fixed interval: 0, 100,000, 200,000, ..., 900,000. For each such threshold, we use 100 ndarray sizes, also evenly spaced, between 0 and 1E6. The execution times on ndarrays of these sizes under each fixed threshold are then compared with those under the tuned threshold, and the element-wise ratios between the two arrays of timings are plotted as a bar plot for each threshold. For example, the comparison for the square root operation on the MacBook is shown in Figure 2-11. Here, each bar indicates the ratio between the tuned and the chosen fixed threshold: one hundred percent means the two have the same effect on performance, and a lower percentage means the tuned threshold leads to a faster execution time. The figure shows that, regardless of the choice of fixed threshold, the tuned parameter always leads to similar or better execution times for the operations in the AEOS module.

Note that we cannot claim the tuned parameters are always optimal: the figure shows a few rare cases, where the improvement percentage is negative, in which the fixed threshold actually performs better. Still, the results suggest that AEOS can provide a certain bound, albeit a loose one, on the performance improvement, regardless of the type of operation.

2.7 Summary

In this chapter, we focused on the optimization of core ndarray operations in Owl. We started by introducing the Ndarray module and its pivotal role in a numerical library, and then described how the OCaml code interfaces to C. The rest of the chapter mostly focused on optimizations at the C level. As important background, we explained the principles of optimizing scientific computing code, such as exploiting processor parallelism and cache locality, and briefly introduced some techniques based on these principles. As an example, we demonstrated how some of them are applied to optimize one of the most important operations in deep neural networks: convolution. Finally, we briefly introduced the automated tuning approach to optimizing library performance across various platforms, using multicore parallel computing in Owl as an example.