In Chapter 2 we looked at how the Intel compiler may be used to build code for the Intel Xeon Phi coprocessor. This chapter will go more deeply and widely into the tools available for development on Intel Xeon Phi coprocessor. However, Intel tools have a lot of features that are outside of the scope of this book, which only focuses on the features relevant to Xeon Phi development. For a general introduction to the tools, please refer to the documentation installed with the tools themselves. The tools can be divided into four broad categories:

  1. 1.

    Development tools that let you build the code you write for Intel Xeon Phi and Intel Xeon processors and to debug issues with serial and parallel codes

  2. 2.

    Profiling and performance analysis tools that help you optimize the code you have written

  3. 3.

    Libraries that help you perform scientific computing efficiently on Xeon Phi

  4. 4.

    Cluster tools that allow you to run the applications on more than one node or within a node using message passing interface or other parallel methods

The following sections look at only a subset of the tools that are supported on Xeon Phi. Some of the tools such as Intel Inspector XE for checking multithreaded program correctness and performance analysis have not yet been ported to Xeon Phi as of time of writing.

The Application Development Tools

The application development tools allow the developer to compile and debug code for running programs on Intel Xeon Phi. Intel provides a compiler for the following languages that are relevant to Xeon Phi development. The version of the Intel compiler shipping as part of Intel Composer XE 2013 supports the following languages:

  1. 1.

    C/C++

    1. a.

      With offload extensions that includes upcoming OpenMP 4.0 specification.

    2. b.

      Intel proprietary offload extensions. Although there are a lot of commonalities between OpenMP 4.0 specification and the proprietary extensions, there are some useful features in these extensions that may not be included in the OpenMP 4.0 specifications or may be available later than the features already supported for the Intel compiler. These Intel compiler features are covered in this chapter.

    3. c.

      Thread Parallel programming support through Intel Cilk Plus, Intel Threading Building Blocks, OpenMP, and pthread.

  2. 2.

    Fortran

    1. d.

      OpenMP 4.0 and Intel proprietary Offload extensions as in C++

    2. e.

      Thread Parallel Programming support through OpenMP and Fortran 2008 Parallel Constructs (Coarray and Do Concurrent)

Intel C/C++ Composer XE

The C/C++ development tool is packaged as Intel Composer XE. This package includes the compiler, debugger, and the libraries. Chapter 2 covered some introductory material on how to use these tools to build the offload and cross-compiled version of the source for Intel Xeon Phi coprocessors. This chapter details the compiler-supported syntax and its usage with respect to Intel Xeon Phi.

In addition to generating code for Intel MIC Architecture, the Intel C++ Compiler can also generate code for 32-bit Intel Architecture instruction set architecture (IA-32) and Intel 64-bit–based applications on Intel-based Linux systems. IA-32 architecture applications (32-bit) can run on all Intel-based Linux systems, including Intel 64-bit processors. Intel 64-bit architecture applications are supported on Intel 64-bit-architecture-based Linux systems but not 32-bit Intel processor architecture. The compiler can be used in command shell form or in an integrated development environment  (IDE) environment such as Eclipse.

In order to support Xeon Phi programming, the Intel compiler supports the following components.

OpenMP 4.0 and Language Extensions

The Intel compiler has defined and implemented language extensions for C/C++ and Fortran to support offloading computation to Intel Xeon Phi coprocessors. With the release of OpenMP 4.0 to perform that same functionality, the Intel compiler has also implemented this standard to allow your code to take advantage of Xeon Phi coprocessors. Chapter 2 covered OpenMP 4.0 extensions by way of introducing offload programming for Xeon Phi.

This chapter specifically looks at the Intel proprietary extensions implemented in the Intel compiler for programming Xeon Phi. These extensions include keywords to support the computation offload and data transfer between Xeon Phi and the host, which support new compiler pragmas, keywords, macros, intrinsics, class libraries, and application programming interfaces. This chapter specifically covers the extensions for C/C++, but all of these have equivalent Fortran extensions as defined in the Intel Compiler users guide.

Pragmas

pragma offload

The pragma offload pragma allows you to copy in/out data and execute a block of code specified inside the pragma. The pragma greatly simplifies programming by requiring only changes to the code segment you would like to offload. There are some considerations and declarations that need to be made to make sure this block of code and all its functions in its call tree are made available to the coprocessor. This is accomplished by declaring them to be coprocessor executable through declarations. This is supported in both C++ and Fortran compilers and is the preferable offload model when the data are exchanged between a CPU and a coprocessor, but it is limited to arrays of bit-wise copyable elements or scalar data types.

Syntax

#pragma offload specifier[ specifier...]

{

  <expressions-statements>

}

Where specifier can be:

target ( target-name [ :target-number ]) where:

  • ‘target name’ must be mic,

  • ‘target-number’ is a signed an integer value having the following interpretations:

  • -1: The runtime system selects the target. If no coprocessor is available, the program fails with an error message.

  • >=0: the code will be executed on the (coprocessor number = target-number %(modulo) total number of coprocessors). For example, if you have two coprocessors, a number ‘4’ will signify mic0 (4 modulo 2) or the first coprocessor enumerated by the runtime system and so on.

  • If the ‘target-number’ is not specified, the runtime system determines which coprocessor to use for offload and falls back to the host execution if no coprocessor is available.

if-specifier: A Boolean expression. If true, the offload takes place according to the runtime decision on where to run the code depending on coprocessor availability. If false, it runs on the host processor.

signal ( tag ): An optional integer expression that is used with asynchronous data transfers or compute operations. In this case, the offload computation and data transfer happen concurrently with host computation code after the pragma offload code block. Without this clause, the offload operation is done synchronously, that is, the host code running on the thread causing the offload must wait for the offload code to return. However, you can still have a separate thread running in parallel on the host while the offload thread waits for the coprocessor computation to complete. Note that the ‘tag’ integer expression could be a pointer value as well. Thus you can use the address of a variable as a signal for an asynchronous offload.

wait ( tag[, tag, ...] ): An optional integer expression to wait on a previously initiated asynchronous offload call with signal specifier to complete. Note that both the signal and the wait clause must be associated with a target device and they must match. Otherwise, there may be a runtime abort.

mandatory: An optional clause to specify that the execution on the coprocessor is required and cannot fall back to a host processor if the coprocessor is not available. If the correct target hardware is not available, the execution will fail with an error message.

offload-parameter [, offload-parameter,..]: Describes the data transfer directions between host processor and the coprocessor for scalar and pointer data. For pointer data, one can specify the size of array to be bitwise copied to the target device. It is in one of these forms:

  • in(variable [,variable] [modifier [,modifier]]): Input to the target; the data are flowing only in one direction from host to the coprocessor.

  • out(variable [,variable] [modifier [,modifier]]): Output from the target; the data are flowing only in one direction from coprocessor to the host.

  • inout(variable [,variable] [modifier [,modifier]]): Both input and output to and from the target; this is the default if no in or out clause is specified.

  • nocopy(variable [,variable] [modifier [,modifier]]): This parameter allows the data previously copied from a previous offload call to the coprocessor to be reused without being sent back and forth between the host and the coprocessor.

The variables in the argument of the offload-parameters can be a C/C++ identifier, an array, or a segment of an array called an array slice, which is a contiguous memory area of an array and designated by a start index and the length of the segment. It is of the form:

variable [integer expression [: integer expression]]

[modifiers]: The modifiers as described in the definition of the offload-parameter can be one of the following:

  • length(integer expression): The length specifies the number of elements to be copied from the source object pointed to by the pointer variable or variable length array to or from the offload target variable. Note that because they are in disjoint memory regions, the pointer values are completely independent of each other between the host and the coprocessor.

  • alloc_if(Boolean condition)|free_if(Boolean condition): These constructs allow data persistency on the coprocessor.

  • alloc_if controls the allocation of a new block of memory on the target when the offload is executed on the target. If the Boolean condition evaluates to true, a new memory allocation is performed for each variable listed in the clause. If false, the existing allocated memory blocks on the target are reused.

  • free_if controls the deallocation of memory allocated for the pointer variables in an in clause. If the Boolean condition is true, the memory pointed to by each variable listed in the clause is deallocated. If false, no action is taken on the memory pointed to by the variables in the list (i.e., the memory allocated on the target remains intact and can be reused in a subsequent offload call). These two modifiers work in conjunction to provide persistent memory constructs for Xeon Phi processors. If the alloc_if or free_if is omitted, the default assumes alloc_if and free_if is invoked with the Boolean condition set to true. This way each invocation allocates and frees memory on entering and exiting the offloaded block of code.

  • align(expression): This modifier applies to the pointer variable and requests runtime to allocate memory on the coprocessor to be aligned to size as computed by the integer expression. The expression must evaluate to a number that is the power of two.

  • alloc(array_slice): Array_slice is a set of elements of the array that needs allocation. The array_slice must be contiguous and of the form (start_element_index:length). Only the portion of the array specified by the in or out expression is transferred, thus reducing the transfer bandwidth requirement. When the array slice has more than one dimension, the second and subsequent index expression must specify all elements of that dimension. For example, in #pragma offload  in (data[10:100]:alloc(data[5:1000])), the modifier will allocate 1000 elements on the coprocessor in the index range 5-1004 and copy elements data[10] through data[109] to the target index location 10-109. The first usable array index on the target is 5, as specified by the data_slice expression.

  • into (var-exp): By default the variable names used in the offload parameter are the same for the CPU and coprocessor side. You can transfer data from one variable in the host processor to a different variable name in the target coprocessor. For example, using #pragma offload in(var1[10:100] : into(var2[100:100])) will copy data from the var1 array on the host side to the var2 array on the coprocessor side.

Offload Execution Process

The execution of an offload process by the compiler runtime is as follows:

  1. 1.

    If there is an ‘if,’ evaluate the if expression. If it evaluates to false, execute the region on the host CPU and you are done; if not, continue with the following steps.

  2. 2.

    Acquire the coprocessor. If the target is not available to run the offload code, execute the region on the host CPU and you are done; if the target is acquired, continue.

  3. 3.

    Evaluate alloc_if, free_if, and length expressions used in ‘in’ and ‘out’ clauses.

  4. 4.

    Gather all input values and send them to the coprocessor.

  5. 5.

    On the target coprocessor, allocate memory for pointer based on out variables.

  6. 6.

    On the target coprocessor, copy input variable values into the corresponding target variables. This includes scalar variables, arrays, and pointer data.

  7. 7.

    Execute the offloaded region on the coprocessor.

  8. 8.

    On the coprocessor, compute all length expressions used in out clauses.

  9. 9.

    On the coprocessor, gather all variable values that are outputs of the offload.

  10. 10.

    Send output values back from the target to the host CPU.

  11. 11.

    On the host, copy the output values received into the corresponding host CPU variables.

pragma offload_attribute

The offload_attribute pragma is used to declare code or data sections that need to be available on the coprocessor. This goes hand in hand with the ‘pragma offload’ to be used for function and variable declarations. It specifies that all functions and variables declared following this pragma are made available on the coprocessor by generating and allocating appropriate functions codes and data on the coprocessor.

You can also use declspec to define the variables and functions that should be available on the Xeon Phi coprocessor, but that requires you to use that for each individual variable or function declaration. However, with the pragma offload_attribute pragma given below, you can declare a block of variables or functions to be available on the coprocessor environment.

Syntax

#pragma offload_attribute([push, ] target(target-name))

<declarations>|<definitions>

#pragma offload_attribute(pop|{target(none)})

Where:

  • ‘target name’ must be mic

  • push: All function and variable declarations and definitions are targeted for Intel MIC architecture until the statement #pragma offload_attribute(pop|{target(none)}) or end of compilation unit is reached.

  • pop/target(none): Turns off pragma.

The following example shows how to declare a block of functions and variables so that the compiler can have them available to codes executing on the coprocessor:

#pragma offload_attribute (push, target(mic))

<function/variable declarations>

#pragma offload_attribute (pop)

pragma offload_transfer and pragma offload_wait

The offload _transfer and offload_wait pragmas are used to perform a synchronous or asynchronous data transfer between the host and Xeon Phi coprocessor. Offload_wait is specifically used for asynchronous data transfer and waits for any previously initiated transfer to complete. This clause is designed to control data transfer between the host and the coprocessor; as such the clause does not have any execution code block as was provided with pragma offload.

Syntax

#pragma offload_transfer specifier[ specifier...]

#pragma offload_wait specifier[, specifier...]

Where the ‘specifier’ is the same as defined for ‘pragma offload’ above and reproduced for easier reference.

target ( target-name [ :target-number ]) where:

  • ‘target name’ must be mic,

  • ‘target-number’ is a signed integer value having the following interpretations:

  • -1: The runtime system selects the target. If no coprocessor is available, the program fails with an error message.

  • >=0: The code will be executed on the (coprocessor number = target-number %(modulo) total number of coprocessors). For example, if you have two coprocessors, the number ‘4’ will signify mic0 (4 modulo 2) or the first coprocessor enumerated by the runtime system and so on.

  • If the ‘target-number’ is not specified, the runtime system determines which coprocessor to use for offload and falls back to the host execution if no coprocessor is available.

if-specifier: A Boolean expression. If true, the offload takes place according to the runtime decision on where to run the code depending on coprocessor availability. If false, it runs on the host processor.

signal ( tag ): An optional integer expression that is used with asynchronous data transfers or compute operations. In this case, the offload computation and data transfer happen concurrently with the host computation code after the pragma offload code block. Without this clause, the offload operation is done synchronously, that is the host code running on the thread causing offload must wait for the offload code to return. However, you can still have a separate thread running in parallel on the host while the offload thread waits for the coprocessor computation to complete. Note that the ‘tag’ integer expression could be a pointer value as well. Thus you can use the address of a variable as a signal for asynchronous offload.

wait ( tag[, tag, ...] ): An optional integer expression to wait for a previously initiated asynchronous offload call with a signal specifier to complete. Note that both the signal and the wait clause must be associated with a target device and they must match, otherwise, there may be a runtime abort.

mandatory: An optional clause to specify that the execution on the coprocessor is required and cannot fall back to the host processor if the coprocessor is not available. If the correct target hardware is not available, the execution will fail with an error message.

offload-parameter [, offload-parameter,..]: Describes the data transfer directions between the host processor and the coprocessor for scalar and pointer data. For pointer data, you can specify the size of the array to be bitwise copied to the target device. It is in one of these forms:

  • in(variable [,variable] [modifier [,modifier]]): Input to the target. The data are flowing only in one direction from host to the coprocessor.

  • out(variable [,variable] [modifier [,modifier]]): Output from the target. The data are flowing only in one direction from coprocessor to the host.

  • inout(variable [,variable] [modifier [,modifier]]): Both input and output to and from the target. This is the default if no in or out clause is specified.

  • nocopy(variable [,variable] [modifier [,modifier]]): This parameter allows the data previously copied from a previous offload call to the coprocessor to be reused without being sent back and forth between the host and the coprocessor.

The variables in the argument of the offload parameters can be a C/C++ identifier, an array, or a segment of an array called an array slice, which is a contiguous memory area of an array and designated by a start index and the length of the segment.

variable [integer expression [: integer expression]]

[modifiers] The modifiers as described in the definition of the offload parameter can be one of the following:

  • length(integer expression): The length specifies the number of elements to be copied from the source object pointed to by the pointer variable or variable length array to or from the offload target variable. Note that because they are in disjoint memory regions, the pointer values are completely independent of one another between the host and the coprocessor.

  • alloc_if(Boolean condition)|free_if(Boolean condition): These constructs allow data persistency on the coprocessor.

  • alloc_if controls the allocation of a new block of memory on the target when the offload is executed on the target. If the Boolean condition evaluates to true, a new memory allocation is performed for each variable listed in the clause. If false, the existing allocated memory blocks on the target are reused.

  • free_if controls the deallocation of memory allocated for the pointer variables in an in clause. If the Boolean condition is true, the memory pointed to by each variable listed in the clause is deallocated. If false, no action is taken on the memory pointed to by the variables in the list (i.e., the memory allocated on the target remains intact and can be reused in a subsequent offload call). These two modifiers work in conjunction to provide persistent memory constructs for Xeon Phi processors. If the alloc_if or free_if is omitted, the default assumes alloc_if and free_if is invoked with Boolean condition set to true. This way each invocation allocates and frees memory on entering and exiting the offloaded block of code.

  • align(expression): This modifier applies to pointer variables and requests runtime to allocate memory on the coprocessor to be aligned to the size computed by the integer expression. The expression must evaluate to a number that is the power of two.

  • alloc(array_slice): array_slice is a set of elements of the array that needs allocation. The array_slice must be contiguous and of the form (start_element_index:length). Only the portion of the array specified by the in or out expression is transferred, thus reducing the transfer bandwidth requirement. When the array slice has more than one dimension, the second and subsequent index expressions must specify all of the elements of that dimension. For example, the #pragma offload in (data[10:100]:alloc(data[5:1000])) modifier will allocate 1000 elements on the coprocessor in the index range 5-1004 and copy elements data[10] through data[109] to the target index location 10-109. The first usable array index on the target is 5, as specified by the data_slice expression.

  • into (var-exp): By default the variable names used in the offload parameter are the same for the CPU and coprocessor side. You can transfer data from one variable in the host processor to a different variable name in the target coprocessor. For example, using #pragma offload in(var1[10:100] : into(var2[100:100])) will copy data from the var1 array on the host side to the var2 array on the coprocessor side.

For pragma offload_wait, only the following specifiers are valid:

    target ( target-name [ :target-number ])

    if ( if-specifier )

    wait ( tag[, tag ...] )

Asynchronous Data Transfer Over PCI Express

The attribute offload_transfer with the implementation of a signal provides a way to transfer data asynchronously. The tag in the signal specifier is associated with the data to be transferred. The complementary wait specifier usually follows the some host computation and causes code execution at that point to wait for the data transfer initiated with signal clause to complete.

The wait clause must be executed after the signal clause specific transfer is initiated and must be enforced if the wait call is executed in a different thread. Otherwise, a runtime program abort will happen.

The source code in Listing 8-1 shows how to perform an asynchronous data transfer by using the offload_transfer pragma. To test the asynchronous transfer, we allocate 64MB of memory (memory pointer a) on the host at line 51. We also preallocate data on the coprocessor at line 60-63 so that the transfer time does not include the data allocation time. Listing 8-2 shows the output of a run of this code on a host with the Xeon Phi processor.

Listing 8-1. An example of asynchronous data transfer using Intel compiler offload pragmas

34. #include <stdio.h>

35. #include <stdlib.h>

36. #include <omp.h>

37.

38. //Define number of floats for 64 MB data transfer

39. #define SIZE (64*1000*1000/sizeof(float))

40. #define ITER    1

41. // set cache line size alignment

42. #define ALIGN   (64)

43. __declspec(target(MIC)) static float  *a;

44. extern double elapsedTime (void);

45. int main()

46. {

47.         double startTime,  duration;

48.         int i, j;

49.

50.         //allocate a

51.         a = (float*)_mm_malloc(SIZE*sizeof(float),ALIGN);

52.

53.         //initialize arrays

54.         #pragma omp parallel for

55.         for (i=0; i<SIZE;i++)

56.         {

57.                 a[i]=(float)1.0f;

58.         }

59.         // Allocate memory on the card

60.         #pragma offload_transfer target(mic:0) \

61.           in(a:length(SIZE) free_if(0) alloc_if(1) align(ALIGN) )

62.

63.

64.         // test synchronous transfer time

65.         startTime = elapsedTime();

66.             //transfer data over the PCI express bus

67.            #pragma offload_transfer target(mic:0) \

68.             in(a:length(SIZE) free_if(0) alloc_if(0) align(ALIGN) )

69.

70.         duration = elapsedTime() - startTime;

71.         printf("Synchronous data transfer time %lf milliseconds \n",duration*1000);

72.         // test asynchronous transfer time

73.         startTime = elapsedTime();

74.             //transfer data over the PCI express bus

75.     #pragma offload_transfer target(mic:0) \

76.     in(a:length(SIZE) free_if(0) alloc_if(0) align(ALIGN) ) signal(a)

77.

78.         duration = elapsedTime() - startTime;

79.         printf("Asynchronous data transfer time after start of the transfer %lf milliseconds \n",duration*1000);

80.         // test asynchronous transfer time

81.         startTime = elapsedTime();

82.              #pragma offload_wait target(mic:0) wait(a)

83.

84.         duration += elapsedTime() - startTime;

85.         printf("Asynchronous data transfer time with wait clause %lf milliseconds \n",duration*1000);

86.         // free memory on the card

87.         #pragma offload_transfer target(mic:0) \

88.                 in(a:length(SIZE) alloc_if(0) free_if(1) )

89.

90.

91. //free the host system memory

92.         _mm_free(a);

93.         double GB = SIZE*sizeof(float)/(1000.0*1000.0*1000.0);

94.         double GBps = ITER*GB/duration;

95.         printf("SP ArraySize =  %0.4lf MB, ALIGN=%dB, PCIe Data transfer bandwidth Host->Device  GB/s = %0.2lf\n", GB*1000.0, ALIGN,  GBps);

96.   return 0;

}

Listing 8-2. Output from the Code Running on a Host with Xeon Phi Coprocessor

  • ./pciebw.out

Synchronous data transfer time 9.397984 milliseconds

Asynchronous data transfer time after start of the transfer 2.367973 milliseconds

Asynchronous data transfer time with wait clause 9.452820 milliseconds

SP ArraySize =  64.0000 MB, ALIGN=64B, PCIe Data transfer bandwidth     Host -> Device  GB/s = 6.77

At lines 67 through 69 of the source code, the code performs a synchronous transfer by invoking ‘offload_transfer’ without the signal command:

#pragma offload_transfer target(mic:0) \

   in(a:length(SIZE) free_if(0) alloc_if(0) align(ALIGN) )

The output shows that the synchronous transfer took approximately 9.39 milliseconds.

Then the code performs an asynchronous transfer at lines 75 and 76 by invoking ‘offload_transfer’ with a signal(a) clause. This causes the code to start the transfer and return almost immediately. It takes approximately 2.36 milliseconds to do this, as shown in Listing 8-2. Finally, the code waits for the transfer to complete with pragma offload_wait with the wait(a) clause in line 82. We see from Listing 8-2 that the total time to start the asynch transfer plus wait time is 9.45 milliseconds, close to the synchronous transfer time. You can do some computing on a host on the same thread in between the start of the transfer and the offload_wait pragmas.

Keywords

There are two keywords: _Cilk_offload and _Cilk_shared. _Cilk_offload allows you to offload a function to a card. You can use _Cilk_offload with _Cilk_spawn to perform an asynchronous offload. What if you want to execute a loop on the Xeon Phi rather than a function? In order to run a loop rather than a function, you can use _Cilk_offload with the _Cilk_for construct. The _Cilk_shared clause is used to declare the functions or data to be shared between the host processor and the coprocessor. For example, before calling _Cilk_offload to offload a function, you need to declare them with a _Cilk_shared clause. The memory declared with _Cilk_shared is at the same virtual address on the host and coprocessor.

I will not be covering syntax of these keywords here, but you can find those details in the Intel C++ Compiler XE users and reference guides installed as part of the Intel Composer XE package.

Using Shared Virtual Memory

We have seen in pragma offload section above that the data need to be bitwise copyable in order to be shared between the host and the coprocessor. However, Intel Compiler C++ runtime library provides a shared virtual memory model where the data do not need to be bitwise copyable. For example, it can use pointers or composition of ‘C’ structures and pointers. Pointers to a shared variable have the same value, although they point to different virtual address spaces on the host processor and the coprocessor. This allows offload code to work on the linked data structures as the pointer values are preserved in their respective disjoint virtual address space. Here the runtime maintains the data consistencies between the copies residing on the host and the coprocessor(s).

You need to use the _Cilk_shared clause for declaring such data. There are also library functions, _Offload_shared_malloc(), _Offload_shared_aligned_malloc(), _Offload_shared_free(), and _Offload_shared_aligned_free() that allow you to allocate space for such shared objects. The data in the shared virtual address are synchronized between the host and the coprocessor at the following points during program executions:

  1. 1.

    When the offload function is invoked on the host and upon entering the offload function on the target coprocessor.

  2. 2.

    On return from the offload execution call to the host processor.

No other synchronization point exists, so simultaneous access to the shared memory location outside these sync points creates a race condition.

Valid Use of the Keywords

The following are some of the rules for using the _Cilk_shared and _Cilk_offload keywords:

  • Apply them to C++ classes. In this case all the member functions of the class and the objects instantiated of that class are shared.

  • Apply them to static fields of a class.

  • Assign an address of a shared variable to a shared pointer.

  • The functions called directly or indirectly by _Cilk_offload must be declared as _Cilk_shared or pointer-to-shared.

  • Pointer arguments passed to an offload function must be pointer-to-shared.

  • Global variables and functions referenced within _Cilk_offload must be marked _Cilk_shared.

  • _Cilk_shared cannot be used on a field in a structure, a static variable, or a local variable.

Macros

The functions or code fragment of the offload functions can be designated to be compiled for Xeon Phi or MIC coprocessor only. This is useful when you want to use some intrinsics or assembly instructions that are only valid for the Xeon Phi instruction set and will not compile for the host. This is done through the macro __MIC__ recognized by the compiler and designated as the part of the code that should be built for Xeon Phi only. Do not use it inside a #pragma offload statement, but rather inside a function offloaded to the Xeon Phi card. The reason for not using it inside a #pragma offload is that the Intel compiler goes through two different phases while compiling the source code. In the first phase, it compiles for the host, and in the second phase, it compiles for Xeon Phi. In some cases where a variable is defined on the host side, it is only used inside an offload region bracketed by the __MIC__ macro and is not explicitly passed by the pragma offload clause. The variable will not be sent in when the code is compiled for the host. This is because the __MIC__ macro is undefined in this phase and removes the code that uses this variable. As a result, the host side code will not send in the variable. However, the code version created for Xeon Phi will need this variable, which is not available on the coprocessor and will cause the runtime abort as the variable is not sent in by the code running on the host.

Listing 8-3 shows the effect of using the __MIC__ macro inside an offload region directly. Listing 8-4 shows the host version of the code in Listing 8-3. Listing 8-5 shows the coprocessor version of the code in Listing 8-3. Listing 8-6 presents the output for the code in Listing 8-3.

Listing 8-3. Wrong Use of the __MIC__ Macro

int main()

{

   printf("y = %d\n", f());

}

int f()

{

  int x=1, y=3;

  #pragma offload target(mic)

  {

      #ifdef __MIC__

          y = x;

      #endif

          y++;

  }

return y;

}

Listing 8-4. Host Version of Listing 8-3

//host version of f()

int f()

{

  int x, y;

  #pragma offload target(mic)

  {

          y++; // only sends in Y variable, x is not sent in.

  }

}

Listing 8-5. Coprocessor Version of the Code in Listing 8-3

// coprocessor version of f()

int f()

{

  int x, y;

  #pragma offload target(mic)

  {

       y = x; // Causes error as x is not sent in from the host side

  }

}

// only sends in Y variable, x is not sent in.

Listing 8-6. Ouput of the Code in Listing 8-3

offload error: unexpected number of variable descriptors

offload error: process on the device 0 unexpectedly exited with code 1

This macro is only defined by the compiler while compiling coprocessors code. It is undefined when you use the '-no-offload' compile switch. If you use the '-mmic' command line option, which instructs the compiler to build for the coprocessor only, this macro is defined. This macro will be true for all future MIC architecture. If you want to build for Intel Xeon Phi architecture but not for any future architecture, you may want to use the __KNC__ macro. The __INTEL_OFFLOAD macro is used when the code should not be built when the target is host processor only executable. It is defined by the compiler when building offload code for the host processor and the coprocessor but undefined when you use the '-mmic' or '–no-offload' compile switch.

Intrinsics

Intrinsics allow you to use C/C++ functions and variables for readability instead of assembly codes in your source code. These are necessary where there are no corresponding C/C++ language constructs to perform the same work as intended. When used inside a code, they are inlined by the compiler, thus removing the function call overhead. New intrinsics that correspond to Xeon Phi coprocessors instruction set (Intel Initial Many Core Instructions) have been defined on top of existing intrinsics implemented in Intel Compilers. These intrinsics for Xeon Phi architecture enable you to use the vector processing capabilities of Xeon Phi processors.

There are two versions of these intrinsics: one for the nonmasked and the other for masked operations of the corresponding vector operations. The intrinsics are defined in immintrin.h and need to be included in your source file where you want to use them.

Xeon Phi intrinsics operate on vectors. There are 32 vectors (v0-v31), each having 512 bits corresponding to underlying hardware registers of Xeon Phi coprocessor. These vectors are represented by the new proprietary __m512 (single precision vector), __m512i(int32/int64 vector), and __m512d (double precision vector) data types. The compiler aligns __m512 local and global data types to 64-byte boundaries on the stack. Note that you cannot operate these data types through arithmetic instructions such as +, –, and so forth. You must use these on either side of an assignment, a return value, or as a parameter in a statement. These can only be used with the intrinsics defined for these data types. You can use the data type as a union to access individual elements.

These instructions operate on the same memory address space as the standard Intel 64-bit instructions, using the vectors for packed data sources and results. Vector mask support is provided by eight vector mask registers and allows conditional execution over 16-SP or 8-DP elements in a vector instruction.

Intel compiler also provides a set of intrinsics for Xeon Phi coprocessors that are vector variants of the corresponding scalar math operations. These are referred to as the Short Vector Math Library (SVML). They take vector elements and perform scalar math operations on each element of the source vectors. The result of the operation is returned in a vector. The supported SVML operations include the following for both single precision, double precision, and integers where applicable:

  • Division operations: quotient of a division, reminder of division operation

  • Error function operations: inverse cumulative distribution, error functions, complementary error function, inverse error function

  • Exponential operations: Exp10, Exp2, Expe, exponential value of one argument raised to another

  • Logarithmic operations: Log10, Log2, Log-natural logarithm, calculate the signed exponent

  • Rounding operations: ceiling (ceil), floor, round off to nearby integer, round off to nearest even integer, truncate to nearest integer not larger in the absolute value

  • Square root and cube root operations: sqrt, invsqrt, hypotenuse, cuberoot

  • Trigonometric operations: various categories of sines, cosines, tans, and their inverses

Please refer to the Intel C++ Compiler XE users and reference guide provided with the Intel Composer XE package for details on the available intrinsics.

C++ Class Libraries

In addition to the intrinsics defined above, C++ class libraries have been defined to abstract the 512-bit vector operations on Xeon Phi for several math routines. These libraries provide similar functionality to that provided by Intel C++ libraries for SSE2 instructions on Intel Xeon processors. The following C++ classes are implemented to support operations on Xeon Phi vectors objects and declared in the micvec.h header file:

  • F64vec8: 8 elements of 64-bit signed double precision vector class

  • I64vec8: 8 elements of 64-bit-long integer vector class

  • F32vec16: 16 elements of 32-bit single precision float vector class

  • I32vec16: 16 elements of 32-bit integer vector class

  • Is32vec16: 16 elements of signed 32-bit integer vector class

  • Iu32vec16: 16 elements of unsigned 32-bit integer vector class

For example, if you want to add two 16-element 32-bit SP vectors and generate a 16-element 32-bit vector, you can use the following code:

F32vec16 A,B,C;

// initialize the vector elements

C = A + B; // results addition of A and B to get the results in C

Application Programming Interfaces

Intel Compiler implements a set of application programming interfaces (APIs), which are equivalent to host processor APIs but targeted toward the coprocessor and can be executed inside the code running on the host to set the coprocessor runtime environment. Each of these APIs take two additional arguments: target_type, which can be TARGET_NONE, TARGET_MIC, or TARGET_HOST, and target_number, which is a signed integer and interpreted as defined in the pragma offload section earlier in this chapter. You can set the target_type to DEFAULT_TARGET_TYPE, which is set to TARGET_MIC, and target_number to DEFAULT_TARGET_NUMBER, which is set to 0. For example, omp_set_num_threads(num_threads) is an OpenMP API to set the number of execution threads for OpenMP applications. There is an equivalent API for Xeon Phi called omp_set_num_threads_target(target_type, target_number, num_threads) to set the number of coprocessor threads.

A set of offload APIs provides control and memory allocation during runtime on the coprocessors. These APIs include _Offload_number_of_devices, _Offload_get_device_number, _Offload_get_physical_device_number, _Offload_shared_malloc, _Offload_shared_free, _Offload_shared_aligned_malloc, _Offload_shared_aligned_free, _Offload_signaled, _Offload_report, omp_set_device_num, and omp_get_device_num. There are many more APIs that are not covered here. These APIs are defined within the offload.h header file shipped with the Intel Compiler.

Environment Variables

Once an application is built with offload extensions and run on a host with Xeon Phi coprocessor, its behavior can be controlled and logged with the help of environment variables on the host. For example, you can set the number of threads running on the coprocessor using the environment variables presented in the following sections, which may be different from what is set on the host.

MIC_ENV_PREFIX

The execution environment on the coprocessor itself can be modified in the same way it can be done on the host environment. For example, you can set OMP_NUM_THREADS to set the number of OpenMP threads to be run on the host and coprocessor. The problem may arise if you want to set the number of threads differently on the host and the coprocessor. MIC_ENV_PREFIX provides such a facility to differentiate between the host and coprocessor environment variables. By default all environment variables set on the host are passed to the coprocessor. Using MIC_ENV_PREFIX, you can set the prefixes so that the environment variables with this prefix are only passed to the coprocessor runtime environment. For example, setting MIC_ENV_PREFIX=MIC will allow you to set MIC_OMP_NUM_THRADS=240 and OMP_NUM_THREADS=16. This allows you to run OpenMP applications with different numbers of threads on the host processor and the Xeon Phi coprocessor. The environment variable that is set on the coprocessor will have the MIC prefix removed, that is it will be set to OMP_NUM_THREADS on the coprocessor.

When you have multiple coprocessor in a host and you want to specify environment variables specific to a coprocessor, you can do so by using mic_prefix_card_number_var=value. For example, to set the number of OpenMP threads on coprocessor 2 to 60 instead of 240, you can set the environment variable on the host to MIC_2_OMP_NUM_THREADS=60. Other environment variables supported by Intel Compiler to support coprocessor execution are listed in the sections that follow.

OFFLOAD_REPORT

Outputs offload execution time and bytes exchanged between the host and coprocessor. This environment variable is equivalent to using the __Offload_report API.

The supported values are:

  • 1: Produces a report about time taken for offload work

  • 2: Produces time taken and bytes transferred

  • 3: Add details of offload activity

  • None: No output produced (default)

MIC_LD_LIBRARY_PATH

Specifies the location for a target-shared object on the host. This is required by the compiler runtime to discover the dependent dynamic library files for the offloaded code to be sent over to the coprocessor for successful execution. This is set by default when you set up the Composer XE environment variable through compilervars.sh.

MIC_PROXY_IO

Because MIC devices are not directly connected to the I/O devices on the host, they are to be proxied by the compiler runtime environment. This environment variable enables (1) or disables (0) the proxy of stderr or stdout to the coprocessor. If enabled, which is the default setting, the “printf” and other outputs to stdout or stderr devices will be reflected on the host from the code executing on the coprocessor.

MIC_STACKSIZE

This environment variable is used to specify the stack size of the main thread for the offload. It corresponds to ulimit -s (BASH shell) or limit stacksize (C shell) on the host Linux environment.

Syntax: MIC_STACKSIZE = integerB|K|M|G|T, where the integer is the value of the stack size requested and B, K, M, G, and T stand for bytes, kilobytes, megabytes, gigabytes, and terabytes, respectively. The default stack size on the coprocessor is 12M.

OFFLOAD_DEVICES

This environment variable restricts the offload process to coprocessors listed as comma-separated list values of this environment variable. The devices are numbered from 0 to the number of devices on system-1. Setting OFFLOAD_DEVICES=0,2 will restrict offloading to device number 0 (first) and device number 2 (3rd) coprocessor in a system with three or more coprocessors.

OFFLOAD_INIT

Specifies the runtime and when to initialize the coprocessors in a system. The values are on_start, where all the coprocessors are initialized upon entering the main, on_offload, where specific coprocessors are initialized immediately before the offload to those specific coprocessors, and on_offload_all(default), where all available coprocessors, regardless of whether code will be offloaded to them or not, are initialized immediately before the first offload in an application.

Compiler Options

Because the same code can be used on hosts with or without Xeon Phi coprocessors, Intel compiler defined various compiler switches to control the compilation of the code for the coprocessors. For example, you can turn off the offload compilation portion of the source code with a switch. Also the compiler can provide various diagnostic messages related to offload extensions and their behavior with appropriate switches, as follows:

  • -mmic: This compiler switch is used to cross-compile the source for Xeon Phi targets. Binary code produced with this switch runs natively under the coprocessor OS.

  • -no-offload: This allows you to build the source ignoring pragmas related to Xeon Phi offload. Thus you can make sure the code generated runs only on the host processor.

  • -offload-attribute-target: This causes the compiler to make all the file scope functions and the data objects in the source file available for execution on Xeon Phi. This achieves equivalent functionality by using offload attribute target(mic) with all of these functions and data in the source code.

  • -offload-option: Provides options to be used for building Xeon Phi specific codes.

  • -opt-report-phase=offload: Allows you to generate an optimization report during offload specific compilation. It prints out the input and output variables the host sends to the offload target and variables that the host receives from the target.

Creating Offload Libraries

You can use xiar or ‘xild –lib’ to create static libraries with offload code to be used in offload applications. In order to do so, you need to first create a library using –offload-build option with xiar or ‘xild –lib’. This will create two versions of the library: one for the CPU, lib.a, and one for the MIC, libMIC.a. For example, xiar –offload-build rcs libtest.a obja.o objb.o will create two libraries: libtest.a containing obja.o, objb.o, and libtestMIC.a containing objaMIC.o, objbMIC.o objects. When linking in these libraries with your application, use the –ltest option, which will cause the compiler to link in appropriate libraries for the host and Xeon Phi.

Intel Fortran Composer XE

Intel Fortran compiler allows you to write code that runs natively on Intel Xeon Phi coprocessor or build host code with part of execution offloaded to Xeon Phi coprocessor. Many of the supports described in the C/C++ compiler for Xeon Phi are also available on the Fortran compiler, which are relevant to Fortran language programmers. Because the semantics are the same for Fortran as they are for C++, you may refer to the C/C++ sections earlier in this chapter for detailed descriptions of these constructs.

Directives

Directives are equivalent to C/C++ pragmas. Please refer to the Intel Compiler XE users and reference guide installed with the compiler for syntax of these directives. The explanations of these directives are the same as those described in the C/C++ section earlier and will not be covered here. Data exchanges with these offload directives are scalars, arrays, and Fortran-derived types, which are bitwise copyable.

A brief description of the directives supported by Intel Fortran compiler follows:

ATTRIBUTES OFFLOAD directive: This directive is used to specify the variables and procedure that should be made available on the coprocessor. All procedures called from the procedure marked with ATTRIBUTES OFFLOAD should also be defined with the same directive so they are available on the coprocessor. The compiler will issue warnings for procedures and data referenced in the offload section that are not marked with the ATTRIBUTE OFFLOAD directive.

OFFLOAD directive: This directive transfers data and executes the statements and directives following this directive on the coprocessor. The statement following the OFFLOAD directive must be one of the following, which are executed on the coprocessor:

  • An OpenMP PARALLEL, PARALLEL SECTIONS or PARALLEL DO directives

  • A CALL statement

  • An assignment statement where the right-hand side only calls a function

OFFLOAD_TRANSFER/OFFLOAD_WAIT directive: Used for asynchronous data transfer between the host and the coprocessor. Usage is the same as defined in the C/C++ section above.

OFFLOAD BEGIN and END OFFLOAD directive: Causes a group of statements bracketed by these directives to execute on the coprocessor. This is similar to the OFFLOAD directive but allows you to offload a block of code. However, you cannot have OpenMP directives inside the code block within these two directives, where you would need to use the OFFLOAD directive only.

Macros

Intel Fortran compiler also supports __MIC__, __KNC__, and __INTEL_OFFLOAD macros defined in the C/C++ section earlier in this chapter.

Application Programming Interfaces

The APIs are provided as part of the mic_lib.f90 file. These functions allow you to deal with multiple coprocessors, calling functions on the CPU to modify the coprocessor’s execution environment.

Environment Variables, Compiler Options, and Creating Static Libraries

Please refer to corresponding sections described earlier in this chapter related to C/C++ compiler.

Third-Party Compilers Supporting Xeon Phi

There are few third-party compilers that support the Xeon Phi coprocessor. The main issue with compiler support for Xeon Phi is the lack of standards for supporting offload compilation. With the standardization of OpenMP 4.0, more compilers will support the Xeon Phi coprocessor.

CAPS Compiler

CAPS compilersFootnote 1 OpenCL backend supports code generation for Xeon Phi coprocessor through OpenHMPP and OpenACC directives.

Debugging Xeon Phi Applications

There are two solutions available for debugging applications with code running on Xeon Phi coprocessors:

  1. 1.

    Intel Debugger with Eclipse IDE (integrated design environment) and command line interface

  2. 2.

    GNU debugger for Xeon Phi coprocessor

Intel Debugger

Intel debugger (IDB)has been extended to support Intel Xeon Phi architecture for debugging C++ and Fortran programs in the command line or graphical user interface (GUI) running on the host system. IDB allows you to debug both the offloaded as well as the native applications. For native applications, you need to attach to a process running on the coprocessor.

The debugger for Xeon Phi coprocessor support is modified to consist of two loosely coupled debuggers. The host debugger (idbc) component runs on the host. It launches and debugs the offload application built with Xeon Phi support. Under the GUI environment the target debugger (idbc_mic) is launched by the host debugger when it encounters an offload directive to start executing on the coprocessor by attaching to the offloaded process. For applications that run natively on the Xeon Phi, you can use the target debugger by attaching to the process running natively on the coprocessor, or you can start the application with the target debugger. The auto-attach feature is not available when using the command line version of the debugger. Under Eclipse IDE, when debugging offload code, the debugger automatically switches to view and control code running on the Xeon Phi coprocessor. The behavior of the IDB for debugging the offload code can be describes as follows:

  1. 1.

    The host debugger starts the main application process.

  2. 2.

    When an offload pragma or declaration is encountered in C++ or Fortran, the offload process gets launched on the coprocessor.

  3. 3.

    The debugger remotely attaches to the offloaded process running on the Xeon Phi through a debug agent downloaded to the coprocessor, and it communicates to the host debug process through the TCP/IP.

  4. 4.

    Once the offload process finishes, the view changes to host process.

  5. 5.

    Once the host process has finished, the offload process is removed and the debugger detaches from the offload process.

Please refer to the Intel debugger users and reference guide for the details on using Intel debugger for debugging Xeon Phi applications.

Third-Party Debuggers

There are several third-party debuggers that support debugging codes built for Intel Xeon Phi coprocessor. Some of them are discussed below.

GNU Debugger

The GNU debugger (GDB) supports the Intel Xeon Phi coprocessor and is provided as part of the MPSS package. The support for Xeon Phi is provided through native and cross-remote debug versions. It supports C/C++ and Fortran and Parallel Debug Support (PDBX). There are three components to the GDB for Xeon Phi. Two of the components are for host-based debugging, that is you start the debugging session on the host, and the other one is for Xeon Phi native debugging. The host-based debugging components are the debugger (x86_64-k1om-linux-gdb) and the debug server (gdbserver) running on the Xeon Phi. The native debugger is gdb.

The debugger allows you to run gdb commands on the coprocessor so you can obtain coprocessor-related information. For example, you can view all the vector registers and masks using the command “info registers zmm.” Listing 8-7 is the output of such a command executed in the gdb running natively on Xeon Phi coprocessor.

Listing 8-7. Output of “info registers zmm” Command in Xeon Phi Native Version of GDB

(gdb) info register zmm

k0             0x0      0

k1             0x0      0

k2             0x0      0

k3             0x0      0

k4             0x0      0

k5             0x0      0

k6             0x0      0

k7             0x0      0

zmm0           {v16_float = {0x0 <repeats 16 times>}, v8_double = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v64_int8 = {

    0x0 <repeats 64 times>}, v32_int16 = {0x0 <repeats 32 times>}, v16_int32 = {0x0 <repeats 16 times>}, v8_int64 = {0x0,

    0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_uint128 = {0x00000000000000000000000000000000,

    0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000}}

TotalView

TotalView by Rogue Wave Software is a popular commercial debugging tool for single and multithreaded application and provides support for debugging Intel Xeon Phi coprocessor.Footnote 2 It supports OpenMP and MPI debugging.

Distributed Debugging Tool

The Distributed Debugging Tool (DDT) is a commercial debugging tool developed by Allinea software that supports Intel Xeon Phi.Footnote 3 The tools support OpenMP and MPI debugging in addition to scalar code debugging. It can be used for cluster debugging purposes.

Optimization Tool: Intel Vtune Amplifier XE

Intel Vtune Amplifier XE provides a GUI and command line tool for understanding an application runtime profile and allows you to detect the critical code sections, also known as hot spots, and optimize the code for better performance on the Xeon Phi coprocessor. It can also provide you with critical information such as bandwidth used by your application during its runtime, threading behavior, load imbalance, and so forth to help detect possible performance bottlenecks.

Intel Xeon Phi hardware implements performance monitoring hardware to allow low overhead profiling of the runtime execution behavior of applications.Footnote 4 The Vtune Amplifier XE comes with a sampling driver that gets installed on the Xeon Phi coprocessor OS. This driver is responsible for communicating with the Xeon Phi hardware performance monitoring unit and collecting hardware events corresponding to applications executed on the coprocessor. You can start the event collection process from the host using the GUI or command line interface provided by the Vtune Amplifier XE tool. The help documentation installed is a good guide on how to use the Vtune Amplifier XE. Here I will provide only a rough outline on how to use the tools for profiling the Amplifier XE binaries.

In order to collect performance data for applications running on Xeon Phi, you need to take following steps:

  1. 1.

    Set up a project on the Vtune Amplifier XE that is responsible for executing application that will run on the Xeon Phi coprocessor. Because the application is executed from the host, you need to make sure the scripts called from the project set-up target specification performs the necessary upload of the executable with the needed dependent files and executes the application on the Xeon Phi. You also need to build your binary with the appropriate debug symbols (-g –debug inline-debug-info), so the tool can point you to the source location of the hotspots located by the tool. Then you can specify the search directories using the set-up screen so that the source files are locatable by the tool. You may also need to add the Composer XE library path containing mic libraries (e.g., /opt/intel/composer_xe_2013/lib/mic) and the vmlinux (typically /lib/firmware/mic) to locate symbols related to these runtime libraries.

  2. 2.

    Start the performance event collection. You start this event collection analysis by clicking the ‘New analysis’ (‘>’) button on the toolbar. This will open an analysis window, where you can select various predefined analysis types include ‘Lightweight hotspot.’ The “Lightweight hotpot” profile is a good starting point to get an initial understanding of the application runtime profile on the coprocessor. Default collection happens on coprocessor ‘0.’ For other processors, you can enter the value or comma-separated values in the field “List of Intel Xeon Phi coprocessor cards.” Click the Start button to begin the collection process. The default “Lightweight hotpot” collects CPU_CLK_UNHALTED and INSTRUCTIONS_EXECUTED events. You can create a custom event for the Xeon Phi coprocessor by clicking the “Create New Analysis type” (showed using symbol Λ+) button on the Analysis type toolbar. From the drop-down list, select “New Knights Corner Event-based Sampling Analysis.” This will bring up the Custom Analysis window, as shown in Figure 8-1, where you can add custom events available for the Xeon Phi processor by clicking the “ADD” button in the dialog box. Note that the Add custom analysis dialog box may be small and may need to be stretched by grabbing the lower right corner before you can see the buttons.

Figure 8-1.
figure 1

Vtune Custom Analysis for Xeon Phi

  1. 3.

    View and interpret the results. This is the analysis phase, which is similar to any other Intel platform performance analysis. Here you look at the application hotspots and use your hardware knowledge of the Xeon Phi coprocessor, as discussed in Part 1 of this book, to understand what the possible issues are. Using application knowledge and tuning expertise, you may be able to modify code, user compiler switches, and pragmas to optimize the performance of the whole application.

Libraries

Intel provides the Intel Math Kernel Library (MKL) optimized for Xeon Phi. It provides technical computing applications for extracting performance from Xeon Phi coprocessors for commonly used math functions. MKL can be included in your code in two primary ways. You can use the Xeon Phi version of the library to link to the native version of the code or call the MKL routine from the offloaded version of the code.

Native or Symmetric Execution

Native execution occurs when the application runs completely in the Xeon Phi coprocessor under the coprocessor OS. This requires minimal changes to the application and may benefit some code where the processes can execute in parallel with host processes, such as in symmetric mode. To build an application that runs natively, you need to use the –mmic switch, as discussed earlier in this chapter on compiler usage. You also need to link in the Xeon Phi version of the MKL. The native version of the MKL library is by default installed in the /opt/intel/composer_xe_2013/mkl/lib/mic folder in your build system. Once built and linked this way, you can execute this code under the native OS or use the binary as part of an MPI run in symmetric mode. I cover the MPI execution details later in this chapter.

Note that you need to send all the necessary dependent libraries to Xeon Phi for execution and set the runtime environment appropriately for dynamically linked libraries.

Compiler-Assisted Offload

In this execution model, you use compiler offload or target directives to send computation to the coprocessor. The part of the execution that happens on the coprocessor can call into MKL by linking the Xeon Phi version of MKL to that section of code. An example of such usage using the sgemm call is shown Listing 8-8. Here matrices A, B, and C are sent to the coprocessor and the resulting matrix C is returned to the host. Please refer to Intel MKL Library users guide on how to link in the MKL libraries for Xeon Phi offload. It is similar to host linking except a different path is used to get the libraries.Footnote 5

Listing 8-8. Code Using the sgemm Call

#pragma offload target(mic:0) \

    in(A: length(matrix_elements) ) \

    in(B: length(matrix_elements) ) \

    in(transa, transb, N, alpha, beta) \

    inout(C:length(matrix_elements) )

    {

        sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N,

                &beta, C, &N);

    }

Using the Automatic Offload Version of the MKL Library

In this case, the MKL library runtime executing on the host processor detects the presence of a coprocessor and can automatically offload some of the processing to the coprocessor to take advantage of additional computing power provided by Xeon Phi. This usage model enables you to link your application to the MKL as you would normally do on the host OS, however, using an environment variable or a function call, thus providing performance improvement with fewer changes to the code than compiler-assisted offload. MKL determines the best division of work between the host and the coprocessors. However, you can specify custom work division using the environment variable of utility functions provided by MKL. You need to set MKL_MIC_ENABLE to 1 or call mkl_mic_enable() in your code to activate automatic offload of MKL computations to Xeon Phi coprocessor. Other relevant environment variables and APIs are:

  • MKL_HOST_WORKDIVISION: Its value is a floating point number between 0.0 and 1.0 and specifies the fraction of work that will be done by the host processor. For example, a value of 0.2 will indicate 20 percent of the work is done on host processor and 80 percent is offloaded to coprocessors in the system. Note that corresponding API is mkl_mic_set_workdivision().

  • MKL_MIC_WORKDIVISION: Indicates the fraction of work to be performed on the Xeon Phi coprocessors. This divides the work among all the coprocessors available in the system. If you want to target specific coprocessors, you can use MKL_MIC_<coprocessor_number>_WORKDIVISION. The corresponding API is mkl_mic_set_workdivision().

  • MKL_MIC_MAX_MEMORY: Specifies the maximum coprocessor memory that can be used by automatic offload in all coprocessors. Specific coprocessor can be designated by MKL_MIC_<coprocessor_number>_MAX_MEMORY. Memory size is in kilobytes (K), megabytes (M), gigabytes (G), or terabytes (T). The equivalent API is mkl_mic_set_max_memory().

Third-Party Math Libraries

There are several other libraries available for application development on Xeon Phi coprocessor that you may make use of. Some are summarized in the following sections.

Magma

Magma MIC was developed by the Innovative Computing Laboratory at the University of Tennessee. It provides dense linear algebra library routines ported to Xeon Phi. It is available at http://icl.cs.utk.edu/magma/software/index.html .

ArrayFire

ArrayFire by AccelerEyes is an OpenCL library that supports Xeon Phi coprocessors and Xeon processors. The library contains various math, financial, and image-processing functions of use to various technical computing and other computational domains.

Intel Cluster Tools

Intel Cluster toolshelp you to build applications that can be run and analyzed in a cluster environment containing Xeon Phi coprocessors. Because Xeon Phi is mainly expected to be used in technical computing or high-performance computing applications, these tools are critical to support such application enabling. For the Xeon Phi coprocessor environment, the tool includes the MPI library that helps you create an MPI-based application that runs natively in a Xeon Phi coprocessor, a node with a Xeon Phi processor in symmetric mode, or a cluster of nodes with Xeon Phi coprocessors. The toolset also includes Intel Trace Collector and Analyzer, which allows you to understand how the messages are exchanged between various MPI processes (ranks) and how the computation and communication overlap.

Chapter 7 on system software explained that the Xeon Phi coprocessor provides extensive support for the Intel MPI library. The usage model for Xeon Phi is the same as that of the host except you need to link in the Xeon Phi version of MPI with your application binary targeted toward execution on the coprocessor OS. For native applications running solely on Xeon Phi, you can link in the Xeon Phi version of the MPI library with the code and set up the library path and binary path so that the executable can locate the MPI drivers (i.e., mpiexec.hydra, mpirun). The execution is similar to that of an MPI application execution on host processor environment on Linux OS.

For symmetric execution, you need to build two sets of binaries, one for Xeon and the other for Xeon Phi. The Xeon Phi binary should be cross-compiled with the –mmic switch so it can execute native on Xeon Phi. For example, say you have built two MPI-based binaries, test.mic and test.host. You can place the necessary Xeon Phi binary on the card, in the /tmp directory, with all the relevant libraries. In this case, a symmetric execution will be invoked as follows:

mpirun –n 1 –host localhost /tmp/test.host : -n 1 –host mic0 /tmp/test.mic

This will cause the test.host and the test.mic to start on the host and mic0 coprocessor as two MPI ranks of a single application test, and they can communicate with each other through the MPI messages to perform its task. Note that the MPI tasks can use OpenMP underneath to make use of the processor and coprocessor cores for its execution. You can set the type of fabric used by the MPI by setting the I_MPI_FABRICS environment variable. For example, to use RDMA for MPI communication between the host and Xeon Phi, use I_MPI_FABRICS=shm:dapl.Footnote 6

Third-Party Cluster Tools

In addition to the tools provided by Intel Cluster, there are other third-party tools available for cluster development and management with support for Xeon Phi coprocessors. Some of these are listed in the sections that follow.

PBS Professional

PBS Professional by Altair is a cluster job scheduling software that supports the Xeon Phi coprocessor for cluster job scheduling, management, and monitoring.Footnote 7This is done by native integration of the coprocessor as a resource for job scheduling in the PBS scheduling manager. The tool can log the coprocessor usage, such as the number of coprocessors or number of cores in a coprocessor, to help the management process.

Bright Cluster Manager

Bright Cluster Manager by Bright Computing allows you to install, manage, and monitor clusters with Xeon Phi coprocessors by native integration of the coprocessor as a computing resource.Footnote 8 The tool accounts for Xeon Phi resources in its scheduling and management process of HPC clusters.

Summary

This chapter looked at various tools and libraries developed by Intel and third parties to help develop, debug, and tune applications on hosts containing the Intel Xeon Phi. It is important to learn to use these tools properly to get the benefits from the Intel Xeon Phi.

The next chapter will cover the application development considerations on Xeon Phi coprocessor using the tools and libraries described here. These considerations are necessary to get optimal performance on Xeon Phi coprocessors.