
1 Introduction

Graphics processing units (GPUs) were initially designed to process visual data, but they comprise many cores that can process large blocks of data in parallel. GPU technology has been adopted in a wide range of applications in recent years because its highly parallel architecture offers substantial computational power. GPUs are therefore not limited to graphics workloads; they can also be used to improve the performance of other applications. In systems that are not intended for GPGPU computing, a large portion of GPU memory sits idle. Transcendent Memory [1] provides an approach to collect the idle physical memory in a virtualized environment and manage it through application program interfaces (APIs); host systems use these interfaces to access the idle physical memory when required. We adopt this approach in our system to utilize idle GPU memory for improving OS performance.

We have implemented a system that manages GPU memory as a cache for file system IO requests in the operating system.

Similar work was proposed in BAG (GPU as buffer cache) [2]. Our work extends BAG and adds stronger data security to the existing design. Data security is ensured in our system by encrypting and decrypting the data with the Twofish algorithm [3], whereas BAG [2] used the AES encryption algorithm. Twofish [3] encrypts 128-bit blocks of data using a key of varying size up to 256 bits. It is a symmetric cipher, so the same key is used to encrypt the plain text and decrypt the cipher text. Twofish runs faster than AES, reducing the time spent on encryption and decryption.

2 Design and Implementation

2.1 System Architecture

A graphics processing unit is composed of thousands of computational cores that can process data in parallel. Input and output requests sent by user applications to the file system are processed by the operating system through system calls. We have developed a system that acts as an interface between the OS and user applications. Our system uses GPU memory as a cache and stores the most recently accessed file content. When a request for the same content arrives, it is served from the cache instead of fetching the data from the file system.

Our system is composed of three main components: the indirector, the relay, and a user space daemon. The indirector receives read and write requests from user applications. A data block is identified by its logical block address, and each request specifies its type (read or write) and the logical block address. Separating read requests from write requests is the first step, followed by the lookup operation. The lookup operation checks whether the required block exists in GPU memory and thereby divides the read requests further into hit requests and miss requests. Read hit requests are sent to the relay, while read misses and write requests are redirected to the OS. The architecture of our system is shown in Fig. 1.

Fig. 1 Parallel cache management system architecture

The relay maintains the read request queue and the write request queue. Multiple threads are invoked to process the read hit requests in parallel, and the corresponding data blocks are transferred from the GPU to user space.

Because this system stores file system data in GPU memory, the confidentiality of that data must be ensured. We implemented the Twofish algorithm to encrypt the plain text before storing it in GPU memory and to decrypt the cipher text after fetching it back. Identical keys are used for encryption and decryption, and the key is never stored in the GPU. Encryption and decryption are performed by a CPU core to maintain data confidentiality.
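The following is a minimal host-side sketch of this flow, assuming a 4 KB block size and illustrative names throughout. The twofish_encrypt/twofish_decrypt routines are stand-ins for a real Twofish implementation [3]; the XOR placeholder bodies only keep the sketch self-contained and are not a secure cipher.

    // Host-side sketch: encrypt on a CPU core, cache cipher text on the GPU.
    // All names are assumptions; the cipher bodies are placeholders.
    #include <cuda_runtime.h>
    #include <cstdint>
    #include <cstddef>

    #define BLOCK_SIZE 4096  // assumed file system block size

    static void twofish_encrypt(const uint8_t *key, const uint8_t *in,
                                uint8_t *out, size_t n) {
        for (size_t i = 0; i < n; i++)
            out[i] = in[i] ^ key[i % 32];      // placeholder, NOT real Twofish
    }
    static void twofish_decrypt(const uint8_t *key, const uint8_t *in,
                                uint8_t *out, size_t n) {
        twofish_encrypt(key, in, out, n);      // XOR placeholder is its own inverse
    }

    // Encrypt on the CPU, then store the cipher text in GPU memory; the
    // key itself never leaves host memory.
    void cache_store(uint8_t *d_slot, const uint8_t *plain, const uint8_t *key) {
        uint8_t cipher[BLOCK_SIZE];
        twofish_encrypt(key, plain, cipher, BLOCK_SIZE);
        cudaMemcpy(d_slot, cipher, BLOCK_SIZE, cudaMemcpyHostToDevice);
    }

    // Fetch the cipher text from GPU memory, then decrypt on the CPU.
    void cache_load(uint8_t *plain, const uint8_t *d_slot, const uint8_t *key) {
        uint8_t cipher[BLOCK_SIZE];
        cudaMemcpy(cipher, d_slot, BLOCK_SIZE, cudaMemcpyDeviceToHost);
        twofish_decrypt(key, cipher, plain, BLOCK_SIZE);
    }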

A garbage collector kernel reclaims the memory occupied by older blocks and makes it available to new threads. The system manages the GPU memory as a circular buffer in which the front end of the buffer stores the most recently accessed data block.
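As an illustration (slot count and names are assumptions), the circular-buffer management reduces to handing out slots in order; when the buffer is full, the slot at the head holds the oldest cached block and is reused after the garbage collector reclaims it:

    #define NUM_SLOTS 1024   // assumed number of GPU cache slots

    static int head = 0;     // front of the buffer: next slot to hand out
    static int used = 0;     // slots currently holding cached blocks

    int alloc_slot(void) {
        int slot = head;                  // when full, this slot holds the oldest block
        head = (head + 1) % NUM_SLOTS;
        if (used < NUM_SLOTS) used++;     // otherwise the oldest block was reclaimed
        return slot;                      // caller caches the new block here
    }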

2.2 Hash Table and LRU List Data Structure

The hash table determines whether a data block exists in the GPU. It is implemented as a one-dimensional pointer array with direct addressing: the hash function takes the logical block address, and the stored value gives the GPU address at which the corresponding data block is cached. The garbage collector reclaims the memory holding the oldest block when required, so the access order of the data blocks must be recorded. A least recently used (LRU) list serves this purpose, implemented as a doubly linked list (each node is connected to both its previous and next nodes) that tracks the access order. The hash table and LRU list data structures are shown in Fig. 2.

Fig. 2 Hash table and LRU list data structure
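To make Fig. 2 concrete, a minimal sketch of how the two structures could be laid out is given below; all names, sizes, and the heap-based allocation are assumptions, not the authors' code.

    #include <cstdlib>

    #define MAX_LBA (1 << 20)    // assumed size of the logical address space

    struct CacheEntry {
        unsigned long lba;       // logical block address of the cached block
        int gpu_slot;            // GPU buffer slot holding the (encrypted) data
        int dirty;               // 1 if modified since load (Sect. 2.3)
        CacheEntry *prev, *next; // neighbours in the LRU list
    };

    static CacheEntry *hash_table[MAX_LBA];  // direct addressing: index = LBA,
                                             // nullptr = block not cached (miss)
    static CacheEntry *lru_head;             // most recently used block
    static CacheEntry *lru_tail;             // least recently used (victim)

    // Move an entry to the head of the LRU list after it is accessed.
    void lru_touch(CacheEntry *e) {
        if (e == lru_head) return;
        if (e->prev) e->prev->next = e->next;   // unlink from current position
        if (e->next) e->next->prev = e->prev;
        if (e == lru_tail) lru_tail = e->prev;
        e->prev = nullptr;
        e->next = lru_head;
        if (lru_head) lru_head->prev = e;
        lru_head = e;
        if (!lru_tail) lru_tail = e;
    }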

2.3 Parallelizing the IO Request

The IO requests received by our system are divided into read and write requests by the indirector, which then separates the read hit requests from the read misses and write requests. The splitting of IO requests is shown in Fig. 3.

Fig. 3 Parallelizing the IO request

IO requests are divided into batches, where the number of requests in each batch equals the number of GPU blocks allocated to the proposed system, and the requests are processed batch by batch. Each IO request carries a logical block address, and a lookup on the hash table determines whether the corresponding block exists in the GPU. If the requested block is present, the request is treated as a read hit; otherwise it is a read miss. This classification of the read requests is parallelized: multiple GPU threads access the hash table simultaneously. The read hit requests are processed first, followed by the read misses. Retrieving content from the GPU is likewise done in parallel by multiple threads, with each thread processing a single request in each batch.
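A minimal CUDA sketch of the parallel hit/miss classification follows. The device-side table d_slot_of maps an LBA directly to a GPU slot index, or -1 when the block is not cached; the names and layout are assumptions.

    __global__ void classify_requests(const unsigned long *d_lba,  // batch of LBAs
                                      const int *d_slot_of,        // direct-addressed table
                                      int *d_hit,                  // 1 = read hit, 0 = miss
                                      int n)                       // requests in this batch
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;                     // one thread per request
        d_hit[i] = (d_slot_of[d_lba[i]] >= 0);  // simultaneous table lookups
    }

    // launch example: one thread per request in the batch
    // classify_requests<<<(n + 255) / 256, 256>>>(d_lba, d_slot_of, d_hit, n);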

Read miss requests are processed sequentially on the CPU. The requested block is retrieved from the file system and loaded directly into the GPU if there is enough space. If GPU space is insufficient for a new block, the cache eviction algorithm finds a victim block and replaces it with the new one. The least recently accessed block is chosen as the victim; the tail node of the LRU list always stores the address of this block. A dirty bit indicates whether the corresponding block has been modified. If the victim's dirty bit is zero, it is replaced with the new block directly; otherwise, the victim's content is first written back to the file system and then replaced.
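A host-side sketch of this eviction path, reusing the CacheEntry, LRU list, and hash table sketched in Sect. 2.2, could look as follows; fs_write_block is a hypothetical write-back helper, not part of the authors' code.

    static void fs_write_block(unsigned long lba, int gpu_slot) {
        // placeholder: copy the slot back from the GPU, decrypt, write to FS
    }

    int evict_victim(void) {
        CacheEntry *victim = lru_tail;            // least recently accessed block
        if (victim->dirty)                        // modified copy: flush to FS first
            fs_write_block(victim->lba, victim->gpu_slot);
        int slot = victim->gpu_slot;              // reused by the new block
        lru_tail = victim->prev;                  // shorten the LRU list
        if (lru_tail) lru_tail->next = nullptr;
        hash_table[victim->lba] = nullptr;        // block no longer cached
        free(victim);                             // entries assumed malloc-ed
        return slot;
    }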

Write requests are also processed sequentially on the CPU. If the requested block already exists in the GPU, it is overwritten and its dirty bit is set to one. Otherwise, the block is loaded into the GPU from the file system, updated with the new content, and its dirty bit is set to one. The update is not propagated to the file system at processing time (a write-back policy).
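Combining the earlier fragments, the write path can be sketched as below. All names are assumptions; load_block_from_fs is a hypothetical helper (declaration only) whose body would follow the read-miss path of the previous paragraph.

    extern uint8_t *d_cache;                            // base of the GPU cache region
    CacheEntry *load_block_from_fs(unsigned long lba);  // assumed read-miss helper

    void handle_write(unsigned long lba, const uint8_t *data, const uint8_t *key) {
        CacheEntry *e = hash_table[lba];
        if (!e)                                   // not cached: bring the block in first
            e = load_block_from_fs(lba);
        cache_store(d_cache + e->gpu_slot * BLOCK_SIZE, data, key);  // overwrite in place
        e->dirty = 1;                             // flushed only at eviction time
        lru_touch(e);                             // now the most recently used
    }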

2.4 Cryptographic Kernel

The cryptographic kernel handles encryption and decryption of data blocks. In the proposed system, the Twofish algorithm [3] is used for encryption: plain text is converted into cipher text before it is stored in the cache. The Twofish architecture, as given in “Twofish: A 128-Bit Block Cipher” [3], is shown in Fig. 4.

Fig. 4 Twofish architecture [3]

The Twofish algorithm [3] encrypts 128 bits of data using a key of varying size up to 256 bits. The same cryptographic key is used to encrypt the plain text and decrypt the cipher text. The cipher preprocesses the plain text and then applies the round function ‘F’ in a series of 16 rounds.

2.4.1 Feistel Network

A Feistel network is a general construction that transforms a given function ‘F’ into a permutation, where F is defined as in Eq. (1).

$$F:\{0,1\}^{n/2} \times \{0,1\}^{N} \rightarrow \{0,1\}^{n/2}$$
(1)

For every block of size n bits, it processes n/2 bits of the block together with N bits of key material and transforms them into an output string of n/2 bits.

2.4.2 S-Boxes

Twofish derives additional security from its four S-boxes, which vary based on the cryptographic key. Two fixed 8-by-8-bit permutations and the cryptographic key are used to build these S-boxes, and every data byte is substituted nonlinearly according to the values stored in them.

2.4.3 MDS Matrices

The main diffusion mechanism uses an MDS matrix of size 4 × 4. A matrix is maximum distance separable if and only if all of its possible square submatrices are nonsingular.

$$u = [{\text{MDS}}]\,v$$
(2)

It maps ‘v’ to ‘u’, as given in Eq. (2), by multiplying ‘v’ with the MDS matrix, where ‘u’ and ‘v’ are vectors of four bytes each.
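A worked sketch of Eq. (2) is shown below: a four-byte vector is multiplied by the 4 × 4 MDS matrix over GF(2^8). The matrix entries and the reduction polynomial (x^8 + x^6 + x^5 + x^3 + 1) are quoted here from the Twofish paper [3] and should be checked against it before reuse.

    #include <cstdint>

    // Multiply two elements of GF(2^8), reducing by x^8 + x^6 + x^5 + x^3 + 1.
    static uint8_t gf_mul(uint8_t a, uint8_t b) {
        uint8_t p = 0;
        while (b) {
            if (b & 1) p ^= a;         // add (XOR) the current multiple of a
            uint8_t carry = a & 0x80;
            a <<= 1;
            if (carry) a ^= 0x69;      // x^8 ≡ x^6 + x^5 + x^3 + 1
            b >>= 1;
        }
        return p;
    }

    static const uint8_t MDS[4][4] = {  // entries as given in [3]
        {0x01, 0xEF, 0x5B, 0x5B},
        {0x5B, 0xEF, 0xEF, 0x01},
        {0xEF, 0x5B, 0x01, 0xEF},
        {0xEF, 0x01, 0xEF, 0x5B},
    };

    // u = [MDS] v, Eq. (2): a GF(2^8) matrix-vector product.
    void mds_multiply(const uint8_t v[4], uint8_t u[4]) {
        for (int i = 0; i < 4; i++) {
            u[i] = 0;
            for (int j = 0; j < 4; j++)
                u[i] ^= gf_mul(MDS[i][j], v[j]);
        }
    }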

2.4.4 Pseudo-Hadamard Transform (PHT)

The Pseudo-Hadamard Transform is a fast and reversible diffusion operation. Given two inputs a and b, the 32-bit PHT is defined as

$$a^{\prime} = a + b\bmod 2^{32}$$
(3)
$$b^{\prime} = a + 2b\bmod 2^{32}$$
(4)

Two 32-bit Pseudo-Hadamard Transforms take place in parallel on the outputs from the MDS matrices.
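Eqs. (3) and (4) translate directly into code; unsigned 32-bit arithmetic makes the mod 2^32 implicit, and writing out the inverse shows that the transform is indeed reversible.

    #include <cstdint>

    void pht(uint32_t &a, uint32_t &b) {
        uint32_t a1 = a + b;        // a' = a + b  mod 2^32, Eq. (3)
        uint32_t b1 = a + 2 * b;    // b' = a + 2b mod 2^32, Eq. (4)
        a = a1; b = b1;
    }

    void pht_inverse(uint32_t &a1, uint32_t &b1) {
        uint32_t b = b1 - a1;       // b = b' - a' mod 2^32
        uint32_t a = a1 - b;        // a = a' - b  mod 2^32
        a1 = a; b1 = b;
    }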

2.4.5 Whitening

Whitening is the process of performing a bit-wise XOR with key material before the first round and after the last round, where the former is known as pre-whitening and the latter as post-whitening.

The Twofish encryption algorithm splits the plain text (T) into four 32-bit words (T0, T1, T2, T3), which first pass through the input whitening step individually. The two words on the left (T0, T1) are used as inputs to the g functions; each g function contains four byte-wide key-dependent S-boxes followed by an MDS matrix. The outputs of the two g functions are combined using a PHT, and the results are XORed with the other two words (T2, T3). The left (T0, T1) and right (T2, T3) halves are then swapped for the next round. These steps are repeated for all 16 rounds. The swap of the last round is undone, and the four words (T0, T1, T2, T3) are XORed with four key words to produce the cipher text.
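As a structural outline of one such round (not a working cipher), the sketch below quotes the rotation amounts and key indexing from the Twofish paper [3]; g() is a placeholder that in the real cipher applies the four key-dependent S-boxes followed by the MDS multiply (Sects. 2.4.2–2.4.3), and K is a stub for the expanded round keys.

    #include <cstdint>

    static uint32_t rol(uint32_t x, int r) { return (x << r) | (x >> (32 - r)); }
    static uint32_t ror(uint32_t x, int r) { return (x >> r) | (x << (32 - r)); }

    static uint32_t g(uint32_t x) { return x; }  // placeholder, see Sect. 2.4.2
    static uint32_t K[40];                       // expanded round keys (stub)

    void twofish_round(uint32_t T[4], int r) {
        // F function: two g computations combined by a PHT plus round keys
        uint32_t t0 = g(T[0]);
        uint32_t t1 = g(rol(T[1], 8));
        uint32_t f0 = t0 + t1     + K[2 * r + 8];  // PHT, cf. Eq. (3)
        uint32_t f1 = t0 + 2 * t1 + K[2 * r + 9];  // PHT, cf. Eq. (4)

        // mix F into the right half, with the design's one-bit rotations
        uint32_t r2 = ror(T[2] ^ f0, 1);
        uint32_t r3 = rol(T[3], 1) ^ f1;

        // swap halves for the next round
        uint32_t old0 = T[0], old1 = T[1];
        T[0] = r2; T[1] = r3;
        T[2] = old0; T[3] = old1;
    }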

2.5 Algorithm
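Tying the earlier sketches together, the per-batch algorithm of Sects. 2.1–2.3 can be outlined as follows. This is a reconstruction with assumed names, not the authors' listing, and the parallel read-hit service is shown sequentially for readability.

    void process_batch(const unsigned long *lbas, const int *is_write,
                       uint8_t **bufs, int n, const uint8_t *key) {
        // Step 1 (GPU): classify the batch into read hits and misses (Fig. 3)
        //   classify_requests<<<(n + 255) / 256, 256>>>(d_lba, d_slot_of, d_hit, n);

        // Step 2 (CPU/GPU): serve the requests
        for (int i = 0; i < n; i++) {
            CacheEntry *e = hash_table[lbas[i]];
            if (is_write[i]) {
                handle_write(lbas[i], bufs[i], key);     // sequential, write-back
            } else if (e) {                              // read hit
                cache_load(bufs[i], d_cache + e->gpu_slot * BLOCK_SIZE, key);
                lru_touch(e);
            } else {                                     // read miss
                e = load_block_from_fs(lbas[i]);         // may evict the LRU victim
                cache_load(bufs[i], d_cache + e->gpu_slot * BLOCK_SIZE, key);
            }
        }
    }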

3 Simulation Environment

3.1 NVIDIA GeForce 940M

The GeForce 940M is a graphics processing unit that supports DirectX 11. It has 2 GB of DDR3 SDRAM and 384 shader units. It is built on the Maxwell architecture, which is more efficient than the Kepler architecture. The GeForce 940M is composed of small streaming multiprocessors with 128 ALUs each and an optimized scheduler, and it outperforms the GeForce 840M because of its higher clock rate.

3.2 CUDA Toolkit Version 6.5

The CUDA toolkit provides an environment for developing software that runs on GPU cores and utilizes GPU memory. CUDA is a C-like programming language with additional support for writing GPU code that accesses GPU resources directly. A CUDA program includes both device functions (called from the GPU and executed on the GPU) and global functions (called from the CPU and executed on the GPU). CUDA programs execute only on computers with NVIDIA graphics processing units. Multiple GPU threads can execute a single global function in parallel on the GPU cores. CUDA was the first API to access GPU resources directly, and C programs can invoke CUDA code and vice versa.
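A minimal, self-contained example of the two function kinds described above: a __device__ function callable only from GPU code, and a __global__ kernel launched from the CPU and executed in parallel by many GPU threads.

    #include <cstdio>
    #include <cuda_runtime.h>

    __device__ int square(int x) {                  // device function: runs on the GPU
        return x * x;
    }

    __global__ void square_all(int *data, int n) {  // global function (kernel)
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = square(data[i]);       // one element per thread
    }

    int main() {
        int h[8] = {1, 2, 3, 4, 5, 6, 7, 8}, *d;
        cudaMalloc(&d, sizeof(h));
        cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
        square_all<<<1, 8>>>(d, 8);                 // launched from the CPU
        cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
        cudaFree(d);
        printf("%d ... %d\n", h[0], h[7]);          // prints: 1 ... 64
        return 0;
    }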

3.3 NVCC

NVCC is the compiler driver for the CUDA language. Its purpose is to hide the details of CUDA compilation from developers: non-CUDA code is forwarded to and compiled by the host C++ compiler, while NVCC takes care of the remaining operations such as splitting, merging, and preprocessing the CUDA code.

4 Evaluation

The proposed system was evaluated on a PC with a 2 GB NVIDIA GeForce 940M graphics card running CentOS 6.5. The time taken to process 100, 500, and 1000 requests using the sequential and parallel caches is shown in Tables 1, 2, and 3, respectively.

Table 1 Comparison of parallel and sequential cache with number of requests = 100
Table 2 Comparison of parallel and sequential cache with number of requests = 500
Table 3 Comparison of parallel and sequential cache with number of requests = 1000

The tabulated results show that parallel cache management with multithreading outperforms sequential cache management for large numbers of IO requests. The parallel cache takes longer to process small numbers of requests because of the overhead of dividing requests into read hits and read misses.

5 Conclusion

A system that manages GPU memory as a buffer cache in the operating system has been implemented. Data encryption secures the data residing in the GPU, with the Twofish algorithm used for encryption and decryption. With carefully designed data structures, such as a concurrent hash table, a least recently used list, and a log-structured data store, the system achieves good performance under various workloads.