IOSIG: Declarative I/O-Stream Properties Using Pragmas

Gholami, Masoud; Schintke, Florian

doi:10.1007/s13222-022-00419-w

IOSIG: Declarative I/O-Stream Properties Using Pragmas

Schwerpunktbeitrag
Open access
Published: 20 June 2022

Volume 22, pages 109–119, (2022)
Cite this article

Download PDF

You have full access to this open access article

Datenbank-Spektrum Aims and scope Submit manuscript

IOSIG: Declarative I/O-Stream Properties Using Pragmas

Download PDF

1581 Accesses
1 Citation
Explore all metrics

Abstract

Non-functional properties of IO-streams are typically not specified and passed to the used middleware or operating system. Knowing properties such as the expected access pattern, reliability, and visibility for data would allow a better storage resource selection by the infrastructure and thus could improve overall performance. With pragma annotations, we let developers declare the intended data access for their file descriptors and automatically redirect the corresponding input/output to the most appropriate storage device of the user’s system. Our automatic resource selection lets applications leverage system storage capabilities better, often delivers improved performance (in some cases by orders of magnitude), and relieves uninformed users from having to make the best choice manually.

Summary of I/O Concepts

The Adaptable IO System (ADIOS)

A survey on the evolution of stream processing systems

Article Open access 22 November 2023

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

An HPC (High-Performance Computing) system or cluster contains a hierarchy of different storage devices with various characteristics. From the locally available volatile ramdisk and non-volatile RAM, SSD (Solid State Drive), HDD (Hard Disk Drive), and RAID system [35, 37] of the nodes to the shared network filesystems such as Ceph [43], Lustre [29], or NFS [7]. Each of these options provides particular properties that suit them for distinct use cases. Some storage options provide redundancy and erasure coding (RAID, ZFS [6], Ceph) to ensure the reliability of data, while some other options prefer performance to reliability (XFS [42], ramdisk).

From the application’s perspective, some data should be kept persistent, while other data are needed temporarily and locally or are written for archival reasons but require global visibility. Persistent data may need redundancy or has to be stored tamperproof. Access to input/output data may be random or sequential.

Applications usually do not bother to deal with the complexity of choosing the right storage device for their specific use cases and requirements. They also do not reveal their intended use to the operating system or middleware. Hence, their input/output (short I/O or even IO) activity does not match well with their requirements on reliability, performance, and cost. The application developers have insight into the application’s IO behavior/demand, though, they are unaware of the underlying IO infrastructure on the user’s side. Hence, the IO destinations are usually chosen by the uninformed user regarding the IO demand/infrastructure.

We developed IOSIG, a GCC plug-in, to specify non-functional properties of IO. IOSIG then transparently redirects IO activities to the optimal storage option at compile-time considering the particular storage device characteristics of a system (Sect. 2). We define a notion to declaratively specify non-functional IO properties for file descriptors using pragmas (Sect. 3). We recognize available block devices and their properties via a config file and/or an automated process (Sect. 4). We introduce an algorithm to derive the cost of different block devices considering several parameters to guide the later resource selection in Sect. 5. We inject code to redirect IO to the device that matches the specified non-functional IO properties best (Sect. 6) and demonstrate the gained performance improvements in several usage scenarios in the evaluation (Sect. 7).

2 Design

We provide a mechanism to describe the IO streams of an HPC application by defining a set of parameters using C/C++ pragma notions. A pragma notion describes the file descriptor below it (see Fig. 1). The file descriptors can be described using several parameters to declare whether its IO activity resembles more a random or a sequential pattern, whether the data is temporary or persistent, and many other aspects of the IO stream. Sect. 3 elaborates on different parameters that can be defined using pragma notions.

On the system side, we need to detect the visible storage devices of the system along with their properties. This is done by an automated process looking for visible block devices and fetching their properties as much as possible, combined with a configuration file providing information that cannot be detected automatically (see Sect. 4 for more details). To consider reliability, we use the Mean Time To Data Loss (MTTDL) [20] metric of storage devices and define it based on the Mean Time To Failure (MTTF) [38] and Bit Error Rate (BER) [12, 33] of the particular storage hardware (SSD, HDD, etc.).

For each IO stream, our algorithm constructs and evaluates performance models for different storage devices, considering the defined parameters for the IO stream and properties of storage devices. Based on the performance models, the optimal storage option is chosen (see Sect. 5). Then, the IOSIG plugin injects code into the HPC library/application at compile-time to redirect its IO streams to the optimal storage devices (Sect. 6).

3 Declaring IO Demand with IO Signatures

Each group of parameters defined by a pragma on top of a file descriptor is denoted as the IO signature and has the pragma namespace sig (see Fig. 1). In a sig pragma, the following parameters can be defined: global or local to determine the required visibility of data. For jobs using multiple machines, a redirected global (shared) IO destination will be the same in all processes/machines of the job. Hence, concurrent IO between processes on the same file remain working with pragmas redirecting shared IO; random or sequential to define whether the IO activity of the file descriptor is dominated by a random or a sequential pattern; persist or temp to indicate whether the data is temporary or not; size-per-IO(<val>) to define the typical size of write/read chunks of the data; totalsize(<val>) to define the total size of the data; MTTDL(<val>) denotes the desired reliability of the data in terms of the minimum required Mean Time To Data Loss of the storage system; availability(<val>) defines the desired maximum probability of losing the data during the job’s lifetime. This is another way of representing reliability and can also be directly computed from MTTDL when the job duration is known; archive to indicate that the data is mainly written for archival purposes and seldom read. A low-performance but durable storage device would be fine; tamperproof to indicate the data should not be altered later and should be protected, e.g., with ROM storage or blockchain (‘archive’ and ‘tamperproof’ are not the main focus of this paper and the evaluation).

While the behavior of a file descriptor can be described with a sig pragma notion, there may be cases where several file descriptors in a library or application share the same or similar behaviors. For this case, another pragma namespace sigdef can be used to define a signature and give it a name. Then, each sig pragma can refer to the name of a defined signature and may even overwrite some parameters of the signature for its file descriptor as shown in Fig. 2.

4 Characterizing Storage Capabilities

To expose the system’s visible storage devices and their properties to our tool, we model the storage systems as illustrated in Fig. 4. We consider storage systems of a machine on the granularity of locally established mount points. A mount point has a block device (simple or complex), a file system, and possibly a label (specifying mount points for archive, tamperproof data, etc.). A file system has a block size, a maximum file size, and an ECC level (Error-Correcting Code) indicating the maximum number of correctable bit errors per sector. On the other hand, a block device has its input and output bandwidths, latency, capacity, and path (on /dev). A block device could be formatted with a filesystem or not. The MTTDL of a file on a block device depends on the file size, the device’s characteristics, and maybe the used filesystem. Moreover, a block device could be simple, complex, or a partition.

A simple block device points directly to an SSD, HDD, RAM, etc. while a complex one is composed of multiple sub-devices, e.g., a RAID system containing two or more other block devices, or a block device with a network filesystem operating on block devices of multiple storage nodes. A partition of a block device has an owner block device (the owner cannot be another partition). Fig. 3 shows a sample of visible system block devices according to the output of the lsblk command. It represents a RAID 0 storage system composed of two partitions on two disks (simple block devices) along with two normal partitions.

Higher-level storage systems such as NFS, Lustre, or Ceph can be taken into account as well and are modeled as complex block devices. The properties of such devices are determined by our plugin using benchmarks, Linux, or system-specific utilities. This way, properties of these systems such as the number of disks and disk models backing such a storage system, the actually used replication degree, sustained IO performance via the network, etc. (see also Sect. 4.2) could be determined.

4.1 Deriving MTTDL of Storage Systems

A simple block device (HDD, SSD, etc.) has a Mean Time To Failure (MTTF) addressing the failure frequency of the device (e.g., HDD operational failures [12, 13], SSD chip failures [3, 32], etc.), and a Bit Error Rate (BER) [12]. Bit errors in sectors can be corrected by the error-correcting code (ECC) of the file system if they do not exceed the maximum correctable errors per sector. Otherwise, they are uncorrectable. The Uncorrectable Bit Errors Rate (UBER) is computed as follows [33]:

$$\mathit{UBER}=\frac{\sum_{n=E+1}^{N}\binom{N}{n}\cdot\mathit{BER}^{n}\cdot(1-\mathit{BER})^{N-n}}{N-E}$$

(1)

where $N$ is the sector size and $E$ is the maximum number of correctable errors per sector. For a complex block device consisting of multiple block devices (e.g., disks) MTTDL is defined by constructing Markov models [17] according to the internal (graph-)structure of the complex block device (e.g., Ceph network of storage nodes, Lustre, RAID‑0, RAID‑1, etc.) and using MTTF, BER, and MTTDL of the sub-devices in the Markov model. For instance, the MTTDL value for MDS (Maximum Distance Separable) erasure codes tolerating $m$ fails out of $m+n$ drives (e.g., XOR [22, 35], Reed-Solomon [11], etc.), considering the MTTF, UBER, and Mean Time To Repair (MTTR) of the disks (sub-devices) is computed as follows [23]:

$$\mathit{MTTDL}=\frac{\mu^{m}}{d(d-1)(d-2)\ldots(d-m)\lambda^{m}(\lambda+h\mu)}$$

(2)

where $d=m+n$ (total number of disks), $\lambda=1/\mathit{MTTF}$, $\mu=1/\mathit{MTTR}$, and $h=\mathit{size\cdot UBER}$. Replication with $k$ replicas could be imagined as a special case of the $n+m$ scheme with $n=1$ and $m=k-1$. Eq. 2 can be used to define MTTDL of different reliability schemes (erasure codes, replication) performed by network file systems such as Ceph, NFS, and Lustre on a network of storage nodes as well as locally operated reliability on multiple local disks such as RAID and ZFS [6]. For MTTDL values of non-MDS codes refer to [20, 23]. A file on a simple block device (single disk) with no redundancy/parity/ECC is not considered as reliable. For a file on a complex block device (or its partitions), MTTDL is computed using MTTF and BER of its sub-devices according to Eq. 2. Another way of representing reliability is the probability of losing data during a given time (e.g., job’s lifetime). This is directly defined using the MTTDL of the file and by the cumulative distribution function (CDF) [34, Sect. 1.3.6.2. Related Distributions].

4.2 Gathering Device Characteristics

Fig. 4 shows parameters of block devices, mount points, and file systems. Some parameters can be determined automatically and some of them require information from the user (configuration file). We determine the parameters using configuration file ($\mathit{MTTF}$, $\mathit{BER}$, $\mathit{volatile}$, $\mathit{global}$/$\mathit{local}$, $\mathit{label}$ (for mount points), $\mathit{max\_filesize}$ (file system), $\mathit{archive}$, $\mathit{tamperproof}$), benchmarking [4, 25, 28] ($\mathit{latency}$, $\mathit{random}$/$\mathit{sequential}$ $\mathit{read}$, and $\mathit{write}$ bandwidths), system inspection ($\mathit{mount\_point}$, $\mathit{device\_path}$, $\mathit{capacity}$, $\mathit{block\_size}$, $\mathit{format}$), and computation ($\mathit{MTTDL}$ and $\mathit{availability}$). However, the user can still overwrite the automatically determined parameters using the configuration file.

5 Matching Demand and Resources

Sect. 3 characterized the IO demand of a job using a set of parameters, and Sect. 4 investigated the capabilities of available mount points in the system. In this section, given the parameters of the job’s IO demand and the capabilities of different mount points, we provide a mechanism to select the optimal mount point as the IO destination that satisfies the desired IO demand and delivers the most throughput. We distinguish between constraint and descriptive parameters for an IO demand specification. A constraint parameter requires the destination to meet a specific condition. For instance, global requires the chosen block device to be globally visible (shared among all processes). Constraint parameters are global, persistent, MTTDL, archive, tamperproof. On the other hand, a descriptive parameter describes the intended IO behavior and usage, e.g., random/sequential, and size-per-IO. To select the optimal storage location, first, the available mount points are filtered by whether they satisfy the given constraint parameters. Then, among the remaining choices, a cost model describing the throughput for the specified workload is evaluated for each choice, and the mount point promising the highest throughput is chosen.

For a simple block device (SSD, HDD, etc.) the maximum read/write bandwidth promised by the manufacturer can be used. For complex block devices (e.g., RAID, Ceph, etc.), the maximum bandwidths can be derived with benchmarks. In addition to the bandwidth, each IO operation has a latency before being performed. There are mainly two types of latencies, the operation- and seek-latency. The operation latency of each IO call occurs due to the delays/instructions of the OS, device, network, etc. Additionally, there will be an average seek latency for HDD in the case of random-access IO. Hence, HDDs have, for a fixed data size, a higher latency overhead with smaller size-per-IO and the achievable throughput is reduced. SSDs have much lower operation latencies compared to HDDs and have negligible seek time. Additionally, SSDs offer much more IOPS (IO operations per second) than HDDs [5]. The maximum IOPS is given by the manufacturer.

We use further parameters to model the performance of storage systems: Each mount point has a specific block size $\mathit{bs}$ and the OS has a page size $\mathit{ps}$. An IO operation of size size-per-IO will be transferred to the storage device in chunks of size $\mathit{cs}=\mathit{max(ps,bs)}$. If the size-per-IO is not a multiple of $\mathit{cs}$ or is smaller than $\mathit{cs}$, then, for the remaining part of data (less than $\mathit{cs}$), the OS reads $\mathit{cs}$ amount of data from the file at the corresponding offset, replaces the modified/inserted data in the chunk and writes it again. Therefore, for each IO operation of size size-per-IO, if size-per-IO is not a multiple of $\mathit{cs}$, the total amount of data written will be $\mathit{ts}=\lceil\textit{size-per-IO}/\mathit{cs}\rceil\times\mathit{cs}$, along with a read of size $\mathit{cs}$ [18]. Otherwise (size-per-IO divisible by $\mathit{cs}$), there will be no read amplification and $\mathit{ts}=\textit{size-per-IO}$. Hence, IO operations of size size-per-IO not divisible to/smaller than $\mathit{cs}$, require some extra reads and writes, and thereby deliver less performance.

The total cost to perform the IO of size size-per-IO is $\mathit{t_{IO}}=\mathit{l}+\mathit{ts}/\mathit{wbw}+\mathit{d}\times\mathit{cs}/\mathit{rbw}$ with $\mathit{l}=\textit{seek latency}+\textit{operation latency}$, $\mathit{d}=(\textit{size-per-IO}\,\,\%\,\,\mathit{cs}\,!\!=\,0)$ (0 if divisible, 1 otherwise), and $\mathit{rbw}/\mathit{wbw}$ the announced maximum read/write bandwidths (regardless of the latency, access pattern, block sizes, and the size-per-IO). We assume $\textit{seek latency}=0$ for sequential IO accesses and for SSD, Ramdisk, etc. With $\mathit{t_{IO}}$ being the duration of performing each IO call, the job could issue at most $1/\mathit{t_{IO}}$ IOs per second (IOPS). Given the maximum IOPS of a device and the fact that by each IO operation, size-per-IO amount of actual data is transferred to the device, altogether, the total throughput (transferred data per second) of a mount point is determined as follows:

$$t=\mathit{min}\left(\frac{1}{\mathit{l}+\Big\lceil\dfrac{\mathit{ds}}{\mathit{cs}}\Big\rceil\dfrac{\mathit{cs}}{\mathit{wbw}}+\mathit{d}\dfrac{\mathit{cs}}{\mathit{rbw}}},\frac{\textit{max-iops}}{\mathit{d}+1}\right)\times\mathit{ds}$$

(3)

where $\mathit{ds}$ is size-per-IO, $\mathit{cs}=\mathit{max(ps,bs)}$ is the block/page size of the mount point/OS, $\mathit{l}$, $\mathit{rbw}$, $\mathit{wbw}$, and max_iops are the average latency ($\textit{seek latency}+\textit{operation latency}$), the maximum read and write bandwidths, and the maximum IOPS of the block device, respectively. The value of $\mathit{d}$ is 0 if $\mathit{ds}$ is divisible to $\mathit{cs}$. Otherwise, $\mathit{ds}=1$, in which case, each write operation performs one additional read operation. Hence, at most max-iops/2 write operations could be performed per second. Sect. 7 shows that this model conforms to our real benchmarks of different storage devices quite appropriately. Eq. 3 is used to select the mount point with the maximum throughput among the remaining mount points after filtering the non-satisfying mount points according to the given constraint parameters.

6 IO Redirection

To let an application use another storage location than intended, we support two methods to redirect IO. One for compilable source code and another for binaries.

6.1 IO Redirection at Compile-Time

To enforce the use of the optimal storage location selected according to Sect. 5 for compilable programs, we inject a piece of code into the application at the place of the defined pragmas at compile-time. Fig. 5 shows the simplified code that is injected into the application at the location of the pragma definition. If the target file does not exist already, a symbolic link [1] pointing to the optimal destination (see Sect. 5) is created with the original filename of the ‘open’ call of the file descriptor. As a result, the IO stream of the file descriptor is automatically redirected to the optimal destination. Finally, the close(fd) call actually closes the file at the optimal destination.

There are several advantages of using pragmas to describe IO and define parameters. The source code remains compilable and executable without the IOSIG plugin. When the plugin is available, the same application can benefit from optimized IO without additional changes. On the other hand, adding some pragmas to an existing application at its expensive IO initializations (calls of open(…), fopen(…), etc.) is a low barrier an application developer or expert user could easily pass.

IOSIG is implemented as a GCC pass plugin. GCC compiles and optimizes a source code by calling a sequence of passes [19, Passes and Files of the Compiler] each responsible for a specific task. First, the parsing pass [19, Parsing pass] transforms a specific source language (C, C++, Fortran, Ada, Go, etc.) to a generic tree representation. Then, the generic representation is transformed into the GIMPLE [31] representation (a representation that can be optimized) [19, Gimplification pass]. Next, several optimization passes (IPA, high-level, low-level) are performed. Finally, the code is translated into the machine language. To interfere with the compilation process (i.e., parse the custom pragma notions, inject redirecting codes), we implemented some extra GCC passes and registered them at the proper positions in the sequence of GCC passes (using the GCC pass manager [19, Pass manager]).

The optimal destination is determined by the IOSIG plugin after lexing the pragma and fetching the defined parameters for the file descriptor. To lex pragmas we created a pragma handler (handling the $\mathit{sig}$ and $\mathit{sigdef}$ pragma notions) that is registered as the $\mathit{PLUGIN\!\_PRAGMAS}$ callback of GCC [19, Registering custom attributes or pragmas]. Our pragma handler uses the GCC lexer to parse the defined pragmas, fetch the parameters, and to generate errors/warnings if necessary. To make the necessary code changes, we implemented a GCC pass that is called by GCC after the gimplification pass and before the optimization passes. Our pass constructs the GIMPLE representation of the codes to be injected. As Fig. 5 shows, an $\mathit{if}$ statement checking the existence of a file, and a $\mathit{symlink}$ call within the $\mathit{if}$ statement is added. The first argument of the $\mathit{symlink}$ call is the optimal destination, which is determined considering the parsed pragma parameters. The second argument is fetched directly from the $\mathit{open(filename,...)}$ call in the source code. The generated GIMPLE code is inserted into the source code’s GIMPLE representation at the pragma’s location. If the $\mathit{symlink}$ function is not declared by the source code, we declare it in GIMPLE so we can use it.

6.2 IO Redirection of Binaries

If the source code or the knowledge to read the code of an application (to add the pragma definitions to it) is not available, we redirect the IO streams of the executable binary by other means. This approach requires the user to know the IO behavior of the application, i.e., the IO destinations of the application and the parameters that describe their desired IO behavior.

We provide an executable of IOSIG that receives a path (non-optimized IO destination of the application) along with different parameters as input (IO behavior description) and creates a symlink at the given path pointing to the optimal destination on the optimal mount point. This preparatory IOSIG executable can be called multiple times to create symlinks for each IO destination of the application. Then, the application is executed as usual and performs its IO without knowing about the created symlinks at its destinations. The symlinks redirect the IO transparent to the application. After execution, a finalizing executable transfers the actual data from the optimal to the desired locations.

While the source code of the application is not needed for this approach, it requires knowledge about the exact IO destinations of the application and the respective IO behaviors to provide the IOSIG parameters. Contrarily, in the case of the pragmas, the probably well-informed developer packs the parameters describing the intended IO into the application’s source code.

7 Tests and Evaluation

Table 1 presents the properties of the system we used for our experiments and Table 2 shows performance properties of the storage systems we evaluate our plugin on (maximum write and read bandwidths, operation- and seek- latencies, maximum IOPS, etc.). The user does not need to provide all this information as we designed a benchmarking mechanism that automatically determines most of the required information about devices. As Table 2 shows, our fastest device is the ramdisk with a limited capacity of up to $1\,\text{GB}$. Files larger than that (see totalsize pragma, Sect. 3) or having the demand for persistent or global storage will be redirected to other devices. The HDD and NFS are the slowest devices, but provide the largest capacities among the other choices. The shared NFS is the only global device in our system. The NVRAM is the fastest persistent device, and Raid 1 provides reliability.

Table 1 System properties

Full size table

Table 2 Properties of storage devices

Full size table

7.1 Analytical Model Evaluation

We choose the optimal storage device for a particular IO demand based on our cost model (Sect. 5). Fig. 6 compares the model of Eq. 3 to the actual costs of writing data chunks for several chunk sizes and different storage devices of Table 2. We use the command-line utility dd [28] to repeatedly generate IO with different chunk sizes (size-per-IO). With doubling chunk sizes between $512\,\text{B}$ to $1\,\text{GB}$, Fig. 6a and b illustrate a good match between our model and the actual experiments in all devices.

Fig. 6a shows for both, the model and the experiments, that the shared NFS delivers higher throughput than the HDD for chunks smaller than $\approx 100\,\text{MB}$, but for chunks larger than that, the HDD delivers better performance. While the HDD suffers from a higher latency than the shared NFS, it can deliver a higher maximum bandwidth (Table 2). Hence, for larger data chunks, the higher bandwidth outweighs the HDD’s higher latency compared to the shared NFS. Our plugin chooses the optimal device in such a scenario according to the given size-per-IO without requiring the user/developer to know about the details of the underlying IO infrastructure (see Sect. 7.3 for a practical use-case).

With linear increments of the chunk size between 2 to $256\,\text{KB}$, Fig. 6c presents the model and experiments of the shared NFS exposing effects of block size on the throughput. We see harmonical variations of the throughput reaching peaks and troughs every $22\,\text{KB}$, the internal blocksize of the device. Such blocksize can be determined by either calling some simple OS commands (e.g., creating an empty file and checking out the consumed storage space), or by performing some benchmarks for more complex cases (e.g., network file systems). We determine the closest block size match for complex block devices automatically by benchmarking and our plugin takes them into account when it chooses the optimal IO destination for a given size-per-IO. Effects of block and page size of $4\,\text{KB}$ can be seen in Fig. 6d–fthat compare model and experiments with chunk sizes between $512\,\text{B}$ to $32\,\text{KB}$ on several devices. Altogether, Fig. 6 confirms our model to predict the IO throughput well enough for finding the optimal destination for an application’s IO in general.

7.2 Workload and Test Scenario

To study our plugin with different access patterns, we developed a sample C/C++ program for image manipulation. We use the simple PGM [36] format that stores, after a short header for width and height of the image, byte-wise linear grayscale values (0–255) for each pixel without complex encoding or compression. We used this format merely to generate IO operations with different access patterns. The data could also be imagined as large matrices filled with values. We perform IO operations synchronized, i.e., the data is flushed to disk for each chunk. This should generate more stable, general, and objective results and reduce potential side effects of different operating systems (cache, memory, data size, etc.). Although, as we determine storage devices’ performance characteristics automatically using benchmarks (Sect. 7.1), our plugin also handles more specific scenarios involving different cache effects of the OS and various system properties. The user can complement or override the values in the configuration file.

IO operations are performed in different chunk sizes (size-per-IO) ranging from several kilobytes to several gigabytes. We will perform write operations with and without seeking (random/sequential) on several storage devices (local, global, persistent, temporary). We will show how our algorithm chooses the storage device for different IO demands and we will present the runtime improvement achieved by using our plugin.

We use three transformations on PGM images, illustrated in Fig. 8, in our test application to generate different access patterns. Posterization (Fig. 8b) writes the results chunkwise into the output file (chunkwise sequential access). Dividing a picture (Fig. 8c) requires strided IO accesses with seeking regularly to specific distances: every $2k$-th data chunk ($k\in\mathbb{N}\geq 0$) is written on the upper half of the file (seek to the offset $k$), and every $(2k+1)$-th data chunk is written on the lower half of the file (seek to the offset $\textit{totalsize}/2+k$). Image shuffle (Fig. 8d) shuffles the chunks of the input image (chunkwise shuffling) by inserting them at (pseudo) random places into the output file (seek to uniformly distributed offsets) to produce random-access IO. After computing each chunk, the chunk is flushed to disk.

7.3 Plugin Evaluation

Fig. 7 shows a template of how our application uses pragma notions. We perform our image transformations and test the effects of different pragma annotations.

Posterization: Chunkwise Sequential Access. Fig. 11a inspects the mentioned tradeoff between the local HDD and the shared NFS devices for an application with and without the pragma notions. The application loads a $6\,\text{GB}$ input image, posterizes it (Fig. 8b), and writes the output file chunkwise (sequential chunks) in $16\,\text{MB}$ to $2\,\text{GB}$ chunks. The corresponding pragma notion inserted for $16\,\text{MB}$ chunks is:

The totalsize pragma excludes the local SSD and NVRAM as storage devices due to capacity reasons, and Ramdisk is not considered due to the persistent pragma. The only devices capable of storing the output are the shared NFS and the local HDD. For chunks smaller than $\approx 100\,\text{MB}$, the optimal device would be the shared NFS, and the local HDD otherwise (see also Fig. 6a). We compare the overall runtime for different chunk sizes with and without pragma annotations when the application intends to use the local HDD vs. the shared NFS (Fig. 11a). With our pragma definitions, the fastest storage device is chosen for each particular chunk size automatically, independent of the initially targeted storage device of the application.

Divided Pixels: Strided Access. We evaluate how pragma parameters redirect the IO when performing regular seeks and writes into the upper and lower partitions of the output image. The size of the input image is $1\,\text{GB}$, and the IO accesses are performed chunkwise ($\textit{size-per-IO}=\textit{chunk}$, write and sync each chunk) in the range of $32\,\text{KB}$ to $32\,\text{MB}$. The destination of the output path is on the local HDD, which undoubtedly delivers poor performance. We aim to show how destinations are chosen for different given pragma notions, and how those improve the application’s lifetime. The inserted pragma notions ps1 to ps4 are listed in Fig. 10.

For $1\,\text{GB}$ storage size, all our devices could be used. As the accesses involve seeking, we use the random property. The parameter set ps1 can only be satisfied by the shared NFS, due to the global parameter. For ps2, global is not required, but the shared NFS could still be chosen as the optimal device. Instead, ps2 requires a minimum MTTDL of 20 years for $1\,\text{GB}$. Our RAID 1 system with 2$\times$SSDs achieves more than 100 years according to our reliability model of Sect. 4.1 (see Sect. 7.4 for details). Without a reliability demand, ps3 needs the data to be stored on a persistent device (persist), which excludes the Ramdisk, but leaves us with NVRAM, which is the fastest persistent device. Finally, if reliability and persistency are not required (ps4), then the Ramdisk will be chosen here.

Fig. 11b confirms the outlined device choices. The application’s lifetime is improved by applying different parameter sets in the pragma notion. Run on HDD without pragmas and in chunks of $512\,\text{KB}$ takes more than 5 minutes. The ps1 pragma (using NFS) improves the lifetime massively ($14379\,\text{ms}$), ps2 (local RAID) improves it further ($2519\,\text{ms}$), ps3 delivers the results within $1448\,\text{ms}$, and ps4 achieves the best performance ($818\,\text{ms}$) by using the Ramdisk.

Shuffled Image: Random Access. For random access involving uniformly distributed seeks of random lengths, Fig. 11c shows similar results compared to the strided IO access for the sets of parameters in Fig. 10. For small chunks, the application faces dramatic performance degradations, especially with the local HDD (as expected). Altogether, we see that the pragma notions successfully redirect IO to the appropriate device considering the IO demands and device characteristics.

The shown experiments used totalsize of $1\,\text{GB}$ and $6\,\text{GB}$ respectively. We also ran experiments with larger sizes and longer application runtime gaining analog results compared to the presented plots. As one can derive from Fig. 11a–c, the runtimes for each measurement point ranges from at least 6 seconds up to more than 16 minutes already for the smaller sizes.

7.4 MTTDL Evaluation

In a Monte Carlo [27] simulation we emulate faulty storage devices to confirm that the MTTDL model of Sect. 4.1 appropriately estimates storage reliability. To simulate a faulty environment, each device faces failures according to its given MTTF and produces bit errors with the given BER. We assume an exponential failure distribution as suggested by [15, 30]. Fig. 12 compares the modeled MTTDL of different RAID systems with the simulated values. An MTTF of 10 years per device is assumed [26]. The bit errors with the rate (BER) of $10^{-6}$ (one bit-error per $128\,\text{KB}$) are corrected by ECC, which fixes up to 2 bit-errors per 512-bit sectors [33]. This results in a UBER value of $\approx 10^{-14}$ (one uncorrectable bit-error per $10\,\text{TB}$) that is often quoted by the HDD and NAND-based storage manufacturers [33]. The recovery is assumed to take one day [20].

The simulations and model of Sect. 4.1 match properly (Fig. 12). As expected, larger data have lower MTTDL (more bit errors). For instance, $1\,\text{TB}$ has an MTTDL of 15 months on RAID5 and 12 years on RAID1. The MTTDL and availability (computed using MTTDL and the job’s lifetime) parameters of our plugin specify minimum values of them for a file.

We used existing models to describe the MTTDL of storage devices to demonstrate that IOSIG is beneficial in this context as well, and guiding parameters are not limited to access patterns.

8 Related Work

Declarative methods have a long tradition in data management to decouple the general aim of a task from implementation details. Most prominently, the SQL query language expresses the desired result of a query but does not specify how to obtain it and thus leaves a database management systems (DBMS) enough room for query optimization, the use and maintenance of indexes, data layout choices, etc. Such decisions typically take the performance characteristics of the system-at-hand into account, often in the form of reasonable assumptions, experience, or performance models.

DBMSs typically optimize IO, for example, by different means such as query optimization, blocking, or maintaining cold, warm, and hot IO on different storage devices [9, 10, 16, 41]. While a DBMS may not know how it will be used, as that depends on its usage, it probably knows some expected characteristics for its own data structures and storage locations, such as transaction logs, hot, warm and cold data. Especially, when the data size is large, the logging overhead can become significant [21]. Versioned NoSQL databases often aim for large, sequential, and append-only IO access to their data structures [39]. Typical IO patterns of data management systems could be characterized using the introduced pragma parameters as follows, showing that the proposed approach also can fit to data management systems:

hot data: small size-per-IO, random, persist
warm data: medium size-per-IO, persist
cold data: large size-per-IO, persist, archive
DB logs/journals: small size-per-IO, sequential (append-only)
DB tx rollback logs: med. to large size-per-IO, sequential (append-only)
DB tables: random (in-place changes), persist
versioned NoSQL: medium to large size-per-IO, sequential (append-only), persist

By annotating the different IO streams of a database system with their characteristics using pragma parameters in advance, the devices to use for them can be chosen wisely—supporting the overall database performance. This location selection is orthogonal to other IO optimizations of typical data management systems.

Declarative specifications are not limited to be beneficial in DBMSs. For example, Stiemer et al. [40] propose to automatically choose and steer the partitioning and replication degree in a Cloud setting based on the application requirements, application properties, and infrastructure properties. This structure resembles our approach to some extend: we describe the application requirements and properties using pragma parameters (Sect. 3), and fetch infrastructure properties using configuration files, benchmarking, system inspection, and computation (see Sect. 4.2). MapReduce [8] is another example to specify data processing in a declarative manner. To improve MapReduce performance and save disk space, Xiong et al. [45] leveraged the ‘hotness’ of data for data placement for Hadoop jobs. In contrast, we consider the existing storage devices and their capabilities on a lower level. Afify et al. [2] introduce an algorithm to specify hot and cold IO patterns from the workload using data mining. Thereupon, the hot data is maintained in main memory and the cold data in a cheaper/slower storage device. Similarly, Won et al. [44] exploit information gained from execution traces and their classification to perform pre-fetching and grouping for improved IO in the context of soft real-time systems. Whereas, we aim for giving developers the capability to pack their knowledge on the expected IO in the source code (i.e., describe the hot/cold IO using parameters) so that it can be considered to select the best fitting storage device. Expressing the desired reliability in terms of MTTDL is another example of declarative specification. For multilevel checkpoint restart, Gholami et al. [14] used MTTDL in a hierarchy of storage devices to select the level for the next checkpoint creation.

To gain performance models and understand IO characterization, IO workload of SSDs were analyzed and investigated by Yagdar et al. [46] confirming read and write amplification as well as the effects of block- and page sizes on the IO throughput. Huang et al. [24] constructed a black-box model for performance prediction of SSDs based on their hardware architecture, using machine learning, and statistical approaches. Our cost model remains independent of specific hardware devices and delivers a general cost estimation in a more abstract way. We aimed to have a general cost model that is accurate enough to select the best fitting storage device among different choices (see Fig. 6).

9 Conclusion

We presented IOSIG, a set of pragma definitions to declare non-functional properties of file access. Our implementation takes these pragmas into account, automatically selects the most appropriate storage device, and redirects corresponding IO to this device transparently for the user. With pragmas, the source code remains compilable also when IOSIG is not available. We offer the functionality for binaries, too. The approach in general is neither limited to the GNU compiler collection nor to the C/C++ programming language. Pragma extensions can also be implemented for LLVM, for example, and pragmas are common also in other compiler-based programming languages such as Fortran, Ada, and Go. Our evaluation showed that applications using IOSIG often achieve better performance (in some cases by orders of magnitude) with more informed resource usage. We also evaluated our plugin with different types of storage devices, such as NFS, NVRAM, and RAID, to demonstrate its flexibility for complex block devices.

References

(2017) symlink(2) – Linux manual page. https://man7.org/linux/man-pages/man2/symlink.2.html. Accessed 01 May 2022
Afify GM, El Bastawissy A, Hegazy OM (2016) Identifying hot/cold data in main-memory database using frequent item set mining. Int J Enhanc Res Manag Comput Appl 5(3):35
Google Scholar
Alter J et al (2019) SSD failures in the field: symptoms, causes, and prediction models. In: Proc. SC. ACM, pp 75:1–75:14 https://doi.org/10.1145/3295500.3356172
Chapter Google Scholar
Axboe J fio – flexible IO tester. https://git.kernel.dk/?p=fio.git. Accessed 01.2021
Bjørling M et al (2013) Linux block IO: introducing multi-queue SSD access on multi-core systems. In: 6th SYSTOR. ACM, pp 22:1–22:10 https://doi.org/10.1145/2485732.2485740
Chapter Google Scholar
Bonwick J, Moore B (2008) ZFS: The last word in file systems. SNIA Software Developers’ Conference. https://www.snia.org/sites/default/orig/sdc_archives/2008_presentations/monday/JeffBonwick-BillMoore_ZFS.pdf. Accessed 01 May 2022
Callaghan B, Pawlowski B, Staubach P (1995) NFS version 3 protocol specification. RFC 1813:1–126. https://doi.org/10.17487/RFC1813
Article Google Scholar
Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113. https://doi.org/10.1145/1327452.1327492
Article Google Scholar
DeBrabant J et al (2013) Anti-caching: a new approach to database management system architecture. Proc VLDB Endow 6(14):1942–1953
Article Google Scholar
Diaconu C et al (2013) Hekaton: SQL server’s memory-optimized OLTP engine. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pp 1243–1254
Chapter Google Scholar
Dongarra J (2008) A scalable checkpoint encoding algorithm for diskless checkpointing. In: Proc. IEEE HASE, pp 71–79 https://doi.org/10.1109/HASE.2008.13
Chapter Google Scholar
Elerath J (2007) Reliability model and assessment of redundant arrays of inexpensive disks (RAID) incorporating latent defects and non-homogeneous poisson process events. Ph.D. thesis, University of Maryland. http://hdl.handle.net/1903/6733. Accessed 01 May 2022
Elerath JG, Pecht MG (2007) Enhanced reliability modeling of RAID storage systems. In: 37th IEEE/IFIP DSN. IEEE Computer Society, pp 175–184 https://doi.org/10.1109/DSN.2007.41
Chapter Google Scholar
Estahbanati MG, Schintke F (2019) Multilevel checkpoint/restart for large computational jobs on distributed computing resources. In: SRDS. IEEE, pp 143–152 https://doi.org/10.1109/SRDS47363.2019.00025
Chapter Google Scholar
Falk M, Hüsler J, Reiss RD (1994) Laws of small numbers: extremes and rare events. Birkhauser, Boston
MATH Google Scholar
Funke F, Kemper A, Neumann T (2012) Compacting transactional data in hybrid OLTP & OLAP databases. arXiv preprint. arXiv, vol 1208.0224
Google Scholar
Gagniuc PA (2017) Markov chains. Wiley https://doi.org/10.1002/9781119387596
Book MATH Google Scholar
Giesen F (2015) Reading and writing are less symmetric than you (probably) think. https://fgiesen.wordpress.com/2015/10/25/reading-and-writing-are-less-symmetric-than-you-probably-think/. Accessed 01 May 2022
GNU Compiler Collection (GCC) Internals. https://gcc.gnu.org/onlinedocs/gccint/. Accessed 01.2021
Greenan KM (2009) Reliability and power-efficiency in Erasure-coded storage systems. Tech. Rep. UCSC-SSRC-09-08. University of California, Santa Cruz
Google Scholar
Groff JR, Weinberg PN, Oppel AJ (2002) SQL: the complete reference vol 2. McGraw-Hill, Osborne
Google Scholar
Gropp WD, Ross RB, Miller N (2004) Providing efficient I/O redundancy in MPI environments. In: PVM/MPI Users’ Group Meeting. LNCS, vol 3241. Springer, pp 77–86 https://doi.org/10.1007/978-3-540-30218-6_17
Hafner JL, Rao K (2006) Notes on reliability models for non-MDS Erasure codes. Tech. Rep. RJ10391 (A0610-035). IBM (https://dominoweb.draco.res.ibm.com/reports/rj10391.pdf)
Google Scholar
Huang HH et al (2011) Performance modeling and analysis of flash-based storage devices. In: 2011 IEEE 27th Symposium on Mass Storage Systems and Technologies (MSST). IEEE, pp 1–11
Google Scholar
IOzone (2016) Filesystem Benchmark. http://www.iozone.org/. Accessed 01 May 2022
Kennedy P (2010) The RAID reliability anthology. https://www.servethehome.com/raid-reliability-failure-anthology-part-1-primer/. Accessed 01 May 2022
Kroese DP et al (2014) Why the Monte Carlo method is so important today. Wiley Interdisciplinary Reviews: Comput. Statistics, vol 6, pp 386–392
Google Scholar
Linux manual page dd(1). https://man7.org/linux/man-pages/man1/dd.1.html. Accessed 01 May 2022
Lustre Wiki (2017) Introduction to Lustre Architecture. http://wiki.lustre.org/images/6/64/LustreArchitecture-v4.pdf. Accessed 01 May 2022
Martino CD et al (2014) Lessons learned from the analysis of system failures at petascale: the case of Blue Waters. In: DSN, pp 610–621 https://doi.org/10.1109/DSN.2014.62
Chapter Google Scholar
Merrill J (2003) GENERIC and GIMPLE: A new tree representation for entire functions. In: Proc. GCC Developers Summit, pp 171–180
Google Scholar
Meza J et al (2015) A large-scale study of flash memory failures in the field. In: Proc. ACM SIGMETRICS. ACM, pp 177–190 https://doi.org/10.1145/2745844.2745848
Chapter Google Scholar
Mielke N et al (2008) Bit error rate in NAND flash memories. In: IEEE Int. Reliability Physics Symposium, pp 9–19 https://doi.org/10.1109/RELPHY.2008.4558857
Chapter Google Scholar
NIST/SEMATECH (2020) e‑Handbook of statistical methods https://doi.org/10.18434/M32189
Book Google Scholar
Patterson DA, Gibson GA, Katz RH (1988) A case for redundant arrays of inexpensive disks (RAID). SIGMOD. https://doi.org/10.1145/50202.50214
Article Google Scholar
Poskanzer J (2016) pgm – netpbm grayscale image format. http://netpbm.sourceforge.net/doc/pgm.html. Accessed 01 May 2022
Puri Ganesh D, Puri Dinesh D, Wackchaure Manoj A (2014) RAID 5 installations on Linux and creating file system. Int J Comput Appl 85(5):43–46. https://doi.org/10.5120/14841-3107
Article Google Scholar
Schroeder B, Gibson GA (2007) Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? In: 5th USENIX FAST, pp 1–16 (http://www.usenix.org/events/fast07/tech/schroeder.html)
Google Scholar
Sevilla Ruiz D, Morales SF, García Molina J (2015) Inferring versioned schemas from NoSQL databases and its applications. In: International Conference on Conceptual Modeling. Springer, pp 467–480
Chapter Google Scholar
Stiemer A, Fetai I, Schuldt H (2016) Analyzing the performance of data replication and data partitioning in the cloud: the BEOWULF approach. In: IEEE BigData. IEEE Computer Society, pp 2837–2846 https://doi.org/10.1109/BigData.2016.7840932
Chapter Google Scholar
Stoica R, Ailamaki A (2013) Enabling efficient OS paging for main-memory OLTP databases. In: Proceedings of the Ninth International Workshop on Data Management on New Hardware, pp 1–7
Google Scholar
Sweeney A et al (1996) Scalability in the XFS file system. In: USENIX ATC, pp 1–14 (http://www.usenix.org/publications/library/proceedings/sd96/sweeney.html)
Google Scholar
Weil SA et al (2006) Ceph: a scalable, high-performance distributed file system. In: 7th OSDI. USENIX Association, pp 307–320 (http://www.usenix.org/events/osdi06/tech/weil.html)
Google Scholar
Won Y et al (2006) Intelligent storage: cross-layer optimization for soft real-time workload. ACM Trans Storage 2(3):255–282. https://doi.org/10.1145/1168910.1168912
Article Google Scholar
Xiong R, Luo J, Dong F (2015) Optimizing data placement in heterogeneous Hadoop clusters. Cluster Comput 18(4):1465–1480
Article Google Scholar
Yagdar G et al (2021) SSD-based workload characteristics and their performance implications. ACM Trans Storage 17(1):1–26
Google Scholar

Download references

Acknowledgements

This work received funding from the German Ministry for Education and Research as BIFOLD–Berlin Institute for the Foundations of Learning and Data (ref. 01IS18025C) and the German Research Foundation (DFG) as part of CRC 1404 FONDA. We thank ZIB’s core facility unit and HPC group for providing compute resources.

Funding

Open Access funding enabled and organized by Projekt DEAL.

Author information

Authors and Affiliations

Zuse Institute Berlin, Berlin, Germany
Masoud Gholami & Florian Schintke

Authors

Masoud Gholami
View author publications
You can also search for this author in PubMed Google Scholar
Florian Schintke
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Masoud Gholami.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Gholami, M., Schintke, F. IOSIG: Declarative I/O-Stream Properties Using Pragmas. Datenbank Spektrum 22, 109–119 (2022). https://doi.org/10.1007/s13222-022-00419-w

Download citation

Received: 31 January 2021
Accepted: 27 May 2022
Published: 20 June 2022
Issue Date: July 2022
DOI: https://doi.org/10.1007/s13222-022-00419-w

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

IOSIG: Declarative I/O-Stream Properties Using Pragmas

Abstract