What You Will Learn in This Chapter

This chapter introduces GPU-accelerated image processing in ImageJ/Fiji. The reader is expected to have some pre-existing knowledge of ImageJ Macro programming; core concepts such as variables, for-loops, and functions are essential. The chapter provides basic guidelines for improving the performance of typical image processing workflows. In a step-by-step tutorial, we show how to translate a pre-existing ImageJ macro into a GPU-accelerated macro.Footnote 1

5.1 Introduction

Modern life science increasingly relies on microscopic imaging followed by quantitative bioimage analysis (BIA). Nowadays, image data scientists join forces with artificial intelligence researchers, incorporating more and more machine learning algorithms into BIA workflows. Even though general machine learning and convolutional neural networks are not new approaches to image processing, their importance for life science is increasing.

As their application is now at hand due to the rise of advanced computing hardware, namely graphics processing units (GPUs), a natural question is whether GPUs can also be exploited for classic image processing in ImageJ (Schneider et al., 2012) and Fiji (Schindelin et al., 2012). As an alternative to established acceleration techniques, such as ImageJ’s batch mode, we explore how GPUs can be exploited to accelerate classic image processing. Our approach, called CLIJ (Haase et al., 2020), enables biologists and bioimage analysts to speed up time-consuming analysis tasks by adding support for the Open Computing Language (OpenCL) for programming GPUs (Khronos-Group, 2020) in ImageJ. We present a guide for transforming state-of-the-art image processing workflows into GPU-accelerated workflows using the ImageJ Macro language. Our suggested approach neither requires profound expertise in high-performance computing nor learning a new programming language such as OpenCL.

To demonstrate the procedure, we translate a formerly published BIA workflow for examining signal intensity changes at the nuclear envelope, caused by cytoplasmic redistribution of a fluorescent protein (Miura, 2020). We then introduce ways to discover CLIJ commands as counterparts of classic ImageJ methods. These commands are then assembled to refactor the pre-existing workflow. In the context of image processing, refactoring means restructuring an existing macro without changing its measurement results, while improving processing speed. Accordingly, we show how to measure workflow performance. We also give an insight into quality assurance methods, which help to ensure good scientific practice when modernizing BIA workflows and refactoring code.

5.2 The Dataset

5.2.1 Imaging Data

Cell membranes create functional compartments and maintain diverse content and activities. Fluorescent labeling techniques allow the study of certain structures and cell components, in particular to trace dynamic processes over time, such as changes in intensity and spatial distribution of fluorescent signals. Live-cell imaging, in the form of long-term time-lapse recordings, is important when studying dynamic biological processes. As a representative dataset for this domain, we process a two-channel time-lapse showing a HeLa cell with increasing signal intensity in one channel (Boni et al., 2015). The dataset has a pixel size of 0.165 \(\mu \)m per pixel and a frame delay of 400 s. The nuclei channel (C1), excited with 561 nm wavelength light, contains Histone H2B-mCherry signals within the nucleus. The protein channel (C2), excited with 488 nm wavelength light, represents the distribution of the cytoplasmic Lamin B protein, which accumulates at the inner nuclear membrane (Lamin B receptor signal). Four example time points of the dataset are shown in Fig. 5.1.

Fig. 5.1

Samples of the dataset used in this chapter: Time points 1, 5, 10 and 15, showing the signal increase in the nuclear envelope of a cell. Courtesy: Andrea Boni, EMBL Heidelberg/Viventis

5.2.2 The Predefined Processing Workflow

To measure the changing intensities along the nuclear envelope, a corresponding region of interest (ROI) must be defined within the image. First, the image is segmented into nucleus and background. Second, a region surrounding the nucleus is derived.

A starting point for the workflow translation is the code_final.ijm macro file published by Miura (2020).Footnote 2 For the reader’s convenience, we have added some explanatory comments for each section of the original code:

figure a
figure b

5.3 Tools: CLIJ

Available as an optional plugin, CLIJ brings GPU-accelerated image processing routines to Fiji. CLIJ is installed via the Fiji updater, which can be found in the menu Help> Update, by activating the update sites clij and clij2, as shown in Fig. 5.2. Depending on GPU vendor and operating system, installation of additional GPU drivers might be necessary. In some cases, the default drivers delivered by automated operating system updates are not sufficient.

Fig. 5.2

Installation of CLIJ: In Fiji’s updater, which can be found in the menu Help> Update..., click on Manage Update Sites, and activate the checkboxes next to clij and clij2. After updating and restarting Fiji, CLIJ is installed

After installing CLIJ, it is recommended to execute a CLIJ macro to test for successful installation. We can also use this opportunity to get a first clue about a potential speedup of a CLIJ method compared to its ImageJ counterpart. The following example macro processes an image using both methods, and writes the processing time into the log window, as shown in Fig. 5.3.

Fig. 5.3

Output of the first example macro, which reports processing time of a CLIJ operation (first line), and of the classic ImageJ operation (second line). When executing a second time (right), the GPU typically becomes faster due to the so-called warm-up effect

figure c
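The listing itself is not reproduced here; a minimal sketch of such a test macro, following the CLIJ2 macro API, might look as follows. The example image, the 3×3×3 filter size, and the variable names are illustrative choices, not necessarily those of the original listing:

```
// open an example image stack
run("T1 Head (2.4M, 16-bits)");
input = getTitle();

// initialize the GPU
run("CLIJ2 Macro Extensions", "cl_device=");
Ext.CLIJ2_clear();

// mean filter on the GPU, including push and pull
time = getTime();
Ext.CLIJ2_push(input);
blurred = "blurred";
Ext.CLIJ2_mean3DBox(input, blurred, 3, 3, 3);
Ext.CLIJ2_pull(blurred);
print("CLIJ2 mean filter took " + (getTime() - time) + " ms");

// the same mean filter on the CPU with classic ImageJ
selectWindow(input);
time = getTime();
run("Mean 3D...", "x=3 y=3 z=3");
print("ImageJ mean filter took " + (getTime() - time) + " ms");

// clean up GPU memory
Ext.CLIJ2_clear();
```

Running this macro twice makes the warm-up effect mentioned in Fig. 5.3 visible.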

5.3.1 Basics of GPU-Accelerated Image Processing with CLIJ

Every ImageJ macro that uses CLIJ functionality needs to contain some additional code sections. For example, this is how the GPU is initialized:

figure d
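Following the CLIJ2 macro API, the initialization section typically consists of these two lines:

```
run("CLIJ2 Macro Extensions", "cl_device=");
Ext.CLIJ2_clear();
```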

In the first line, the parameter cl_device can stay blank, meaning that CLIJ will automatically select an OpenCL device, namely the GPU. One can specify the name of the GPU in brackets, for example nVendor Awesome Intelligent. If only a part of the name is specified, such as nVendor, CLIJ will select a GPU whose name contains that part. Available GPU devices can be explored via the menu Plugins> ImageJ on GPU (CLIJ2)> Macro tools> List available GPU devices. The second line in the example shown above cleans up GPU memory. This command is typically called at the end of a macro. It is not mandatory to call it at the beginning; however, it is recommended while elaborating a new ImageJ macro: a macro under development every now and then stops unintentionally with an error message, so it is not executed to the very end, where GPU memory typically gets cleaned up. Writing this line at the beginning therefore ensures that the macro starts from a predefined, empty state.

Another typical step in CLIJ macros is to push image data to the GPU memory:

figure e
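This step typically reads as follows (a sketch; the variable name input is taken from the explanation below):

```
input = getTitle();
Ext.CLIJ2_push(input);
```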

We first retrieve the name of the current image by using ImageJ’s built-in getTitle()-command, and save it into the variable input. Afterwards, the input image is stored in GPU memory using CLIJ’s push method.

This image can then be processed, for example using a mean filter:

figure f
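A sketch of this call, with an illustrative result name and half-axis lengths of 3 pixels:

```
blurred = "blurred";
Ext.CLIJ2_mean3DBox(input, blurred, 3, 3, 3);
```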

CLIJ’s mean filter, applied to a 3D image, takes a cuboidal neighborhood into account, as specified by the word Box. It takes five parameters: the input image name and the result image name, both given as variables, and three half-axis lengths describing the size of the box. If the variable for the result is not set, an automatically generated image name is assigned to it.

Finally, the result image is pulled back from GPU memory and displayed on the screen:

figure g
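A sketch of this final section, using the illustrative image name from above:

```
Ext.CLIJ2_pull(blurred);
Ext.CLIJ2_clear();
```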

Note that result images are not shown on the screen until the pull() command is explicitly called. Thus, the computer screen is not flooded with numerous image windows, helping the workflow developer to stay organised. Furthermore, GPU memory is cleaned up by the clear() command, as explained above.

While developing advanced CLIJ workflows, it might be necessary to take a look into GPU memory to figure out which images are stored at a particular moment. To this end, we can add another command just before the final clear()-command, which will list the images in GPU memory in the log window, as shown in Fig. 5.4:

figure h
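In the CLIJ2 macro API, this is a single command:

```
Ext.CLIJ2_reportMemory();
```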
Fig. 5.4

List of images currently stored in GPU memory: In this case, there exists an image called t1-head-3.tif which corresponds to the dataset we loaded initially. Furthermore, there is another image, called CLIJ2_mean3DBox_result3, containing the result of the mean filter operation

As an intermediate summary, CLIJ commands in ImageJ macros typically appear as follows:

figure i
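Schematically (operationName and the parameters are placeholders, not a concrete CLIJ method):

```
Ext.CLIJ2_operationName(input_image, output_image, parameter1, parameter2);
```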

All CLIJ methods start with the prefix Ext., a convention of classic ImageJ indicating that we are calling a macro extension, optionally installed to ImageJ. Next follows CLIJ_, CLIJ2_ or CLIJx_, then the specific method name and, in brackets, the parameters passed to this method. Parameters are typically given in the order: input images, output images, other parameters.

The CLIJ identifier was introduced for methods originally published with the CLIJ toolbox (Haase et al., 2020). It is deprecated since the official stable release of CLIJ2, the extended edition of CLIJ including more operations. Furthermore, there is CLIJx, the volatile experimental sibling, which constantly evolves as developers work on it. Thus, CLIJx methods should be used with care, as the x stands for eXperimental. Whenever possible, the latest stable release should be used. As soon as a new stable release is out, the former one is deprecated; deprecated releases are kept available for at least one year. To allow a convenient transition between major releases, the CLIJ developers strive for backwards compatibility between releases.

5.3.2 Where CLIJ Is Conceptually Different and Why

When designing the CLIJ application programming interface (API), special emphasis was put on a couple of aspects to standardize and simplify image processing.

  • Results of CLIJ operations are by default not shown on the screen. One needs to pull the image data from GPU memory to display them in an ImageJ window. In order to achieve optimal performance, it is recommended to execute as many processing steps as possible between push and pull commands, and to pull only the final result image. Pushing and pulling take time; this time investment is won back by calling operations which are altogether faster than their classic ImageJ counterparts.

  • CLIJ operations always need explicit specifications of input and output images. The currently selected window in ImageJ does not play a role when calling a CLIJ command. Moreover, no CLIJ command changes its input image. The only exceptions are commands starting with set, which take no input image and overwrite pixels of a given output image. All other commands read pixels from input images and write new pixels into output images, as in the following example:

    figure j
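For instance (a sketch; the image names are illustrative):

```
binary = "binary";
Ext.CLIJ2_thresholdOtsu(blurred, binary);
// the input image 'blurred' remains unchanged;
// the result is written into the output image 'binary'
```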
  • CLIJ operations do not take physical units into account. For example, all radius and sigma parameters are provided in pixel units:

    figure k
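For example (a sketch; image names and the sigma value are illustrative):

```
sigma = 1.5; // given in pixels, not in physical units
Ext.CLIJ2_gaussianBlur2D(input, blurred, sigma, sigma);
```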
  • If a CLIJ method’s name contains the terms “2D” or “3D”, it processes, respectively, two- or three-dimensional images. If the name of the method is without such a term, the method processes images of both types.

  • Images and image stacks in CLIJ are granular units of data, meaning that individual pixels of an image cannot be accessed efficiently. Instead, the GPU processes pixels in parallel, and therefore the whole image at once. Time-lapse data need to be split into image stacks and processed time point by time point.

  • CLIJ methods are granular operations on data. That means, they apply a single defined procedure to a whole image. Independent from any ImageJ configuration, CLIJ methods produce the same output given the same input. Conceptually, this leads to improved readability and maintenance of image processing workflows.

5.3.3 Hardware Suitable for CLIJ

When using CLIJ, it is recommended to use recent GPUs for the best possible performance. Technically, CLIJ is compatible with GPU devices supporting the OpenCL 1.2 standard (Khronos-Group, 2020), which was established in 2011. While OpenCL thus works on GPUs up to 9 years old, GPU devices older than 5 years may not offer processing performance faster than recent CPUs. Thus, when striving for high performance, recent devices should be utilized. When considering new hardware, image-processing-specific aspects should be taken into account:

  • Memory size: State-of-the-art imaging techniques produce 2D and 3D image data up to several gigabytes in size. Depending on the desired use case, it may make sense to utilize GPUs with more memory. Convenient workflow development is possible if a processed image fits about 4–6 times into GPU memory. Hence, when working with images of 1–2 GB in size, a GPU with at least 8 GB of GDDR6 memory should be used.

  • Memory Bandwidth: Image processing is memory-bound, meaning that all operations have in common that pixels are read from memory and written to memory. Reading and writing are the major bottleneck, and thus, GPUs with fast memory access and high memory bandwidth should be preferred. Typically, GDDR6-based GPUs have memory bandwidths larger than 400 GB/s, whereas GDDR5-based GPUs often offer less than 100 GB/s. GDDR6-based GPUs may therefore compute image processing results about 4 times faster.

  • Integrated GPUs: For processing big images, a large amount of memory might be needed. At the time of writing, GDDR6-based GPUs with 8 GB of memory are available in the price range of 300–500 EUR, while GPUs with more than 20 GB of memory cost about ten times as much. Despite drawbacks in processing speed, it may therefore also make sense to use integrated GPUs with access to large amounts of DDR4 memory.

5.4 The Workflow

5.4.1 Macro Translation

The CLIJ Fiji plugin and its individual operations were programmed so that ImageJ users easily recognise well-known concepts when translating workflows, and can use CLIJ operations as if they were ImageJ operations. There are some differences, aimed at an improved user experience, which we highlight in this section.

5.4.1.1 The Macro Recorder

The ImageJ macro recorder is one of the most powerful tools in ImageJ. While the user calls specific commands from the menu to process images, it automatically records the corresponding code. The recorder is launched from the menu Plugins> Macros> Record.... The user can also call any CLIJ operation from the menu. For example, the first step in the nucleus segmentation workflow is to apply a Gaussian blur to a 2D image. This operation can be found in the menu Plugins> ImageJ on GPU (CLIJ2)> Filter> Gaussian blur 2D on GPU. When executing this command, the macro recorder will record the following code:

figure l
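The recorded code resembles the following sketch. The device name, the image title, the auto-generated output name, and the sigma values all depend on the machine and on the recorded session:

```
run("CLIJ2 Macro Extensions", "cl_device=[GeForce RTX 2080 Ti]");
// gaussian blur
image1 = "NPCsingleNucleus.tif";
Ext.CLIJ2_push(image1);
image2 = "gaussian_blur-38742";
sigma_x = 2.0;
sigma_y = 2.0;
Ext.CLIJ2_gaussianBlur2D(image1, image2, sigma_x, sigma_y);
Ext.CLIJ2_pull(image2);
```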

All recorded CLIJ commands follow the same scheme: The first line initializes the GPU and explicitly specifies the OpenCL device used for executing operations. The workflow developer can remove this explicit specification, as introduced in Sect. 5.3.1. Afterwards, the parameters of the command are listed and specified. Input images, such as image1 in the example above, are pushed to the GPU to make them available in its memory. Names are assigned to output image variables, such as image2; these names are generated automatically and contain a unique number. The developer is welcome to edit these names to improve code readability. Afterwards, the operation GaussianBlur2D is executed on the GPU. Finally, the resulting image is pulled back from GPU memory to be visualized on the screen as an image window.

5.4.1.2 Fiji’s Search Bar

As ImageJ and CLIJ come with many commands and huge menu structures, a user may not know in which menu a specific command is listed. To search for commands in Fiji, the Fiji search bar is a convenient tool; it is shown in Fig. 5.5a. For example, the next step in our workflow is to segment the blurred image using a histogram-based thresholding algorithm (Otsu, 1979). When entering Otsu in the search field, related commands will be listed in the search result. Hitting the Enter key or clicking the Run button will execute the command as if it was called from the menu. Hence, identical code will be recorded in the macro recorder.

Fig. 5.5

(a) While recording macros, the Fiji search bar helps to find CLIJ commands in the menu. (b) Auto-Completion in Fiji’s script editor supports a workflow developer in finding suitable commands and offers their documentation

5.4.1.3 The Script Editor and the Auto-Complete Function

In the Macro Recorder window, there is a Create button, which opens the Script Editor. In general, it is recommended to record a rough workflow and then switch to the Script Editor to extend the code, configure parameters, and refine the execution order. The Script Editor offers a third way of exploring available commands: the auto-complete function, shown in Fig. 5.5b. Typing threshold opens two windows: a list of commands containing the typed word, and the documentation of the currently selected command. The position of the searched word within the command name does not matter; thus, entering threshold or otsu will both lead to the command thresholdOtsu. By hitting the Enter key, the selected command is auto-completed in the code, for example like this:

figure m
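The auto-completed line looks similar to this sketch, with the placeholder parameters mentioned below:

```
Ext.CLIJ2_thresholdOtsu(Image_input, Image_destination);
```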

The developer can then replace the written parameters Image_input and Image_destination with custom variables.

5.4.1.4 The CLIJ website and API Reference

Furthermore, the documentation window of the auto-complete function links to the API reference section of the CLIJ website,Footnote 3 as shown in Fig. 5.6. The website provides a knowledge base holding a complete list of operations and typical workflows connecting operations with each other. This becomes crucial, for example, when searching for the CLIJ analog of ImageJ’s Particle Analyzer, as there is no such single operation in CLIJ. The website lists typical operations following Otsu thresholding, for example connected component labelling, the core algorithm behind ImageJ’s Particle Analyzer.

Fig. 5.6

The online API reference can be explored using the search function of the internet browser, e.g. for algorithms containing Otsu (left). The documentation of specific commands contains a list of typical predecessors and successors (right). For example, thresholding is typically followed by connected component labelling, the core algorithm behind ImageJ’s Particle Analyzer

Exercise 1

Open the Macro Recorder and the example image NPCsingleNucleus.tif. Type Otsu into the Fiji search bar. Select the CLIJ2 method on GPU and run the thresholding using the button Run. Read in the online documentation which commands are typically applied before Otsu thresholding. Which of those commands can be used to improve the segmentation result?

5.4.2 The New Workflow Routine

While reconstructing the workflow, this tutorial follows the routines of the classic macro, but restructures the execution order of commands to prevent minor issues with pre-processing before thresholding. The processed dataset is four-dimensional, consisting of two spatial dimensions (X and Y), channels, and frames. When segmenting the nuclear envelope in the original workflow, the first operation applied to the dataset is a Gaussian blur:

figure n
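In classic ImageJ macro code, this step reads similar to the following line (the sigma value is illustrative):

```
run("Gaussian Blur...", "sigma=1.50 stack");
```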

The stack parameter suggests that this operation is applied to all time points in both channels, potentially harming later intensity measurements. However, for segmentation of the nuclear envelope in a single time point, this is not necessary. As discussed in Sect. 5.3.2, data of this type are not granular and have to be decomposed into 2D images before applying CLIJ operations. We can use the method pushCurrentSlice to push a single 2D image to GPU memory. Then, a 2D segmentation can be generated, utilizing a workflow similar to the originally proposed one. Finally, we pull the segmentation back as a ROI and perform statistical measurements using classic ImageJ. Thus, the content of the for-loop in the original program needs to be reorganized:

figure o
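The reorganized loop could be structured as in the following sketch. Only the function name nucseg is taken from the text; the loop variables, channel handling, and measurement call are illustrative assumptions:

```
getDimensions(width, height, channels, slices, frames);
for (t = 1; t <= frames; t++) {
	// select the nucleus channel (C1) at the current time point
	Stack.setPosition(1, 1, t);
	Ext.CLIJ2_pushCurrentSlice("nucleus");
	// segment the nuclear envelope on the GPU, pull it back as ROI
	nucseg("nucleus");
	// measure intensity in the protein channel (C2) within that ROI
	Stack.setPosition(2, 1, t);
	run("Measure");
}
```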

The function nucseg takes an image from the nucleus channel and segments its nuclear envelope. Table 5.1 shows translations from original ImageJ macro functions to CLIJ operations.

Table 5.1 Translations of ImageJ macro to CLIJ macro, in the context of the example workflow

While the translation of commands for thresholding is straightforward, other translations need to be explained in more detail, for example the Analyze Particles command:

figure p
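Based on the parameters discussed below, this line reads similar to:

```
run("Analyze Particles...", "size=800-Infinity pixel circularity=0.00-1.00 show=Masks display exclude include clear stack");
```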

The advanced ImageJ macro programmer knows that this line post-processes the thresholded binary image and in fact executes five operations: (1) it identifies individual objects in the binary image, an operation known as connected component labeling; (2) it removes objects smaller than 800 pixels (size=800-Infinity pixel); (3) it removes objects touching the image edges (exclude); (4) it fills black holes in white areas (include); and finally (5) it converts the image back to a binary image (show=Masks). The remaining parameters of the command, circularity=0.00–1.00, display, and clear, are not relevant for this processing step; the stack parameter specifies that the operations are applied to the whole stack, slice by slice. Thus, the parameters specify which operations should be executed, but they are not given in execution order. As explained in Sect. 5.3.2, CLIJ operations are granular: when working with CLIJ, each of the five operations listed above must be called explicitly, and in the right order. This leads to longer code, but also to code that is easier to read and maintain:

figure q
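One possible sequence of CLIJ2 operations implementing these five steps is sketched below. The image names and the upper size limit are illustrative:

```
// (1) connected component labeling
Ext.CLIJ2_connectedComponentsLabelingBox(binary, labels);
// (2) remove objects smaller than 800 pixels
Ext.CLIJ2_excludeLabelsOutsideSizeRange(labels, large_labels, 800, 1000000);
// (3) remove objects touching the image edges
Ext.CLIJ2_excludeLabelsOnEdges(large_labels, labels_in_center);
// (5) convert the label image back to a binary image
Ext.CLIJ2_greaterConstant(labels_in_center, mask, 0);
// (4) fill black holes in white areas
Ext.CLIJ2_binaryFillHoles(mask, filled_mask);
```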

Finally, the whole translated workflow looks as follows.Footnote 4

figure r
figure s

5.4.2.1 Further Optimization

So far, we translated a pre-existing segmentation workflow without changing processing steps, with the goal of replicating results. If processing speed plays an important role, it is possible to optimize the workflow further, accepting that results may differ slightly. To do so, it is necessary to identify code sections with a high potential for optimization. To track down the time consumption of individual code sections, we now introduce three more CLIJ commands:

figure t
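These commands activate and read out CLIJ2's time tracing, sketched here around an arbitrary workflow section:

```
Ext.CLIJ2_startTimeTracing();
// ... the workflow section to be analyzed ...
Ext.CLIJ2_stopTimeTracing();
Ext.CLIJ2_getTimeTracing(time_traces);
print(time_traces);
```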

By including these lines at the beginning and the end of a macro, we can trace the elapsed time of command executions in the log window, as shown in Fig. 5.7. In that way, one can identify the parts of the code where most of the time is spent. In the case of the implemented workflow, connected component labelling emerged as a bottleneck.

Fig. 5.7

An example of printed time traces reveals that (a) connected component labeling takes about 21 ms per slice, whereas (b) binary erosion, dilation, and subtraction of images takes about 1.3 ms per slice

In order to exclude objects smaller than 800 pixels from the segmented image, we need to apply connected component labelling. By skipping this step and accepting a lower segmentation quality, we can achieve faster processing. This leads to a shorter workflow:

figure u

Analogously, such an optimization can also be applied to the classic workflow. When executing the optimized versions of the two workflows, we retrieve different measurements, which are discussed in the following section.

Exercise 2

Start the ImageJ Macro Recorder, open an ImageJ example image by clicking the menu File> Open Samples> T1 Head (2.4M, 16 bit) and apply the Top Hat filter to it. In the recorded ImageJ macro, activate time tracing before calling the Top Hat filter to study what is actually executed when running the Top Hat operation and how long it takes. What does the Top Hat operation do?

5.4.3 Good Scientific Practice in Method Comparison Studies

When refactoring scientific image analysis workflows, good scientific practice includes quality assurance to check whether a new version of a workflow produces identical results, within a given tolerance. In software engineering, this procedure is known as regression testing. Translating workflows from CPUs to GPUs is one such refactoring. In a wider context, other examples are switching major software versions, operating systems, CPU or GPU hardware, or computational environments, such as ImageJ and Python.

Starting from a given dataset, we can execute a reference script to generate reference results. Running a refactored script, or executing a script under different conditions, will deliver new results. To compare these to the reference, we use four strategies, ordered from the simplest to the most elaborate: (1) comparison of mean values and standard deviations; (2) correlation analysis; (3) equivalence testing; and (4) Bland-Altman analysis. For demonstration purposes, we apply these strategies to our four workflows:

  • W-IJ: Original ImageJ workflow;

  • W-CLIJ: Translated CLIJ workflow;

  • W-OPT-IJ: Optimized ImageJ workflow;

  • W-OPT-CLIJ: Optimized CLIJ workflow.

In addition, we will execute the CLIJ macros on four computers with different CPU/GPU specifications:

  • Intel i5-8265U CPU/ Intel UHD 620 integrated GPU;

  • Intel i7-8750H CPU/ NVidia Geforce 2080 Ti RTX external GPU;

  • AMD Ryzen 4700U CPU/ AMD Vega 7 integrated GPU;

  • Intel i7-7920HQ CPU/ AMD Radeon Pro 560 dedicated GPU.

5.4.3.1 Comparison of Mean Values and Standard Deviation

An initial and straightforward strategy is to compare the mean and standard deviation of the measurements produced by the different workflows. If the difference between the mean measurements exceeds a given tolerance, the new workflow cannot be used to study the phenomenon as done by the original workflow. However, if means are equal or very similar, this does not allow us to conclude that the methods are interchangeable: similar mean and standard deviation values are necessary, but not sufficient, to prove method similarity. Results of the method comparison using mean and standard deviation are shown in Table 5.2.

Table 5.2 Mean ± standard deviation of measured signal intensities resulting from the different considered workflows and different CPU/GPU specifications

5.4.3.2 Correlation Analysis

If two methods are supposed to measure the same parameter, they should produce quantitative measurements with high correlation on the same dataset. To quantify the level of correlation, Pearson’s correlation coefficient r can be utilized. When evaluated on our data, r values were in all cases above 0.98, indicating high correlation. Such results are typically visualised by scatter plots, as shown in Fig. 5.8. Again, high correlation is necessary, but not sufficient, for proving method similarity.

Fig. 5.8

Scatter plots of measurements resulting from the original ImageJ macro workflow versus the CLIJ workflow (left), the optimized ImageJ workflow (center), and the optimized CLIJ workflows (right). The orange line represents identity

5.4.3.3 Equivalence Testing

For proving that two methods A and B deliver equal measurements within a given tolerance, statistical hypothesis testing should be used. A paired t-test indicates whether the observed differences are significant; thus, a failed t-test is also necessary, but not sufficient, to prove method similarity. A valid method for investigating method similarity is the combination of two one-sided paired t-tests (TOST). First, we define a lower and an upper limit of tolerable differences between methods A and B, for example ±5%. Then, we apply one one-sided paired t-test to check if measurements of method B are less than 95% of those of method A, and a second one-sided t-test to check if measurements of method B are greater than 105% of those of method A. Comparing the original workflow (W-IJ) to the translated CLIJ workflow (W-CLIJ), the TOST showed that the observed differences are within the tolerance (p-value < 1e-11).

5.4.3.4 Bland-Altman Analysis

Another method for analysing differences between two methods is to determine a confidence interval of the differences, as suggested by Altman and Bland (1983). Furthermore, so-called Bland-Altman plots deliver a visual representation of the differences between methods, as shown in Fig. 5.9. When comparing the original workflow (W-IJ) to the CLIJ version (W-CLIJ), the mean difference is close to 0.4, and the differences between the methods lie within the 95% confidence interval [\(-0.4, 1\)]. The means of the two methods range between 40 and 53. Thus, when processing our example dataset, the CLIJ workflow (W-CLIJ) delivered intensity measurements about 1% lower than the original workflow (W-IJ).

Fig. 5.9

Bland-Altman plots of differences between measurements, resulting from the original ImageJ macro workflow (W-IJ) versus (left) the CLIJ workflow (W-CLIJ), (center) the optimized ImageJ workflow (W-OPT-IJ), and (right) the optimized CLIJ workflows (W-OPT-CLIJ). The dotted lines denote the mean difference (center) and the upper and lower bound of the 95% confidence interval

5.4.4 Benchmarking

After translating the workflow and assuring that the macro executes the right operations on our data, we benchmark it. Benchmarking is a common process for analyzing the performance of algorithms.

5.4.4.1 Fair Performance Comparison

When investigating GPU acceleration of image analysis procedures, it is crucial to obtain a realistic picture of the workflow’s performance. By measuring the processing time of individual operations on GPUs compared to ImageJ operations on CPUs, it was shown that GPUs typically perform faster than CPUs (Haase et al., 2020). However, pushing image data to GPU memory and pulling results back take time; thus, transfer time needs to be included when benchmarking a workflow. The simplest way is to measure the time at the beginning and at the end of the workflow. Furthermore, it is recommended to exclude the time needed to load images from hard drives, assuming that this operation does not depend on whether a CPU or GPU does the processing. The initial time measurement should be inserted right after the open() statement:

figure v
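For example (the variable name is illustrative):

```
// ... open(...); ...
start_time = getTime();
```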

Before saving the results to disc, we measure the time again and calculate the time difference:

figure w
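For example (variable names are illustrative):

```
end_time = getTime();
print("The whole workflow took " + (end_time - start_time) + " ms");
```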

The getTime() method in ImageJ delivers the number of milliseconds since midnight of January 1, 1970 UTC. By subtracting two subsequent time measurements, we can calculate the elapsed time in milliseconds.

5.4.4.2 Warm-up Effects

To ensure reliable results, time measurements should be repeated several times. As shown in Sect. 5.3, the first execution of a workflow is often slower than subsequent runs. The reason is the so-called warm-up effect, related to just-in-time (JIT) compilation of Java and OpenCL code; this compilation takes time. To show the variability of the measured processing times of the original workflow and its CLIJ translation, we executed each of the considered workflows in a loop 100 times. To eliminate interference between subsequently executed workflows, we restarted Fiji after every 100 executions. From the resulting time measurements, we derived a statistical summary in the form of the median speedup factor. Visualized by box plots, this gives an overview of the performance of the four different workflows, executed on four tested systems.Footnote 5

5.4.4.3 Benchmarking Results and Discussion

The resulting overview of the processing times is given in Fig. 5.10. Depending on the tested system, the CLIJ workflow results in median speedup factors between 1.5 and 2.7. These results must be interpreted with care. As shown in Haase et al. (2020), workflow performance depends on many factors, such as the number of operations and parameters, the hardware used, and the image size. When working on small images, which fit into the so-called Level 1 and Level 2 caches of the CPU, CPUs typically outperform GPUs. Other operations perform faster on GPUs, such as convolution or other filters that take neighboring pixels into account. By nature, some operations are hard to compute on GPUs; one such example is connected component labeling. As already described in Sect. 5.4.2, we identified this operation as a bottleneck in the example workflow considered here. Without this operation, the optimized CLIJ workflow performed up to 5.5 times faster than the original. Hence, careful workflow design is key to high performance. Identifying slow parts of a workflow and replacing them with alternative operations becomes routine when processing time is a relevant factor.
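The median speedup factor can be derived from repeated time measurements. As a sketch, with invented example numbers, a median-based comparison could look like this in ImageJ Macro:

```
// hypothetical processing times in ms, five runs each
cpu_times = newArray(1500, 1400, 1450, 1480, 1420);
gpu_times = newArray(900, 500, 520, 510, 505);

// sort; with an odd number of runs, the median is the middle element
Array.sort(cpu_times);
Array.sort(gpu_times);
median_cpu = cpu_times[floor(cpu_times.length / 2)];
median_gpu = gpu_times[floor(gpu_times.length / 2)];

print("Median speedup factor: " + (median_cpu / median_gpu));
```

Using the median rather than the mean makes the summary robust against the warm-up outlier in the first run.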

Fig. 5.10
figure 10

Box plots showing processing times of four different macros, tested on four computers. In the case of the classic ImageJ macro, blue boxes range from the 25th to the 75th percentile of processing time. Analogously, green boxes represent processing times of the CLIJ macro. The orange line denotes the median processing time. Circles denote outliers. In the case of the CLIJ workflow, outliers typically occur during the first iteration, where compilation time causes the warm-up effect

Exercise 3

Use the methods introduced in this section to benchmark the script presented in Sect. 5.3. Compare the performance of the mean filter in ImageJ with its CLIJ counterpart. Determine the median processing time of both filters, including push and pull commands when using CLIJ.

5.5 Summary

The method of live-cell imaging, in particular recording long-term time-lapses with high spatial resolution, is of increasing importance for studying dynamic biological processes. As processing time grows with the size of such data, image processing may become the major bottleneck. In this chapter, we introduced one potential solution for faster processing, namely GPU-accelerated image processing using CLIJ. We also demonstrated a step-by-step translation of a classic ImageJ macro workflow to a GPU-accelerated macro workflow. Clearly, GPU-acceleration is suited for particular use cases. Typical cases are

  • processing of data larger than 10 MB per time point and channel;

  • application of 3D image processing filters, such as convolution, mean, minimum, maximum, Gaussian blur;

  • need for acceleration of workflows which take a significant amount of time, especially if processing takes 10 times longer than loading and saving images;

  • extensive workflows with multiple operations, consecutively executed on the GPU;

  • last but not least, availability of sophisticated GPU hardware with high memory bandwidth, typically using GDDR6 memory.

When these conditions are met, speedup factors of one or two orders of magnitude are feasible. Furthermore, the warm-up effect must be taken into account. For example, if the first execution of a workflow takes ten times longer than subsequent executions, at least 11 images have to be processed to overcome the effect and actually save time. When translating a classic workflow to CLIJ, some refactoring is necessary to follow the concept of processing granular units of image data with granular operations. This also improves the readability of workflows, because operations on images are stated explicitly and in the order of execution. Additionally, the shown methods for benchmarking and quality assurance can also be used in different scenarios, as they are general method comparison strategies. GPU-accelerated image processing opens the door to more sophisticated image analysis in real time. If days of processing time can be saved, it is worth investing the hours required to learn CLIJ.

Solutions to the Exercises

Exercise 1

While applying image processing methods, the ImageJ macro recorder records the corresponding commands. This offers an intuitive way to learn ImageJ Macro programming and CLIJ. After completing this exercise, the recorder should contain code like this:

figure x

It opens the dataset, initializes the GPU, pushes the image to GPU memory, thresholds the image, and pulls the resulting image back to show it on the screen.
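The exact recording depends on the chosen dataset and threshold method; a sketch of what the recorder typically produces, with Otsu thresholding and the Blobs example image as our choices, might look like this:

```
// open the dataset
run("Blobs (25K)");
input = getTitle();

// initialize the GPU
run("CLIJ2 Macro Extensions", "cl_device=");
Ext.CLIJ2_clear();

// push the image to GPU memory
Ext.CLIJ2_push(input);

// threshold the image on the GPU
thresholded = "thresholded";
Ext.CLIJ2_thresholdOtsu(input, thresholded);

// pull the resulting image back and show it on the screen
Ext.CLIJ2_pull(thresholded);
```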

The Fiji search bar allows selecting CLIJ methods. The corresponding dialog gives access to the CLIJ website, where the user can read about typical predecessor and successor operations. For example, as shown in Sect. 5.4.1 in Fig. 5.6, operations such as Gaussian blur, mean filter, and Difference-of-Gaussian are listed, which can improve segmentation because they reduce noise.

Exercise 2

The recorded macro, adapted to print time traces, looks like this:

figure y

The traced times while executing the Top Hat filter on the T1-Head dataset are shown in Fig. 5.11. Top Hat is a technique for subtracting background intensity from an image: a minimum filter is applied to the original image, followed by a maximum filter, and the result of these two operations is subtracted from the original. The two filters take about 60 ms each on the 16 MB input image, and the subtraction takes 5 ms. Altogether, the Top Hat filter takes 129 ms.
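A sketch of such a time-tracing macro, assuming the CLIJ2 macro extensions and an arbitrarily chosen Top Hat radius of 10 in all dimensions, could look like this:

```
run("T1 Head (2.4M, 16-bits)");
input = getTitle();

run("CLIJ2 Macro Extensions", "cl_device=");
Ext.CLIJ2_push(input);

// activate time tracing before executing the Top Hat filter
Ext.CLIJ2_startTimeTracing();

result = "top_hat";
Ext.CLIJ2_topHatBox(input, result, 10, 10, 10);

// deactivate time tracing and print the traced times to the log
Ext.CLIJ2_stopTimeTracing();
Ext.CLIJ2_getTimeTracing(time_traces);
print(time_traces);

Ext.CLIJ2_pull(result);
```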

Fig. 5.11
figure 11

While executing the Top Hat filter, activated time tracing reveals that this operation consists of three subsequently applied operations: a minimum filter, a maximum filter and image subtraction

Exercise 3

For benchmarking the mean 3D filter in ImageJ and CLIJ, two example macros are provided online.Footnote 6 We executed them on our test computers and determined median execution times between 1445 and 5485 ms for the ImageJ filter, and between 81 and 159 ms for the CLIJ filter.
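The macros provided online may differ in detail; a minimal benchmark comparing the two mean filters, with an arbitrarily chosen radius of 3, could be sketched like this (for median times, the measurements would be repeated in a loop):

```
run("T1 Head (2.4M, 16-bits)");
input = getTitle();

// CLIJ mean 3D filter on the GPU, including push and pull
run("CLIJ2 Macro Extensions", "cl_device=");
start_time = getTime();
Ext.CLIJ2_push(input);
result = "mean_gpu";
Ext.CLIJ2_mean3DBox(input, result, 3, 3, 3);
Ext.CLIJ2_pull(result);
print("CLIJ mean filter: " + (getTime() - start_time) + " ms");

// ImageJ mean 3D filter on the CPU, applied to the original image
selectWindow(input);
start_time = getTime();
run("Mean 3D...", "x=3 y=3 z=3");
print("ImageJ mean filter: " + (getTime() - start_time) + " ms");
```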

Take-Home Message

In this chapter you learned how a classic ImageJ macro can be translated into a GPU-accelerated CLIJ macro. Image processing on a CPU might become time-consuming, especially when processing large datasets such as complex time-lapse data. Therefore, it is important to rethink parts of the workflow and to speed it up by forwarding processing tasks to a GPU. For optimal exploitation of the computing power of GPUs, it is recommended to process data time-point by time-point rather than applying filters to the whole time-lapse at once. Furthermore, we introduced good-scientific-practice strategies for benchmarking and for quantitatively comparing results between an original and a GPU-accelerated workflow, to assure that the GPU-accelerated workflow delivers equal measurement results within a given tolerance.