Introduction

Analysis of neuroimaging data is often computationally demanding. For studies involving functional magnetic resonance imaging (fMRI), voxel-based morphometry (VBM), and diffusion tensor imaging (DTI), it is common to collect data from at least 15 subjects (Friston, Holmes & Worsley, 1999). The size of a single fMRI data set is usually of the order of 64 × 64 × 30 × 200 elements (200 volumes with 30 slices, each containing 64 × 64 pixels), and high-resolution volumes for VBM often consist of 256 × 256 × 128 voxels. While the large amount of data increases statistical power, it also prevents the use of advanced statistical models, since the required calculations can easily take several weeks. This is especially true for extremely large data sets, such as the freely available rest data sets in the 1,000 functional connectomes project (Biswal et al., 2010), which require about 85 GB of storage. With stronger magnetic fields and more advanced sampling techniques for MRI, the spatial and temporal resolution of neuroimaging data will also improve in the near future (Feinberg & Yacoub, 2012), further increasing the computational load.

In this short introductory overview, we will therefore show some examples of how affordable PC graphics hardware, more commonly known as graphics processing units (GPUs), enables the use of more realistic analysis tools. The ability to perform general computations on GPUs makes it possible, for example, to replace traditional parametric methods with nonparametric alternatives, which would otherwise be prohibitively computationally demanding. Nonparametric methods, such as a Monte Carlo permutation test (Dwass, 1957), make fewer assumptions than do parametric ones and are, therefore, applicable over a wider range of data structures. Kimberg, Coslett and Schwartz (2007) summarized the core of our review: “We can adopt the perspective that, for many purposes, parametric statistics are a compromise that we have been forced to live with solely due to the cost of computing. That cost has been dropping steadily for the past 50 years, and is no longer a meaningful impediment for most purposes.”

Another route for escaping the shackles of simple parametric models is to use newly developed, flexible semiparametric models from statistics and machine learning. Although often formally parametric, such models make far fewer restrictive functional and distributional assumptions and, therefore, span a wide array of potential data structures. In this respect, they are similar to nonparametric methods, and we shall refer to both classes of methods as good alternatives to (simple) parametric models. It is common practice to use a Bayesian prior distribution to efficiently regularize these otherwise highly overparametrized models. Bayesian algorithms can be highly computationally demanding, and our review argues that there is a huge potential for using GPUs to speed up Bayesian computations on neuroimaging data.

The review will focus on statistical analysis of fMRI data but will also consider VBM, DTI, and the spatial normalization step required for multisubject studies.

What is a GPU?

A GPU is the computational component of a graphics card used in ordinary computers. The Nvidia GTX 690 graphics card, shown in Fig. 1, contains two GPUs, each consisting of 1,536 processor cores (units that execute program instructions). The physical location of the CPU and two graphics cards in an ordinary PC is shown in Fig. 2. The GPU's large number of cores can be compared with the 4 cores of a typical central processing unit (CPU), which normally performs the calculations in a computer. A GPU core cannot, however, be directly compared with a CPU core. A CPU core is, in general, more powerful, due to a higher clock frequency and a much larger cache memory (which stores data recently read from the ordinary memory). GPUs can be very fast for a limited set of operations, while CPUs can handle a much wider range of applications. The CPU is also better at running code with many if-statements, since it has support for so-called branch prediction.

Fig. 1 The Nvidia GTX 690 graphics card, containing two GPUs and a total of 3,072 processor cores

Fig. 2 The physical location of a CPU and two graphics cards in an ordinary PC. By using two Nvidia GTX 690 graphics cards, the user gets a PC equipped with 6,144 processor cores at the price of about $3,000

Graphics cards were originally designed for computer graphics and visualization. Due to the constant demand for better realism in computer games, the computational performance of a GPU has, during the last 2 decades, increased much more quickly than that of a CPU. The theoretical computational performance can today differ by a factor of ten in favor of the GPU. Graphics cards are also inexpensive, since ordinary consumers must be able to afford them. The Nvidia GTX 690 is one of the most expensive cards and costs about $1,000.

Why use a GPU?

The main motivation for using a GPU is that one can save time or apply an advanced algorithm, instead of a simple one. In medical imaging, GPUs have been used for a wide range of applications (Eklund, Dufort, Forsberg & LaConte, 2012b). Some examples are to speed up reconstruction of data from magnetic resonance (MR) and computed tomography (CT) scanners and to accelerate algorithms such as image registration, image segmentation, and image denoising. Here, we will focus on how GPUs can be used to lower the processing time of computationally demanding algorithms and methods for neuroimaging.

Some disadvantages of GPUs are their relatively small amount of memory (currently 1–6 GB) and the fact that GPU programming requires deep knowledge of the GPU architecture. Consumer GPUs can also have somewhat limited support for calculations in double precision (64-bit floats), although single precision (32-bit floats) is normally sufficient for most image-processing applications. For the latest generation of Nvidia consumer graphics cards (named Kepler), double precision performance can be as low as 1/24 of the single precision performance. For professional Kepler graphics cards, the ratio can instead be 1/3. Professional graphics cards are, however, more expensive; the Nvidia Tesla K20, for example, currently costs about $3,500.

How can a GPU be used for arbitrary calculations?

Initially, a GPU could be programmed only through computer graphics interfaces (e.g., OpenGL, DirectX), which made it hard to use a GPU for arbitrary operations. Despite this, general-purpose computing on GPUs (GPGPU) has been popular for several years (Owens et al., 2007). With the release of the CUDA programming language in 2007, using Nvidia GPUs to accelerate arbitrary calculations has become much easier, since CUDA is very similar to the widely used C programming language. A large number of reports of large speedups, as compared with optimized CPU implementations, have since been presented (Che, Boyer, Meng, Tarjan, Sheaffer & Skadron, 2008; Garland et al., 2008). A drawback of CUDA is that it supports only Nvidia GPUs, while the open computing language (OpenCL) supports a wide range of hardware (e.g., Intel CPUs, AMD CPUs, Nvidia GPUs, and AMD GPUs).
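
To give a flavor of what CUDA code looks like, the following is a minimal, self-contained sketch (not taken from any of the cited implementations) that adds two arrays element-wise, using one GPU thread per element; everything apart from the CUDA runtime calls is ordinary C, and the array size and names are chosen arbitrarily for the example.

```
// Minimal CUDA program: element-wise addition of two arrays.
// Each GPU thread computes one element. The syntax is essentially C
// with a few extensions (__global__ for kernels, <<<...>>> for launches).
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = (float*)malloc(bytes);
    float *h_b = (float*)malloc(bytes);
    float *h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);
    cudaMalloc((void**)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);  // launch about n threads
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %f\n", h_c[0]);  // prints 3.000000
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```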

Why use a GPU instead of a PC cluster?

As compared with a PC cluster, which is often used for demanding calculations and simulations, an ordinary PC equipped with one or several graphics cards has several advantages. First, PC clusters are expensive, while a powerful PC does not need to cost more than $2,000–$3,000 and can be bought “off the shelf.” Second, PC clusters can be rather large and use a lot of energy, while GPUs are small and power efficient. Third, it is hard for a single user to take advantage of the full computational power of a PC cluster, since it is normally shared by many users. On the other hand, a PC cluster can have a much larger amount of memory (although a single user normally cannot use more than a fraction of it). Table 1 contains a comparison between a PC cluster (from 2010) and a regular computer with several GPUs (from 2012). A good PC cluster is clearly a major investment, while a computer with several GPUs can be bought by a single researcher.

Table 1 A comparison between a GPU supercomputer, shown in Fig. 2, and a PC cluster in terms of cost, computational performance, amount of memory, and power consumption

How fast is a GPU?

A GPU uses its large number of processor cores to process data in parallel (many calculations at the same time), while a CPU normally performs calculations in a serial manner (one at a time). The main difference between serial and parallel processing is illustrated in Fig. 3. Multicore CPUs, which today are standard, can, of course, also perform parallel calculations, but most of the software packages used in the field of neuroimaging do not utilize this property. AFNI is one of the few software packages that have multicore support for some functions, through the OpenMP (open multiprocessing) library. For many applications, such as image registration, a hybrid CPU–GPU implementation yields the best performance: the GPU can calculate a similarity measure, such as mutual information, in parallel, while the CPU runs a serial optimization algorithm.

Fig. 3 The main difference between a CPU and a GPU when processing a small image consisting of 16 pixels. The numbers indicate the order in which the pixels are processed: the CPU processes the pixels one by one, while the GPU processes all the pixels at the same time

The performance of a GPU implementation greatly depends on how easy it is to run a certain algorithm in parallel. Fortunately, neuroimaging data are often analyzed in exactly the same way for each pixel or voxel. Many of the algorithms commonly used for neuroimaging are therefore well suited for parallel implementations, while algorithms where the result in one voxel depends on the results in other voxels may be harder to run in parallel.
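
As a concrete illustration of this per-voxel parallelism, the hedged sketch below shows what a CUDA kernel might look like when the same operation, here a simple intensity threshold of a statistical map, is applied independently to every voxel of a volume; the kernel name, the volume dimensions, and the launch configuration are invented for the example.

```
// Hypothetical kernel: each GPU thread handles exactly one voxel and
// applies the same operation, here thresholding a statistical map.
__global__ void threshold_map(const float *stat_map, float *mask,
                              int nx, int ny, int nz, float threshold)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= nx || y >= ny || z >= nz)
        return;                               // outside the volume
    int idx = x + y * nx + z * nx * ny;       // linear voxel index
    mask[idx] = (stat_map[idx] > threshold) ? 1.0f : 0.0f;
}

// Example launch for a 64 x 64 x 33 volume (host side):
//   dim3 block(8, 8, 4);
//   dim3 grid((64 + 7) / 8, (64 + 7) / 8, (33 + 3) / 4);
//   threshold_map<<<grid, block>>>(d_stat, d_mask, 64, 64, 33, 3.0f);
```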

The processing times for some necessary processing steps in fMRI analysis are given in Table 2, for three common software packages (SPM, FSL, and AFNI), an optimized CPU implementation that uses all cores, and a GPU implementation (Eklund, Andersson & Knutsson, 2012a). The fMRI data set used consists of 180 volumes with a resolution of 64 × 64 × 33 voxels. The comparison was performed on a Linux-based computer equipped with an Intel Core i7-3770K 3.5 GHz CPU, 16 GB of memory, an OCZ 128 GB SSD drive, and an Nvidia GTX 680 graphics card with 4 GB of video memory. The comparison is not entirely fair, since the SPM software, for example, often writes intermediate results to file, possibly because an fMRI data set could not fit into the small memory of ordinary computers when the SPM software was created some 20 years ago. The different software packages and the GPU implementation also use different algorithms for motion correction and model estimation. For example, we use a slightly more advanced algorithm for estimation of head motion (Eklund, Andersson & Knutsson, 2010); instead of maximizing an intensity-based similarity measure, the algorithm matches structures such as edges and lines. The comparison nevertheless shows that researchers in neuroimaging can save a significant amount of time by using a multicore CPU implementation that avoids slow write and read operations to the hard drive. The multicore CPU implementation and the GPU implementation perform exactly the same calculations, so even more time can be saved by using one or several GPUs. It should be noted that SPM, FSL, and AFNI are flexible tools that can perform a wide range of analyses and that the mentioned GPU implementations currently handle only a small subset of these.

Table 2 Processing times for three necessary steps in fMRI analysis, for three common software packages, a multicore CPU implementation, and a GPU implementation

This review will not consider any further details about GPU hardware or GPU programming. The interested reader is referred to books about GPU programming (Kirk & Hwu, 2010; Sanders & Kandrot, 2010), the CUDA programming guide, and our recent work on GPU accelerated fMRI analysis (Eklund et al., 2012a). The focus will instead be on some types of methods and algorithms that can benefit from higher computational performance.

Methods and algorithms

Nonparametric statistics

In the field of fMRI, the data are normally analyzed by applying the general linear model (GLM) to each voxel time series separately (Friston, Holmes, Worsley, Poline, Frith & Frackowiak, 1995b). The GLM framework is based on a number of assumptions about the errors—for example, that they are normally distributed and independent. Noise from MR scanners is, however, neither Gaussian nor white; it generally follows a Rician distribution (Gudbjartsson & Patz, 1995) and has a power spectrum that resembles a 1/f function (A. M. Smith et al., 1999). To calculate p-values that are corrected for the large number of tests in fMRI, random field theory (RFT) is frequently used for its elegance and simplicity (Worsley, Marrett, Neelin & Evans, 1992). RFT, however, requires additional assumptions to be met. If any of the assumptions are violated, the resulting brain activity images can contain false-positive “active” voxels or be too conservative to detect true positives. A simplistic model of the fMRI noise can, for example, result in biased or erroneous results, as shown in our recent work (Eklund, Andersson, Josephson, Johannesson & Knutsson, 2012c). RFT is also used for VBM (Ashburner & Friston, 2000) and DTI (e.g., Rugg-Gunn, Eriksson, Symms, Barker & Duncan, 2001), where the objective is to detect anatomical differences in brain structure. Recent work showed that the SPM software can also yield a high proportion of false positives for VBM when a single subject is compared with a group (Scarpazza, Sartori, De Simone & Mechelli, 2013).
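
To make the per-voxel GLM concrete, the sketch below shows one possible way a GPU kernel could estimate the regression coefficients voxel by voxel, under the simplifying assumption of independent errors and with the pseudoinverse of the design matrix precomputed on the CPU; all names and array layouts are chosen for illustration and do not correspond to any particular software package.

```
// Hypothetical per-voxel GLM estimation. The pseudoinverse of the
// design matrix, pinvX (r x t, row major), is precomputed on the CPU,
// so each thread only needs a small matrix-vector product,
//   beta_v = pinv(X) * y_v,
// for its own voxel time series y_v of length t.
__global__ void glm_betas(const float *data,   // n_voxels x t, row major
                          const float *pinvX,  // r x t, row major
                          float *betas,        // n_voxels x r
                          int n_voxels, int t, int r)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;  // voxel index
    if (v >= n_voxels)
        return;
    const float *y = &data[v * t];
    for (int k = 0; k < r; k++) {
        float b = 0.0f;
        for (int i = 0; i < t; i++)
            b += pinvX[k * t + i] * y[i];
        betas[v * r + k] = b;
    }
}
```

In practice, prewhitening of temporally autocorrelated errors and the computation of t- or F-statistics would be added, but the structure, one independent thread per voxel, remains the same.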

To complicate things further, multivariate approaches in neuroimaging (Habeck & Stern, 2010) can yield a higher sensitivity than univariate ones (e.g., the GLM) by adaptively combining information from neighboring voxels. Multivariate approaches are especially popular for fMRI (Björnsdotter, Rylander & Wessberg, 2011; Friman, Borga, Lundberg & Knutsson, 2003; Kriegeskorte, Goebel & Bandettini, 2006; LaConte, Strother, Cherkassky, Anderson & Hu, 2005; McIntosh, Chau & Protzner, 2004; Mitchell et al., 2004; Nandy & Cordes, 2003; Norman, Polyn, Detre & Haxby, 2006) but can also be used for VBM (Bergfield et al., 2010; Kawasaki et al., 2007) and DTI (Grigis et al., 2012). It is, however, not always possible to derive a parametric null distribution for these more advanced test statistics, which is needed to threshold the resulting statistical maps in an objective way.

A nonparametric test, on the other hand, is generally based on a smaller number of assumptions—for example, that the data can be exchanged under the null hypothesis. Permutation tests were proposed for neuroimaging relatively early (Brammer et al., 1997; Bullmore et al., 2001; Holmes, Blair, Watson & Ford, 1996; Nichols & Hayasaka, 2003; Nichols & Holmes, 2002) but are generally limited by the increase in computational complexity. Performing all possible permutations of a data set is generally not feasible; a time series with only 13 samples can, for example, be permuted in more than 6 billion ways. Fortunately, a random subset of all the possible permutations (e.g., 10,000) is normally sufficient to obtain a good estimate of the null distribution. These subset permutation tests are known as Monte Carlo permutation tests (Dwass, 1957) and will here be called random permutation tests. For multivariate approaches to brain activity detection, training and evaluation of a classifier may be required in each permutation, which can be very time consuming. In the work by Stelzer, Chen and Turner (2013), 7 h of computation time was required for a classification-based multivoxel approach combined with permutation and bootstrap. Table 3 gives the processing times for 10,000 runs of GLM model estimation and smoothing for the different implementations. Here, we have also included processing times for a multi-GPU implementation, which uses the four GPUs in the PC shown in Fig. 2.

Table 3 Processing times for 10,000 runs of GLM model estimation and smoothing for a single fMRI data set

Each GPU can independently perform a portion of the random permutations. Clearly, the long processing times of standard software packages prevent easy use of nonparametric tests.

The main obstacle for a GPU implementation of a permutation test is the irregularity of the random permutations, which severely limits the performance. Due to this, only two examples of GPU accelerated permutation tests have been reported (Shterev, Jung, George & Owzar, 2010; van Hemert & Dickerson, 2011). Fortunately, a random permutation of several volumes (e.g., an fMRI data set or high-resolution anatomical volumes for a VBM group analysis) can be performed efficiently on a GPU, if the same permutation is applied to a sufficiently large number of voxels (e.g., 512). In neuroimaging, one normally wishes to apply the same permutation to all voxels, in order to preserve the spatial correlation structure. It is thereby rather easy to use GPUs to speed up permutation tests for neuroimaging.
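
A hedged sketch of this idea is given below: a single permutation vector, shared by all voxels, is applied to every voxel time series before a test statistic is recomputed, so that each thread performs identical, regular work. The correlation-like statistic and all variable names are chosen only for illustration; a real implementation would use the test statistic of the chosen analysis and a reduction step to collect the maximum statistic over voxels.

```
// Hypothetical kernel for one random permutation in a permutation test.
// The same permutation vector (perm, length t) is applied to every
// voxel time series, which preserves the spatial correlation structure
// and keeps the memory accesses regular. Each thread recomputes a
// simple test statistic for its voxel; the maximum over voxels is then
// taken on the host (or with a separate reduction kernel) to build the
// null distribution of the maximum statistic.
__global__ void permute_and_test(const float *data,      // n_voxels x t
                                 const int   *perm,      // length t
                                 const float *regressor, // length t, zero mean, unit norm
                                 float *stat,            // n_voxels
                                 int n_voxels, int t)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n_voxels)
        return;
    const float *y = &data[v * t];
    float sum = 0.0f, sum_sq = 0.0f, dot = 0.0f;
    for (int i = 0; i < t; i++) {
        float yp = y[perm[i]];          // permuted sample
        sum    += yp;
        sum_sq += yp * yp;
        dot    += yp * regressor[i];
    }
    float mean = sum / t;
    float var  = sum_sq / t - mean * mean;
    // Correlation with the (zero-mean, unit-norm) regressor; the small
    // constant avoids division by zero for flat time series.
    stat[v] = dot / sqrtf(var * t + 1e-6f);
}
```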

In an analysis of 1,484 freely available rest (null) data sets (Eklund et al., 2012c), a random permutation test was shown to yield more correct results than the parametric approach used by the SPM software. The main reason is that SPM uses a rather simple model of the GLM errors. Performing 10,000 permutations of 85 GB of data is equivalent to analyzing 850,000 GB of data. Table 4 gives the processing times for 10,000 permutations of 1,484 fMRI data sets for the different implementations. Comparing parametric and nonparametric approaches to fMRI analysis is clearly not possible without the help of GPUs (or a PC cluster). A random permutation test can also, as a bonus, be used to derive null distributions for more advanced test statistics. In our recent work (Eklund, Andersson & Knutsson, 2011a), we took advantage of a GPU implementation to objectively compare activity maps generated by the GLM and by fMRI analysis based on canonical correlation analysis (Friman et al., 2003), a multivariate approach. We have also accelerated the popular searchlight algorithm (Kriegeskorte et al., 2006), making it possible to perform 10,000 permutations, including leave-one-out cross-validation, in 5 min instead of 7 h (Eklund, Björnsdotter, Stelzer & LaConte, 2013).

Table 4 Processing times for 10,000 runs of GLM model estimation and smoothing for 1,484 fMRI data sets

We have here focused on fMRI, but permutation tests can also be applied to VBM data (Bullmore, Suckling, Overmeyer, Rabe-Hesketh, Taylor & Brammer, 1999; Kimberg et al., 2007; Silver, Montana & Nichols, 2011; Thomas, Marrett, Saad, Ruff, Martin & Bandettini, 2009) and DTI data (e.g., as proposed by Smith et al., 2006, and used by Chung, Pelletier, Sdika, Lu, Berman and Henry, 2008, and Cubon, Putukian, Boyer and Dettwiler, 2011). Other nonparametric approaches include jackknifing, bootstrapping, and cross-validation. Biswal, Taylor and Ulmer (2001) used the jackknife to estimate confidence intervals of fMRI parameters, while Wilke (2012) instead used the jackknife to assess the reliability and power of fMRI group analysis. The bootstrap has been applied to fMRI (Auffermann, Ngan & Hu, 2002; Bellec, Rosa-Neto, Lyttelton, Benali & Evans, 2010; Nandy & Cordes, 2007) and DTI (Grigis et al., 2012; Jones & Pierpaoli, 2005; Lazar & Alexander, 2005), as well as VBM (Zhu et al., 2007). GPUs can, of course, also be used to speed up these other nonparametric algorithms (see, e.g., the review by Guo, 2012).

Bayesian statistics

Bayesian approaches are rather popular for fMRI analysis (Friston, Penny, Phillips, Kiebel, Hinton & Ashburner, 2002; Genovese, 2000; Gössi, Fahrmeir & Auer, 2001; see Woolrich, 2012, for a recent overview). A major advantage of Bayesian methods is that they can incorporate prior information in a probabilistic sense and handle uncertainty in a straightforward manner. Bayesian methods are usually the preferred choice for richly parametrized semiparametric models (see the Introduction). Model selection and prediction are also much more straightforward in a Bayesian setting. Calculating the posterior distribution can, however, be computationally demanding if Markov chain Monte Carlo (MCMC) methods need to be applied. Genovese (2000) stated that a day of processing time was required for a single data set, for a simple noise model assuming spatial independence. In the work by Woolrich, Jenkinson, Brady and Smith (2004), fully Bayesian analysis of a single slice took 6 h—that is, about 7 days for a typical fMRI data set with 30 slices. Today, the calculations can perhaps be performed in less than an hour with an optimized CPU implementation. Variational Bayes (VB) can instead be used to derive an approximate analytic expression for the posterior distribution—for example, for estimation of autoregressive parameters for fMRI time series (Penny, Kiebel & Friston, 2003) or for including spatial priors in the fMRI analysis (Penny, Trujillo-Barreto & Friston, 2005). A first problem with VB is that a large amount of work may be required to derive the necessary update equations, a derivation that is often straightforward for MCMC methods. Second, most VB applications assume that the posterior distribution factorizes into several independent factors, in order to obtain analytic updating equations. Third, tractability typically necessitates a restriction to conjugate priors (Woolrich et al., 2004). This restriction can be circumvented by instead using approximate VB.

To our knowledge, an unexplored approach to Bayesian fMRI analysis is to perform calculations with large spatiotemporal covariance matrices, in order to properly model nonstationary relationships in space and time. For example, a neighborhood of 5 × 5 × 5 voxels for 80 time samples can be considered as one sample from a distribution with 10,000 dimensions, rather than 10,000 samples from a univariate distribution. The main problem with such an approach is that the error covariance matrices will be of the size 10,000 × 10,000. GPUs can be used to speed up the inversion of these large covariance matrices, which is required in order to calculate the posterior distribution of the model parameters. To estimate the covariance matrix itself, a first approach can be to use a Wishart prior. A better prior can be obtained by analyzing large amounts of data—for example, the rest data sets in the 1,000 functional connectomes project (Biswal et al., 2010). Such a prior may, however, require MCMC algorithms for inference.
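
As a sketch of how such a computation could be off-loaded to the GPU, the example below uses Nvidia's cuSOLVER library (included in recent CUDA toolkits) to Cholesky-factorize a symmetric positive definite covariance matrix and solve a linear system with it, which is the core operation when evaluating a Gaussian posterior; the function name and the assumption that the matrix already resides in GPU memory are ours, and whether this is fast enough for 10,000 × 10,000 matrices in practice remains to be tested.

```
// Hedged sketch: solve C * x = b for a large covariance matrix C using
// a GPU Cholesky factorization from cuSOLVER (dense, single precision).
// Assumes C is symmetric positive definite and already stored on the GPU.
#include <cusolverDn.h>
#include <cublas_v2.h>
#include <cuda_runtime.h>

void gpu_cholesky_solve(float *d_C, float *d_b, int n)
{
    cusolverDnHandle_t handle;
    cusolverDnCreate(&handle);

    int lwork = 0;
    cusolverDnSpotrf_bufferSize(handle, CUBLAS_FILL_MODE_LOWER,
                                n, d_C, n, &lwork);

    float *d_work;
    int *d_info;
    cudaMalloc((void**)&d_work, lwork * sizeof(float));
    cudaMalloc((void**)&d_info, sizeof(int));

    // C = L * L^T (in place); then solve L * L^T * x = b.
    cusolverDnSpotrf(handle, CUBLAS_FILL_MODE_LOWER,
                     n, d_C, n, d_work, lwork, d_info);
    cusolverDnSpotrs(handle, CUBLAS_FILL_MODE_LOWER,
                     n, 1, d_C, n, d_b, n, d_info);  // x overwrites b

    cudaFree(d_work);
    cudaFree(d_info);
    cusolverDnDestroy(handle);
}
```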

In the field of statistics, there is a growing literature on using GPUs to accelerate statistical inference (see the work by Guo, 2012, for a review of parallel statistical computing in regression analysis, nonparametric inference, and stochastic processes). Suchard, Wang, Chan, Frelinger, Cron and West (2010) focused on how to use a GPU to accelerate Bayesian mixture models. As a proof of concept, we made a parallel implementation of an MCMC algorithm with a tailored proposal density, described by Chib and Jeliazkov (2001); the processing time for the example in their Section 3.1 was reduced from 18 s to 75 ms. The traditionally used MCMC algorithms are sequential and, therefore, not amenable to simple parallelization, except in a few special cases. In fMRI, this can be circumvented by running many serial MCMC algorithms in parallel (Lee, Yau, Giles, Doucet & Holmes, 2010)—for example, one for each voxel time series.
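
The hedged sketch below illustrates this embarrassingly parallel strategy: each GPU thread runs its own random-walk Metropolis chain for the mean of one voxel time series, using CURAND's device-side random number generators so that no data need to be transferred between the CPU and the GPU during sampling. The Gaussian likelihood with known noise standard deviation, the flat prior, the fixed proposal standard deviation, and the omission of a burn-in period are simplifications made only for the example.

```
// Hypothetical per-voxel MCMC: every thread runs an independent
// random-walk Metropolis chain for the mean mu of its own voxel time
// series, assuming y_i ~ N(mu, sigma^2) with known sigma and a flat
// prior. Real models (autoregressive noise, spatial priors) would be
// more involved, but the thread layout would be the same.
#include <curand_kernel.h>

__global__ void metropolis_per_voxel(const float *data,     // n_voxels x t
                                     float *posterior_mean, // n_voxels
                                     int n_voxels, int t,
                                     int n_iter, float sigma,
                                     unsigned long long seed)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n_voxels)
        return;

    curandState rng;
    curand_init(seed, v, 0, &rng);           // one RNG stream per voxel

    const float *y = &data[v * t];
    float mu = 0.0f, accum = 0.0f;

    // Log-likelihood up to a constant: -0.5 * sum((y_i - mu)^2) / sigma^2
    float loglike = 0.0f;
    for (int i = 0; i < t; i++)
        loglike -= 0.5f * (y[i] - mu) * (y[i] - mu) / (sigma * sigma);

    for (int iter = 0; iter < n_iter; iter++) {
        float mu_prop = mu + 0.1f * curand_normal(&rng);  // random-walk proposal
        float loglike_prop = 0.0f;
        for (int i = 0; i < t; i++)
            loglike_prop -= 0.5f * (y[i] - mu_prop) * (y[i] - mu_prop)
                            / (sigma * sigma);
        if (logf(curand_uniform(&rng)) < loglike_prop - loglike) {
            mu = mu_prop;                     // accept the proposal
            loglike = loglike_prop;
        }
        accum += mu;                          // no burn-in, for brevity
    }
    posterior_mean[v] = accum / n_iter;
}
```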

Ferreira da Silva (2011a) implemented a multilevel model for Bayesian analysis of fMRI data and combined MCMC with Gibbs sampling for inference. As was previously proposed, a linear regression model was fitted in parallel for each voxel. Random number generation was performed directly on the GPU, through the freely available CUDA library CURAND, to avoid time-consuming data transportation between the CPU and the GPU (see Ferreira da Silva, 2011b, for more details on the GPU implementation). Processing of a single slice took 452 s on the CPU and 65 s on the GPU. For a data set with 30 slices, this gives a total of 30 min, which still is too long for practical use. The graphics card that was used was somewhat outdated; a more modern card would likely yield an additional speedup of a factor of at least 10, resulting in a processing time of about 3 min, as compared with more than 3.5 h on the CPU.

For DTI, GPUs have been used to accelerate a Bayesian approach to stochastic brain connectivity mapping (McGraw & Nadar, 2007) and a Bayesian framework for estimation of fiber orientations and their uncertainties (Hernandez, Guerrero, Cecilia, Garcia, Inuggi & Sotiropoulos, 2012). This framework normally requires more than 24 h of processing time for a single subject, as compared with 17 min with a GPU. We believe that GPUs are a necessary component to enable regular use of Bayesian methods in neuroimaging, at least for methods that rely on a small number of assumptions.

Spatial normalization

Multisubject studies of fMRI, VBM, and DTI normally require spatial normalization to a brain template (Friston, Ashburner, Frith, Poline, Heather & Frackowiak, 1995a). This image-processing step is generally known as image registration but is often called “normalization” in the neuroimaging literature. A suboptimal registration can lead to artifacts, such as apparent brain activity in the ventricles or spurious differences in brain anatomy. In general, there is no perfect correspondence between an anatomical volume and a brain template (Roland et al., 1997). The spatial normalization step was acknowledged early on as a problem for VBM (Bookstein, 2001), as well as for fMRI (Brett, Johnsrude & Owen, 2002; Nieto-Castanon, Ghosh, Tourville & Guenther, 2003; Thirion, Flandin, Pinel, Roche, Ciuciu & Poline, 2006) and DTI (Jones & Cercignani, 2010; Jones et al., 2002). Another problem is that MR scanners often do not yield absolute measurements, as CT scanners do, but relative ones. A difference in image intensity between two volumes can severely affect the registration performance. This is especially true for registration between T1- and T2-weighted MRI volumes, where the image intensity is inverted in some places (e.g., the ventricles). To solve this problem, one can, for example, take advantage of image registration algorithms that do not depend on the image intensity itself but, rather, try to match image structures such as edges and lines (Eklund, Forsberg, Andersson & Knutsson, 2011b; Heinrich et al., 2012; Hemmendorff, Andersson, Kronander & Knutsson, 2002; Mellor & Brady, 2004, 2005; Wachinger & Navab, 2012). Another approach is to steer the registration through an initial segmentation of brain tissue types; the boundary-based registration algorithm presented by Greve and Fischl (2009) uses such a solution to more robustly register an fMRI volume to an anatomical scan. For DTI, it is instead possible to increase the accuracy by combining several sources of information, such as a T2-weighted volume and a volume of the fractional anisotropy (Park et al., 2003). Additionally, nonlinear registration algorithms with several thousand parameters can often provide a better match between the subject's brain and a brain template than linear approaches, which optimize only a few parameters (e.g., translations and rotations).

While more advanced image registration algorithms increase robustness and accuracy, they often have a higher computational complexity. If a registration algorithm requires several hours of processing time, it has little practical value. Here, the GPU can once again be used to improve neuroimaging studies, by lowering the processing time and thereby enabling practical use of more robust and accurate image registration algorithms. Using GPUs to accelerate image registration is very popular, partly because GPUs can perform translations and rotations of images and volumes very efficiently. Two recent surveys (Fluck, Vetter, Wein, Kamen, Preim & Westermann, 2011; Shams, Sadeghi, Kennedy & Hartley, 2010) mention about 50 publications on GPU accelerated image registration during the last 15 years. By using a GPU, it is not uncommon to achieve a speedup by a factor of 4–20, as compared with an optimized CPU implementation. As an example, Huang, Tang and Ju (2011) accelerated image registration within the SPM software package and obtained a speedup by a factor of 14.
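
To illustrate why image registration maps so well onto a GPU, the sketch below shows a kernel that, for one candidate affine transform, resamples a source volume with trilinear interpolation and accumulates a simple sum-of-squared-differences similarity measure over all voxels. In a real algorithm, an intensity-invariant measure such as mutual information or a phase-based measure would replace the squared difference, and the source volume would typically be bound to a texture so that the hardware performs the interpolation; all names here are illustrative.

```
// Hedged sketch: evaluate a similarity measure for one candidate
// transform. Each thread maps one target voxel through a 3 x 4 affine
// transform, interpolates the source volume trilinearly, and adds the
// squared intensity difference to a global accumulator (which must be
// set to zero before the launch). A serial optimizer on the CPU would
// call this kernel repeatedly with different transform parameters.
__global__ void ssd_affine(const float *target, const float *source,
                           const float *T,      // 3 x 4 affine, row major
                           float *ssd,          // single accumulator
                           int nx, int ny, int nz)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= nx || y >= ny || z >= nz)
        return;

    // Transformed (floating point) coordinates in the source volume.
    float xs = T[0] * x + T[1] * y + T[2]  * z + T[3];
    float ys = T[4] * x + T[5] * y + T[6]  * z + T[7];
    float zs = T[8] * x + T[9] * y + T[10] * z + T[11];

    int x0 = (int)floorf(xs), y0 = (int)floorf(ys), z0 = (int)floorf(zs);
    if (x0 < 0 || y0 < 0 || z0 < 0 ||
        x0 >= nx - 1 || y0 >= ny - 1 || z0 >= nz - 1)
        return;                               // outside the source volume

    float fx = xs - x0, fy = ys - y0, fz = zs - z0;
    #define V(i, j, k) source[(i) + (j) * nx + (k) * nx * ny]
    // Trilinear interpolation of the source intensity.
    float val =
        (1 - fx) * (1 - fy) * (1 - fz) * V(x0,     y0,     z0)     +
        fx       * (1 - fy) * (1 - fz) * V(x0 + 1, y0,     z0)     +
        (1 - fx) * fy       * (1 - fz) * V(x0,     y0 + 1, z0)     +
        fx       * fy       * (1 - fz) * V(x0 + 1, y0 + 1, z0)     +
        (1 - fx) * (1 - fy) * fz       * V(x0,     y0,     z0 + 1) +
        fx       * (1 - fy) * fz       * V(x0 + 1, y0,     z0 + 1) +
        (1 - fx) * fy       * fz       * V(x0,     y0 + 1, z0 + 1) +
        fx       * fy       * fz       * V(x0 + 1, y0 + 1, z0 + 1);
    #undef V

    float diff = val - target[x + y * nx + z * nx * ny];
    atomicAdd(ssd, diff * diff);   // requires compute capability 2.0 or higher
}
```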

Discussion

We have presented some examples of how affordable PC graphics hardware can be used to improve neuroimaging studies. The speed improvements that can be attained by analyzing fMRI data with one or several GPUs have also been documented. The main focus has been nonparametric and Bayesian methods, but we have also discussed how the spatial normalization step can be improved by taking advantage of more robust and accurate image registration algorithms. Another option is to use GPUs to explore the large space of dynamic causal models (Friston, Harrison & Penny, 2003), which can be very time consuming, or to apply nonparametric or Bayesian methods for brain connectivity analysis. An area not covered here is real-time fMRI (Cox, Jesmanowicz & Hyde, 1995; deCharms, 2008; LaConte, 2011; Weiskopf et al., 2003), where simple models and algorithms are often used to keep up with the constant stream of new data.

GPUs can clearly be used to solve many problems in neuroimaging. The main challenge, as we see it, is how researchers in neuroscience and behavioral science can take advantage of GPUs without learning GPU programming. One option is to develop GPU accelerated versions of the most commonly used software packages (e.g., SPM, FSL, AFNI), which would make it easy for users to utilize the computational performance of GPUs. Mathworks recently introduced GPU support in the Parallel Computing Toolbox for MATLAB. Other options for acceleration of MATLAB code include interfaces such as Jacket or GPUmat. For the C and Fortran programming languages, the PGI accelerator model (Wolfe, 2010) or the HMPP workbench compiler (Dolbeau, Bihan & Bodin, 2007) can be used to accelerate existing code; a comparison between such frameworks has been presented by Membarth, Hannig, Teich, Korner and Eckert (2011). There is also a lot of active development of GPU packages for the Python programming language, for example PyCUDA, which are likely to be used by Python neuroimaging packages such as NIPY in the near future. Recently, an interface between the statistical program R and the software packages SPM, FSL, and AFNI was developed by Boubela et al. (2012). Through this interface, preprocessing can be performed with standard established tools, while additional fMRI analysis can be accelerated with a GPU. As an example, independent component analysis was applied to 300 rest data sets from the 1,000 functional connectomes project (Biswal et al., 2010), and the processing time was reduced from 16 to 1.2 h.

To conclude, using GPUs to speed up fMRI analyses that take only a few minutes is unlikely to be worth the hassle and expense for most researchers. The true power of GPUs is that they make it practical to use statistical algorithms that rely on weaker assumptions. GPUs can also be used to take advantage of more robust and accurate algorithms for spatial normalization. Inexpensive PC graphics hardware can thus easily improve neuroimaging studies.