This section illustrates how the cluster and cloud cost models can be used to determine the cost of EEG computations and data storage. For the cluster cost calculations, three hypothetical cluster configurations are used, whose parameters are listed in Table 4. The first, “BUDGET”, configuration is based on entry-level desktop computers. It is assumed that the BUDGET cluster uses the existing network infrastructure at no additional cost, and no replacement or cluster administrator staff costs are planned over the lifespan of the cluster. The second, “NORMAL”, configuration, which is built from more powerful desktop computers, includes its own dedicated cluster network. As before, no replacement, staff or cooling costs are considered. The third, “HIGH_END”, configuration is built from server-grade computers using a dedicated high-speed network and a large-capacity network storage device. Replacement, staff and cooling costs are also included in this configuration.
Table 4 Cost model parameters for three different cluster configurations

Cost of a computing job
Execution cost on a local cluster
The cost of a particular EEG processing job on a cluster infrastructure can be calculated using Eq. (1) after substituting the terms defined in Eqs. (2)-(7). Using the parameters in Table 4, and assuming 100% cluster utilisation and a 4-year lifespan, the costs of one cluster CPU-hour, \({C}_{hour}={C}_{cluster}(N)/(N\cdot 30\cdot 24)\), for the three clusters are \({C}_{hour}^{A}=0.0273\) USD, \({C}_{hour}^{B}=0.0614\) USD, and \({C}_{hour}^{C}=0.6623\) USD. Assuming an EEG experiment with 2 × 30 subjects and 1-h per-subject processing times, the cost of the 60-h total job is thus \({C}_{job}^{A}=60{\cdot C}_{hour}^{A}=1.64\) USD, and similarly \({C}_{job}^{B}=3.68\) USD and \({C}_{job}^{C}=39.74\) USD for clusters B and C, respectively. As shown in Fig. 3, increasing the size of the cluster reduces the overall execution time and, as discussed in Sect. 3.2.1, the cost of the job is independent of the number of cluster nodes used during the computation if subject processing can be performed independently.
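For reproducibility, a minimal sketch of this calculation is given below; the monthly cluster cost \({C}_{cluster}(N)\) is treated as an input derived from Table 4, and the single hourly rate shown in the example is the Cluster A value quoted above.

```python
# Minimal sketch of the per-CPU-hour and per-job cost calculation at 100%
# utilisation. The monthly cluster cost is an input that follows from Table 4
# and Eqs. (1)-(7); it is not reproduced here.

HOURS_PER_MONTH = 30 * 24  # CPU-hours available per node per month

def cpu_hour_cost(monthly_cluster_cost: float, n_nodes: int) -> float:
    """C_hour = C_cluster(N) / (N * 30 * 24), assuming 100% utilisation."""
    return monthly_cluster_cost / (n_nodes * HOURS_PER_MONTH)

def job_cost(job_hours: float, c_hour: float) -> float:
    """Cost of a job requiring job_hours CPU-hours in total."""
    return job_hours * c_hour

# Example: 2 x 30 subjects, 1 h per subject -> 60 CPU-hours in total.
if __name__ == "__main__":
    c_hour_a = 0.0273  # USD per CPU-hour for Cluster A (value from the text)
    print(f"Cluster A job cost: {job_cost(60, c_hour_a):.2f} USD")  # ~1.64 USD
```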
The above hourly rates are computed at 100% cluster utilisation. This level of utilisation can only be achieved if all nodes of the cluster compute analysis jobs continuously, in 24/7 mode. If, on the other hand, the cluster is used only by one research group, or shared by a small number of groups with an intermittent, burst-like usage pattern, a much lower utilisation rate is achieved in practice. As the utilisation level falls, both the hourly and the final job costs increase significantly. The utilisation factor \(U\) of the cluster is the ratio of the actual CPU-hours used for computation to the total available cluster CPU-hours. When a cluster is used for \({h}_{total}\) CPU-hours per month for job processing, the utilisation factor becomes
$$U = \frac{{h_{total} }}{N \cdot 30 \cdot 24},$$
which is inversely proportional to the cluster size \(N\). To account for the effect of utilisation on the CPU-hour cost, a utilisation-corrected cluster CPU-hour cost \({C}_{hour}^{*}\) should be used, defined as
$$C_{hour}^{*} = \frac{{C_{cluster} \left( N \right)}}{U \cdot N \cdot 30 \cdot 24} = \frac{{C_{cluster} \left( N \right)}}{{h_{total} }}.$$
(16)
From this, the cost of the compute jobs can be expressed as \({C}_{jobs}^{*}={{h}_{jobs}\cdot C}_{hour}^{*}.\) Note that if the cluster is not shared with other groups, i.e., \({h}_{jobs}={h}_{total}\), the job cost reduces to \({C}_{jobs}^{*}={C}_{cluster}(N)\), demonstrating that the cost of a monthly job unit is equal to the monthly cost of the cluster, regardless of how many hours the cluster is in use. Figure 4 illustrates the cost of a cluster CPU-hour as a function of cluster utilisation when the cluster is used 240 h a month for EEG data processing tasks. These results explain why a centralised cluster facility shared with several other user groups is preferable, as it has the potential to achieve higher utilisation rates and lower unit costs.
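Equation (16) and the resulting job cost can be sketched as follows, again with the monthly cluster cost \({C}_{cluster}(N)\) treated as a given input rather than reproduced from Table 4.

```python
# Sketch of the utilisation-corrected CPU-hour cost of Eq. (16).
# monthly_cluster_cost stands for C_cluster(N); its value depends on the
# chosen configuration and is treated here as a given input.

HOURS_PER_MONTH = 30 * 24

def utilisation(h_total: float, n_nodes: int) -> float:
    """U = h_total / (N * 30 * 24)."""
    return h_total / (n_nodes * HOURS_PER_MONTH)

def corrected_cpu_hour_cost(monthly_cluster_cost: float, h_total: float) -> float:
    """C*_hour = C_cluster(N) / h_total (Eq. 16)."""
    return monthly_cluster_cost / h_total

def corrected_job_cost(h_jobs: float, monthly_cluster_cost: float,
                       h_total: float) -> float:
    """C*_jobs = h_jobs * C*_hour; equals C_cluster(N) when h_jobs == h_total."""
    return h_jobs * corrected_cpu_hour_cost(monthly_cluster_cost, h_total)
```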
Figure 5 compares the CPU-hour costs of non-shared (dedicated) clusters with those of shared centralised clusters and cloud virtual machines. Shared cluster utilisation levels are set to 90, 50 and 25%. The results show that the CPU-hour cost of the non-shared, dedicated clusters (Clusters A and C in Fig. 5) increases quickly with the cluster size. The CPU-hour cost of the shared clusters decreases with cluster size and converges to a constant value; this is especially apparent for Cluster C at 90%, 50% and 25% utilisation in Fig. 5 (right panel). Note that the CPU-hour cost in the cloud is constant, and whether this cost is lower or higher than the local cluster cost depends on the cluster configuration. In Fig. 5, the shared cluster CPU-hour cost is always lower than the cloud cost for Cluster A, but always higher for Cluster C. Since cloud providers use enterprise-grade servers comparable to the Cluster C type nodes of our hypothetical configurations, it is not surprising that Cluster C at 90% utilisation approaches the cloud CPU-hour cost. It can be concluded that using a cluster solely within a single group is clearly a poor economic decision; it results in orders of magnitude higher CPU-hour costs than the alternatives. Shared, centralised clusters are preferable as on-premises facilities, but if high-quality equipment is used, a utilisation rate above 90% must be maintained to achieve cloud-competitive costs.
Execution cost in the cloud
The cost of executing \(h\) hours of EEG analysis jobs per month in the cloud using \(p\) virtual machine instances is simply the product of the cost of a virtual machine instance and the time it is used during job execution, which is expressed as
$$C_{job}^{cloud} = \frac{h}{p} \cdot \left( {p \cdot c_{VM} } \right) = h \cdot c_{VM}$$
(17)
The cluster and cloud job cost models allow us to compare the cost of the two types of systems for actual EEG data processing tasks. Figure 6 compares the cost of processing in our hypothetical Cluster A, B and C configurations to that of the cloud, as a function of the computational load given in number of hours per month. As before, a 4-year lifespan is assumed for the clusters. The figure shows that the cluster-cloud break-even point depends on the cluster size and the processing load. For our initial 240-h per month target computing load, Cluster A is cheaper than the cloud only for p = 1; Clusters B and C are more expensive than the cloud irrespective of cluster size at this target load. Large compute loads give different results. For Cluster A, the cloud becomes more expensive at h > 800 (p = 10, 8-core VM), h > 1500 (p = 10, 4-core VM) or h > 2200 (p = 30, 8-core VM). The thresholds for Cluster B are h > 1200 (p = 10, 8-core VM) and h > 2200 (p = 10, 4-core VM). For Cluster C, the cloud is the cheaper option if p > 5 and h < 3000. If, however, our goal is to reduce execution time considerably (p > 10), the cloud is the best alternative for moderate loads. As we increase the number of compute nodes (p > 30), the load threshold after which the cloud becomes more expensive increases accordingly.
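For a rough feel of where the compute break-even lies, the following sketch compares the monthly cloud compute cost of Eq. (17) with an assumed constant monthly cluster cost; the placeholder cluster cost below is illustrative and does not correspond to a specific Table 4 or Fig. 6 configuration.

```python
# Sketch of the cluster-vs-cloud compute break-even load. monthly_cluster_cost
# stands for the (roughly constant) C_cluster(N); c_vm is the per-hour VM price.

def cloud_monthly_cost(h: float, c_vm: float) -> float:
    """Eq. (17): cloud compute cost is h * c_VM, regardless of p."""
    return h * c_vm

def break_even_hours(monthly_cluster_cost: float, c_vm: float) -> float:
    """Monthly load h above which the cloud becomes more expensive."""
    return monthly_cluster_cost / c_vm

if __name__ == "__main__":
    c_vm = 0.2              # USD/h, the VM price assumed later in the text
    cluster_month = 160.0   # placeholder monthly cluster cost in USD
    print(f"Cloud is cheaper below ~{break_even_hours(cluster_month, c_vm):.0f} h/month")
```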
Cost of storage
Local cluster
If a local cluster uses the built-in hard disks of the nodes (e.g. 1 TB per computer) with a distributed file system for storing data (as in cluster configurations A and B), and this provides sufficient storage capacity for the analyses, the cost of data storage is already part of the operational cost of the cluster. If more storage is required, additional hard disks can be installed in the cluster nodes. The price of 1 GB of hard disk storage is approximately 0.02 USD at current HDD prices. Using 4–6 TB disks per node, large clusters can provide storage capacity in the range of 200 TB, which is more than sufficient for an EEG research group.
In clusters that use dedicated network storage, the price is increased by the cost of the storage hardware. These systems normally operate with redundant RAID storage schemes, so the usable storage capacity can be as low as 50% of the raw disk capacity, practically doubling the per-gigabyte cost. Depending on the hardware chosen, the number of disks in a storage unit can vary from 4 to 24, resulting in an overall storage capacity of 12 to 72 terabytes (assuming 6 TB disks and a 50% usable storage rate). Assuming EEG files of size 0.5 GB and a group generating 240 files a month for 4 years, at least 6 TB of storage capacity is required, which this storage option can readily provide.
The actual per-gigabyte storage cost depends on the chosen HDD models, but in practice it varies between 0.01 and 0.1 USD. It is important to note that the per-gigabyte cost in a local cluster is projected over the entire lifetime of the cluster, unlike in the case of cloud storage options. Thus, storing e.g. 10 TB of data for 1 or 4 years at a 0.02 USD/GB base price will cost the same amount, 200 USD.
Cloud storage
Cloud providers charge for storage on a gigabyte-per-month basis. The cost of storing 1 GB of data for one month is approximately 0.02 USD at each major cloud provider. Archival storage tiers for infrequently accessed data are cheaper. In this analysis, the cost of cloud storage is examined in three different scenarios: (i) storing an existing set of data without adding new measurements (e.g. for archival or for sharing data with others), (ii) uploading a new set of measurement data each month, and (iii) the combination of the first two cases, i.e. first uploading existing datasets and then adding new measurements monthly. The costs of these cases are calculated next using the general cloud storage cost formula given in Eq. (10).
Storing existing data Assuming an existing data set of size \(v\) gigabytes, the cost of storing the data in the cloud for \(m\) months is \({C}_{storage}\left(m\right) = m\cdot v\cdot {c}_{st}\). As an example, storing 500 GB or 10 TB of data in the cloud at a cst = 0.026 USD/GB/month base price for 4 years would cost 624 or 12,480 USD, respectively. Note that these prices are significantly higher than the cluster storage prices, which are based on the HDD-only cost. If data are uploaded for archival purposes and accessed infrequently, significantly reduced prices are available. Using backup options (nearline: 0.010 USD and coldline: 0.007 USD per GB per month), the cost of storing the same amounts of data can be reduced to 240 and 4800 USD (nearline) or 168 and 3360 USD (coldline), respectively.
Monthly upload only If only new measurements of size \(d\) gigabytes are uploaded to the cloud each month in a uniform manner over a period of \(m\) months, and stored cumulatively, the cost of storage is calculated as \({C}_{storage}\left(m,d\right)=m\left(m-1\right)d\cdot {c}_{st}/2\). As an illustration, assuming 2 × 30 subjects measured weekly, the amount of data to upload is approx. 13.2 GB per week (52.8 GB per month) at fs = 512 Hz, or 52.8 GB per week (211 GB per month) at fs = 2048 Hz. Calculating with d1 = 50 GB and d2 = 200 GB per month, the cost of cumulative data storage for 4 years is 1466.4 and 5865.6 USD, respectively. Figure 7 plots the cost function for different monthly upload values in the range of 1 to 500 GB/month using standard ‘hot’ as well as ‘coldline’ storage rates.
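The two storage formulas above can be checked directly; the following sketch reproduces the quoted 624, 1466.4 and 5865.6 USD figures using the stated 0.026 USD/GB/month price.

```python
# Sketch of the cloud storage costs for cases (i) and (ii). c_st is the
# per-gigabyte monthly price; the values used below are the ones quoted in
# the text and may differ between providers and over time.

def cost_existing(m: int, v: float, c_st: float) -> float:
    """Case (i): store an existing v-GB data set for m months."""
    return m * v * c_st

def cost_monthly_upload(m: int, d: float, c_st: float) -> float:
    """Case (ii): upload d GB each month, stored cumulatively."""
    return m * (m - 1) * d * c_st / 2

if __name__ == "__main__":
    print(cost_existing(48, 500, 0.026))        # 624.0 USD (standard tier)
    print(cost_monthly_upload(48, 50, 0.026))   # 1466.4 USD
    print(cost_monthly_upload(48, 200, 0.026))  # 5865.6 USD
```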
Combined accumulated data storage When the previous two options are combined, the general cloud storage cost calculation formula (10) should be used. Figure 8 shows how the upload of an initial 1 TB of data affects the overall monthly accumulated storage cost. It is important to highlight that the above storage cost estimation is based on using a 16 bit/sample datafile format. If data is stored in a 3-byte per sample file format (e.g. BDF), or in 4-byte (single, float) or 8-byte (double) data formats, the cost of storage will increase considerably. Storing intermediate data files will also increase storage costs.
If data is accessed infrequently after processing, the cost of long-term storage can be decreased considerably by moving data from standard storage to coldline once processing is complete. This strategy uses the expensive ‘hot’ storage only for the duration of processing. The cost function for this cost-optimised version is
$$C_{storage}^{*} \left( {m,d} \right) = mdc_{st}^{hot} + \frac{m}{2}\left[ {2v + \left( {m - 1} \right)d} \right]c_{st}^{coldline} .$$
(18)
Figures 7 and 8 also illustrate to what extent coldline storage can reduce the overall storage cost (dotted line). The cost-optimised version is within the price range of a network-attached storage system with a capacity of 32–64 terabytes.
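A direct implementation of Eq. (18) may be useful for exploring this trade-off; the example values below (an initial 1 TB data set and 50 GB uploaded per month at the quoted hot and coldline rates) are purely illustrative.

```python
# Sketch of the cost-optimised storage strategy of Eq. (18): each month's new
# data of d GB is kept in 'hot' storage for one month of processing, then moved
# to coldline together with the initial v-GB data set.

def cost_optimised_storage(m: int, v: float, d: float,
                           c_hot: float, c_cold: float) -> float:
    """C*_storage(m, d) = m*d*c_hot + (m/2) * (2*v + (m - 1)*d) * c_cold."""
    return m * d * c_hot + (m / 2) * (2 * v + (m - 1) * d) * c_cold

if __name__ == "__main__":
    # 4 years, 1 TB initial data, 50 GB/month, rates quoted in the text
    print(f"{cost_optimised_storage(48, 1000, 50, 0.026, 0.007):.1f} USD")
```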
Having discussed the cost of storage in the cloud, we should take a step back and assess whether data can be stored in the cloud safely. Medical data storage is governed by strict legal regulations; the storage and transfer of such data can be limited to an institute, a country or a higher entity, such as the EU countries. If data can be stored in a cloud storage system, there must be guarantees that it is stored securely. By default, data is stored encrypted when written to disk. If required, encryption keys can be managed by customers (customer-managed encryption keys) or, if keys must be stored locally, supplied by customers (customer-supplied encryption keys). Data communication is also protected by default encryption-in-transit methods, which can be strengthened, if necessary, by using IPsec-based virtual private network connections to cloud resources. Shielded virtual machines, which provide stronger protection against tampering during execution, are normally available at no extra cost. Cloud vendors also comply with several regulatory requirements, e.g. the Health Insurance Portability and Accountability Act of 1996 (HIPAA) in the case of medical data. All these measures result in highly secure data storage systems that may be more secure and trusted than an on-premises data storage system.
Infrastructure selection
After the preceding individual analyses of the computation and data storage costs, we now turn to comparing the cost of a local on-premises cluster with that of the cloud in terms of the total cost of ownership (TCO). As seen in the preceding sections, the cluster cost is close to constant whereas the cloud cost increases with usage, i.e. as the compute load and the size of the stored data increase. The goal of the TCO analysis is to determine after how many months the cloud solution becomes more expensive than a local cluster. Using the break-even model defined in Eq. (15), the three hypothetical clusters introduced earlier are compared with the cloud infrastructure, using a representative set of processing workflow parameters. Equation (15) is solved for \(m\) (months) at varying input parameter values for \(h\) (workload size in hours) and \(d\) (monthly upload in GB). Table 5 presents the break-even values for different cluster types and sizes, expressed in years. Values highlighted in bold indicate that, for the parameter combination of that cell, the cloud is more cost-effective than an on-premises cluster when a 4-year useful lifetime is planned for the clusters.
Table 5 Results of the cluster-cloud break-even analysis

The range of the input parameters is established based on common EEG measurement and processing settings. Assuming that the number of electrodes varies from 19 to 256, the sampling frequency from 256 to 2048 Hz, and data is stored as 2 bytes/sample, the size of a 1-min measurement varies between 570 kB and 60 MB. Assuming an average experiment duration of 10 min, the average data file size is between 5.6 and 600 MB. If the number of subjects measured each month varies between 20 and 100, the total uploaded data is in the range of 112 MB–58.6 GB. In order to incorporate longer measurements, the final range for \(d\) is from 1 to 300 GB. The compute load is varied from 50 h (approximately 1 h/subject) up to 3000 h; for the one- and 5-node clusters, the maximum is 720 and 1300 h, respectively. The maximum is set to represent computational problems requiring tens of compute hours per subject. The cost of the cloud virtual machine instance \({c}_{VM}\) is set to 0.2 USD/h, whereas the cost of storage \({c}_{st}\) is 0.02 USD/GB/month.
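The break-even search behind Table 5 can be sketched as follows. The full cluster cost model \(C_{cluster}(m,h)\) of Eqs. (1)-(7) is not reproduced here, so for illustration it is approximated by a constant monthly cost, and the default prices are the values assumed above.

```python
# Sketch of the break-even search behind Table 5. The cluster side is
# approximated here as a fixed monthly cost times the number of months; the
# paper's detailed cluster cost model is not reproduced.

def cloud_total_cost(m: int, h: float, d: float, v: float = 0.0,
                     c_vm: float = 0.2, c_st: float = 0.02,
                     c_network: float = 0.0) -> float:
    """Cloud TCO after m months: compute + cumulative storage + network."""
    return m * h * c_vm + (m / 2) * (2 * v + (m - 1) * d) * c_st + c_network

def break_even_months(cluster_monthly_cost: float, h: float, d: float,
                      max_months: int = 120):
    """First month in which the cloud TCO exceeds the cluster TCO, if any."""
    for m in range(1, max_months + 1):
        if cloud_total_cost(m, h, d) > m * cluster_monthly_cost:
            return m
    return None
```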
The financial analysis indicates that using the cloud in the short term is an economically justifiable alternative to running a local cluster system. If a team does not want to commit to a local infrastructure in the long term, the cloud option should be preferred. High-end clusters are always more expensive than the cloud. Budget and normal cluster systems are only cost-efficient if they operate under very high compute load and store large amounts of data. To conclude the analysis, assuming the typical usage characteristics of an EEG research group, any cluster consisting of more than 10–15 nodes will be more expensive than the cloud solution.
Note that the calculations in this paper rely on data obtained at the time of writing, and the model only serves to indicate major cost trends. For real decisions, the calculations should be carried out using up-to-date hardware cost and cloud pricing data, and specific local operational factors (utility and staff costs) should be taken into consideration.
Budget planning
In addition to finding out when a cloud infrastructure is more expensive, another important question is how much data one can process in the cloud from the budget originally allocated for building an on-premises cluster. In this scenario, we are interested in the number of hours \(h\) we can process and the amount of data \(d\) we can upload each month during the lifespan \(m\) of the cluster. Changing the direction of the inequality in (15), we are now searching for values of \(h\) and \(d\) that satisfy
$$C_{cluster} \left( {m,h} \right) > m \cdot h \cdot c_{VM} + \frac{m}{2}\left[ {2v + \left( {m - 1} \right)d} \right]c_{st} + C_{network}^{cloud}$$
(19)
Fortunately, \(h\) and \(d\) are related, since the processing time \(h\) depends on the data size \(d\); assuming a linear relationship, this can be written as \(h=A\cdot d\), where \(A\) is the per-gigabyte processing time.
The exact value of \(A\) can be determined with trial runs, after which the inequality can be solved for \(d\).
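For instance, substituting the assumed linear relation \(h = A\cdot d\) into (19) and rearranging yields the following bound on \(d\):

$$d < \frac{{C_{cluster} \left( {m,h} \right) - C_{network}^{cloud} - m \cdot v \cdot c_{st} }}{{m\left( {A \cdot c_{VM} + \frac{{\left( {m - 1} \right)}}{2}c_{st} } \right)}}$$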
If, instead of the processed data size, one is interested in the number of experiments that can be analysed from the cost of the target cluster, \(h\) and \(d\) should be replaced by \(k\cdot {h}_{exp}\) and \(k\cdot {d}_{exp}\), where \(k\) is the number of experiments and \({h}_{exp}\) is the processing time of an experiment of size \({d}_{exp}\). The experiment data size depends on the number of electrodes, the sampling rate, the bytes per sample, and the measurement length. If these parameters are known, the inequality can be solved for \(k\):
$$C_{cluster} \left( {m,h} \right) - C_{network}^{cloud} > m \cdot k \cdot h_{\exp } \cdot c_{VM} + \frac{m}{2}\left[ {2v + \left( {m - 1} \right)k \cdot d_{\exp } } \right]c_{st}$$
$$k < \frac{{C_{cluster} \left( {m,h} \right) - C_{network}^{cloud} - m \cdot v \cdot c_{st} }}{{m\left( {h_{exp} \cdot c_{VM} + \frac{{\left( {m - 1} \right)}}{2}d_{exp} \cdot c_{st} } \right)}}$$
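The bound on \(k\) can be evaluated directly, as in the sketch below; the default VM and storage prices are the values assumed earlier (0.2 USD/h and 0.02 USD/GB/month), and the cluster and network costs are treated as given inputs.

```python
# Sketch of the experiment-count bound derived above: how many experiments k
# can be analysed in the cloud from the budget of the planned cluster.

def max_experiments(c_cluster: float, c_network: float, m: int, v: float,
                    h_exp: float, d_exp: float,
                    c_vm: float = 0.2, c_st: float = 0.02) -> float:
    """k < (C_cluster - C_network - m*v*c_st) /
           (m * (h_exp*c_vm + (m - 1)/2 * d_exp*c_st))."""
    numerator = c_cluster - c_network - m * v * c_st
    denominator = m * (h_exp * c_vm + (m - 1) / 2 * d_exp * c_st)
    return numerator / denominator
```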
Finally, we look at how the model can be used in project budgeting. Assuming that the cloud infrastructure has already been decided upon, the next step in the planning process is to calculate the total amount, and the amount per month, required for carrying out the necessary data analysis tasks. Replacing \({C}_{cluster}(m,h)\) with the unknown budgeted cost \(B\), the following equation provides the solution; from \(B\), the monthly cloud budget can easily be derived.
$$B = m \cdot k \cdot h_{\exp } \cdot c_{VM} + \frac{m}{2}\left[ {2v + \left( {m - 1} \right)k \cdot d_{\exp } } \right]c_{st} + C_{network}^{cloud}$$
(20)
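As a minimal illustration of Eq. (20), the sketch below computes \(B\) and a simple monthly figure obtained by spreading \(B\) evenly over the project duration; the network cost term and all prices are treated as given inputs, with the defaults taken from the values assumed earlier.

```python
# Sketch of the budget calculation of Eq. (20): the total cloud budget B needed
# to analyse k experiments over m months, plus a derived monthly figure.

def total_cloud_budget(m: int, k: int, h_exp: float, d_exp: float, v: float,
                       c_network: float = 0.0,
                       c_vm: float = 0.2, c_st: float = 0.02) -> float:
    """B = m*k*h_exp*c_vm + (m/2)*(2*v + (m - 1)*k*d_exp)*c_st + C_network."""
    return (m * k * h_exp * c_vm
            + (m / 2) * (2 * v + (m - 1) * k * d_exp) * c_st
            + c_network)

def monthly_cloud_budget(m: int, **kwargs) -> float:
    """A simple monthly figure obtained by spreading B evenly over m months."""
    return total_cloud_budget(m, **kwargs) / m
```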
The aforementioned results illustrate that choosing between a local cluster infrastructure and the cloud is a complex task. The outcome depends on the interaction of a number of input parameters as well as on the EEG measurement settings and workload characteristics. In general, if very high computational capacity is required, an adequately sized local cluster may be more cost-effective, provided the utilisation of the cluster is kept high (e.g. by sharing it with other groups). Similarly, if a large amount of measurement data must be stored cumulatively and accessed frequently, a local storage option may be more advisable. On the other hand, if EEG processing is characterised by relatively light compute and storage requirements, cloud execution is expected to be more cost-effective. The presented models can help in the detailed analysis of these parameters within the specific data processing context of a given research group.
In those cases where cloud processing is more cost-effective, there are additional benefits as well. The elastic scaling of the cloud can drastically decrease job execution times. This can be especially important in large cohort studies (e.g. groups with more than 100 subjects or large clinical studies); execution speed can increase by up to two orders of magnitude at no additional cost. The models in this paper assume that the programs are executed in the cloud the same way as locally, without modification. Using parallel algorithms, execution time can be decreased considerably, even for individual subject analysis. This opens up new opportunities both for further reductions in execution time and for using more sophisticated models and analysis methods that would otherwise be too time-consuming when executed the traditional way. Every cloud vendor offers high-performance GPUs (graphics processing units) that can provide several orders of magnitude higher computational performance than multi-core CPUs. While GPU hourly costs (1.46–2.48 USD/h) are typically higher than CPU rates (0.1–4 USD/h), an execution speedup of, say, 300x can reduce both the execution time and the overall cost of the compute resource to a fraction of the original price. The downside is that GPU execution is a disruptive technology: code must be modified substantially to execute efficiently on GPUs. With increasing support for GPUs in existing EEG execution frameworks and the implementation of new, efficient GPU data processing algorithms, this mode of execution has the potential to revolutionise EEG processing.