1 Introduction

Biology is a Big Data discipline, driven largely by advancements in instrumentation that produce vast quantities of data. The key technologies are collectively known as Omics, consisting of genomics, transcriptomics, proteomics, metabolomics and several imaging techniques. Researchers often use a combination, or variations, of omics techniques (multi-omics) to understand biological systems in a comprehensive manner. Each of these omics techniques can generate data on the order of hundreds of gigabytes per experiment, and the data magnitude scales up vastly in multi-omics studies. While the omics revolution provides us a magnifier to look at biology at fine resolution, its success largely depends on the underlying data management techniques. Big Data and Omics are fast-evolving techniques in their respective domains, and it is becoming increasingly clear that the success of the omics revolution depends on highly scalable computing provisions that provide cost- and energy-efficient processing capabilities.

At the same time, two important trends are emerging in the Information Technology industry: (a) delivery of compute and storage resources as-a-service, with various degrees of abstraction (platform as a service, software as a service, etc.) aimed at simplifying the interaction between end users and computational infrastructure; (b) application of High Performance Computing (HPC) to Data Centric (DC) architectures [14] that are designed to better handle workflows like the ones in omics.

In this paper, we will attempt to provide a road-map for Bioinformatics as a Service (BaaS) on distributed computing infrastructure, which can take advantage of modern HPC architectures to enable large-scale omics data processing. We will focus on the Next Generation Sequencing (NGS) technologies, primarily applied to genomics and transcriptomics, and the main computational tasks involved in the efficient processing of these datasets. We will explain our approach using a computational pipeline developed for processing whole-transcriptome datasets, also known as RNA-Seq datasets, which are playing a central role in the adoption of NGS in clinical settings.

2 Motivation

Genomics data is growing at an unprecedented rate [15]. Much of the published raw data is available at the Sequence Read Archive (SRA) maintained by NIH/NCBI or its partnering institutions at EMBL-EBI (ENA, the European Nucleotide Archive) and the DNA Data Bank of Japan (DDBJ). The magnitude of data produced from each experiment depends on many factors, such as the choice of sequencing platform, the organism(s) being sequenced, the expected read coverage and the experimental goals [5]. While human whole-genome sequencing data straight from an Illumina sequencer at 30x coverage is expected to be around 200 GB per sample [1], whole-transcriptome data can be much smaller, around 2–10 GB in similar settings [16]. An experiment will usually contain multiple samples and replicates. Specialised databases, like The Cancer Genome Atlas and the ICGC, focus on cancer genomics and contain tens of thousands of samples obtained from a wider population. These samples need to be studied in tandem to create a comprehensive understanding of biological processes and phenotypes of interest. The very first step towards creating a biological understanding from a set of omics data is to process the dataset through appropriate bioinformatics pipelines. The pipelines churn through the raw datasets and create a reduced representation in the form of community-standard compressed file formats such as SAM, BAM and VCF. These files form the basis for downstream data analysis through appropriate statistical and computational routines to generate biological insights.

2.1 Anatomy of a RNA-Seq Pipeline

RNA-Seq datasets are obtained to understand a range of biological phenomena, including alternative gene splicing, gene fusion, changes in gene expression over time, and differences in gene expression across groups or treatments. Here, we present a brief overview of a well-accepted RNA-Seq data processing pipeline, the Tuxedo protocol [17], to lay the basis for our case study. For simplicity, we assume that the experiment compares gene expression across only two conditions, wild-type vs mutant, and that we want to compare the transcriptome profiles across these conditions. The de facto standard for storing the output from a genome sequencer is the FASTQ format, a text-based format that contains both the biological sequence and per-base quality scores. FASTQ files provide the starting point for the bioinformatics pipeline. For a typical paired-end read dataset, there is a pair of FASTQ files for each technical replicate. Ideally, each experiment would contain several technical replicates for each biological replicate. The main computational steps involved in the protocol are read mapping, transcript assembly, and detection of differentially expressed genes or transcripts. Figure 1 shows the information flow in the pipeline and the main tools used. The Tuxedo pipeline is based on the TopHat [9] and Cufflinks [18] set of tools. TopHat is used to map reads from the input FASTQ files to a user-provided reference genome and annotation. The read alignments are reported in the form of compressed BAM files that are further processed through Cufflinks to generate assembled transcripts for each condition. These assemblies are merged together by Cuffmerge to create a unified final transcript assembly file, which provides the basis for comparing gene and transcript expression in each condition. The detection of differentially expressed genes or transcripts is performed by Cuffdiff. Cuffdiff can also be used to perform an additional step of grouping transcripts into biologically relevant groups by performing a combined analysis of the input FASTQ files with the final assembled transcripts. The results from Cuffdiff can be processed through an R [12] package called CummeRbund for statistically relevant visualisation and final reporting. The entire pipeline consists of several stages, some of which can be accelerated by employing efficient parallel and data movement strategies.
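To make the downstream flow concrete, the following shell sketch shows how the back half of the protocol might be invoked once TopHat and Cufflinks have produced per-condition outputs; the directory names (tophat_wt, cufflinks_mut, etc.), the single-replicate layout and the thread counts are illustrative assumptions rather than part of the protocol specification.

# Merge the per-condition assemblies into one unified transcript assembly.
ls -1 cufflinks_wt/transcripts.gtf cufflinks_mut/transcripts.gtf > assemblies.txt
cuffmerge -g user_annotation.gtf -o merged_asm assemblies.txt

# Test for differential expression across the two conditions.
cuffdiff -p 8 -o diff_out -L wild_type,mutant merged_asm/merged.gtf \
         tophat_wt/accepted_hits.bam tophat_mut/accepted_hits.bam

# Visualisation and reporting in R via CummeRbund.
Rscript -e 'library(cummeRbund); cuff <- readCufflinks("diff_out"); print(csDensity(genes(cuff)))'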

Fig. 1.

The Tuxedo pipeline: (A) There are three main tasks in the pipeline - read mapping, transcript assembly, and detection of differentially expressed genes and transcripts. (B) The information flow in the pipeline and the outputs produced at each stage. The input FASTQ files, shown in orange and light blue, represent data from two experimental conditions. The same colour scheme is used to represent all the intermediate files generated for each condition. The steps in the pipeline are drawn to align with the sub-tasks listed in panel A. The input files are processed through the different stages using the tools shown in panel C of the figure. Each stage generates a set of output files that are used for downstream analysis. (C) List of tools grouped according to the sub-tasks listed in panel A. (Color figure online)

2.2 Scope for Parallel Constructs

While Fig. 1 shows an execution plan for a single sample with paired-end input files for two conditions, in reality each experiment can consist of multiple samples and multiple conditions. Each pair of files, representing a particular sample under a particular condition, needs to be independently mapped and assembled using TopHat and Cufflinks respectively. These steps can be performed in an embarrassingly parallel way by assigning a dedicated resource - e.g. a node, core, or accelerator - to each pair of files. When all the pairs of FASTQ files have been processed to generate their respective BAM and GTF files, they need to be merged through Cuffmerge to generate a final transcript assembly that is analysed as a whole. Further, if there is a plan to perform grouping of reads through Cuffdiff, this task is also specific to each pair of input files and can be performed independently in an asynchronous manner. In order to achieve significant speedup in the case of large input FASTQ files on the scale of tens of gigabytes, one can divide the FASTQ files into chunks of manageable size, perform read mapping and assembly calling for each chunk, and combine the BAMs and GTFs at the end. These divide-and-compute strategies at certain stages of the pipeline provide the means to manage memory, time and compute resources.
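A minimal shell sketch of the chunking strategy for one large paired-end sample is shown below; the chunk size, file names and use of background processes are illustrative, and in a production setting each chunk would be submitted as a separate scheduler job instead.

# Split one large paired-end sample into chunks of one million reads
# (4 FASTQ lines per read); split keeps the read order, so the n-th
# chunk of file 1 pairs with the n-th chunk of file 2.
split -l 4000000 Exp_R1_1.fq chunk_R1_
split -l 4000000 Exp_R1_2.fq chunk_R2_

# Map each chunk independently, here simply as background processes.
for c in chunk_R1_*; do
  tag=${c#chunk_R1_}
  tophat2 -p 8 -G user_annotation.gtf -o "tophat_${tag}" \
          reference_genome "chunk_R1_${tag}" "chunk_R2_${tag}" &
done
wait   # all chunks must finish before their alignments can be combined

# Recombine the per-chunk alignments into a single BAM for assembly calling.
samtools merge accepted_hits_all.bam tophat_*/accepted_hits.bam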

2.3 Portability and Reproducibility

While Tuxedo is a well-understood and widely adopted pipeline, portability and reproducibility are two fundamental problems in bioinformatics. A typical bioinformatics workflow requires a number of independently developed components that are stitched together through shell scripts, workflow standards or traditional programming constructs. These components are generally developed independently, using heterogeneous software engineering practices, by different research groups. Development of a pipeline that runs on a wide variety of platforms and produces reproducible results is an important consideration in a clinical setting. The tools in a pipeline like Tuxedo offer a number of command-line arguments that need to be configured carefully and consistently to perform a reproducible study. The pipeline itself should be portable across compute architectures. Portability can be achieved through the use of container technology like Docker [4], and reproducibility can be achieved using recently developed workflow languages like the Common Workflow Language (CWL) [3]. To put reproducibility in the context of our use case, let's focus on the first two steps in the pipeline, which are invocations of TopHat and Cufflinks respectively. The following boxes show how calls to these tools look when executed from a command-line interface.

tophat2 -p 8 -G user_annotation.gtf -o tophat_output reference_genome Exp_R1_1.fq Exp_R1_2.fq

cufflinks -p 8 -o cufflinks_output tophat_output/accepted_hits.bam

Both TopHat and Cufflinks come with a range of command-line arguments to provide specific instructions to the tool depending on the input data. Here we use only the necessary arguments for demonstration purposes. In both the TopHat and Cufflinks calls, the switches -p and -o specify the number of processors to use and the name of the output directory respectively. In TopHat, there is an additional switch -G to indicate the user-provided annotation file. In order to achieve reproducibility, it is vital that repeated executions of these tools involve the same arguments every time. Suppose a new user comes with additional information about a particular dataset and decides to execute TopHat with an additional parameter --library-type fr-secondstrand, which forces TopHat to perform read alignment in a different manner; the results will differ from those produced in the default mode, and reproducibility will not be achieved. Specifications like CWL provide a framework to facilitate on-demand construction of command-line calls and data movement in a consistent manner across environments. A user would directly manipulate a CWL representation as opposed to explicitly setting command-line options.
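As an illustration of how such parameters can be pinned down, the sketch below records the TopHat inputs in a CWL job file and runs it with the reference CWL runner; tophat2.cwl and the input identifiers (reads_1, num_processor, library_type, etc.) are hypothetical names for a tool description such as the one visualised in Fig. 2.

# Hypothetical job-input file for a CWL description of the TopHat step;
# every argument, including the library type, is recorded explicitly.
cat > tophat_job.yml <<'EOF'
reads_1:       {class: File, path: Exp_R1_1.fq}
reads_2:       {class: File, path: Exp_R1_2.fq}
annotation:    {class: File, path: user_annotation.gtf}
genome_index:  reference_genome
num_processor: 8
library_type:  fr-unstranded
EOF

# Re-running the same tool description with the same job file reproduces the
# exact command line, independent of who launches it and on which machine.
cwltool --outdir tophat_output tophat2.cwl tophat_job.yml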

Figure 2 shows a Rabix [19] enabled visualisation of the CWL code for TopHat and Cufflinks. Toolkits like Rabix provide a graphical interface that enables a bench biologist to create pipelines in CWL without worrying about the language specification, while at the same time providing an environment for experienced programmers to write complex CWL pipelines. The visualisation can be quite useful for understanding and debugging a pipeline with tens of intermediate steps and multiple input parameters. The combination of workflow languages with containers is increasingly being adopted by the bioinformatics community, and truly provides a plug-and-play environment for the development of workflows and pipelines. These constructs provide a modular environment in which a new step can be introduced, or an existing one modified, without disturbing the rest of the pipeline. This is extremely handy, as some bioinformatics pipelines can involve tens of steps and tools, where maintaining synchronisation between the components can be a tiresome and error-prone process.

Fig. 2.

Rabix-based visualisation of the CWL representation of the TopHat and Cufflinks commands and their interconnection in the pipeline. The node num_processor corresponds to -p in the commands above; final_th_dir corresponds to -o tophat_output in the TopHat call, and so on.

3 Bioinformatics as a Service (BaaS)

The Tuxedo pipeline discussed in the previous sections is an example of a typical bioinformatics workflow, where a set of tools is stitched together to achieve a specific goal. While the tools in Tuxedo were developed by the same research group, this is often not the case in large bioinformatics workflows, where tools are developed by independent research groups and a bioinformatician often mixes and matches among a set of available alternatives to prepare a workflow for a specific requirement. Different tools have different compute requirements and are diverse in their design and implementation. There are some fundamental problems in using different tools as components in a workflow: (a) Installation of the components - a bioinformatics tool can be used only if it can be installed first, which requires the software to have been developed with good design and development practices; many bioinformatics tools fail at this very point, as their third-party dependencies are often outdated, of obscure origin or no longer maintained, and the effort of stitching dozens of such components together, where each component first needs to be built and tested before it can become part of the pipeline, can be daunting. (b) Security/resilience - for the reasons mentioned above, one would not want an entire server to go down because a badly designed program misused the machine's resources; at the very least, we expect the application to fail gracefully and the server to recover with minimum downtime. (c) Continuously changing landscape of genomic technology - genomics is among the fastest growing industries; the sequencing platforms, and the chemistry driving those platforms, have both been evolving rapidly, increasing the quality and quantity of the data produced. As a result, the software landscape is dynamic too, as it needs to keep pace with the latest platform updates. These dynamics require provisions for a new tool to be tested by easily plugging it into an existing pipeline, and to be discarded, without affecting the entire workflow.
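Containers address the installation problem directly: the tool and all of its dependencies ship as a single image, and the host only needs a container runtime. The following sketch shows the idea for the TopHat call from Sect. 2.3; the image name is hypothetical and would in practice point to a curated registry.

# Run TopHat from a pre-built container instead of installing it on the host;
# "ourregistry/tophat:2.1.1" is a hypothetical image name. The working
# directory is mounted into the container so inputs and outputs stay on the host.
docker run --rm -v "$PWD":/data -w /data ourregistry/tophat:2.1.1 \
    tophat2 -p 8 -G user_annotation.gtf -o tophat_output \
            reference_genome Exp_R1_1.fq Exp_R1_2.fq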

3.1 Essential Components

Our plan for implementing genomic pipelines as-a-service is shown in Fig. 3. We use a combination of HPC and virtualisation technologies to build scalable, high-performing and portable solutions that can be used across private, public and hybrid Cloud environments. As a proof of concept we plan to use Docker, IBM® Spectrum LSF™ [7] and CWLEXEC [8]. All these components provide interoperable building blocks for creating as-a-service pipelines. The cluster design shown in Fig. 3 can be deployed in a private, public or hybrid Cloud. Portability can be achieved through custom-built Docker containers for each component in the workflow. Note that while our design is based on IBM POWER™ architectures, using the IBM-provided middleware components, the same design remains valid for building pipelines across different architectures.
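A minimal sketch of the submission path is shown below, assuming a CWL description of the Tuxedo pipeline and a matching job-input file already exist (rnaseq_tuxedo.cwl and rnaseq_job.yml are illustrative names); CWLEXEC parses the workflow and submits each step to LSF, while Docker requirements declared in the CWL are resolved on the compute nodes. Options beyond the basic invocation vary between versions and are omitted here.

# Submit the whole pipeline: the CWL engine translates the workflow into LSF jobs.
cwlexec rnaseq_tuxedo.cwl rnaseq_job.yml

# The individual pipeline steps then appear as ordinary jobs in the LSF queue.
bjobs -w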

Fig. 3.

BaaS workflow. Major components: a GUI front-end for submission and displaying results, a CWL engine, a batch scheduler like IBM Spectrum LSF, an HPC cluster with a high-bandwidth InfiniBand interconnect, a parallel high-performance filesystem and a Docker repository. Note that both shared and node-local storage can be used for better I/O performance.

3.2 Pipeline Execution

In the proposed environment, the control flow of the pipeline is as follows:

1. The user creates the desired workflow using a graphical interface, or directly through CWL language constructs.

2. The workflow specification is parsed by the CWL engine and translated into submission scripts understandable by a scheduler.

3. Pipeline jobs are submitted to the scheduler in the order prescribed by the CWL flow. The scheduler ensures optimal job placement on compute nodes.

4. Jobs are started on compute nodes. Required components like TopHat and Cufflinks are pulled from a Docker repository on demand.

5. Once all stages of the pipeline are complete, the results are forwarded to the user.

Note that multiple instances of the above flow can be executed concurrently within the same HPC cluster, following the scheme described in Sect. 2.2. Data accesses from multiple pipelines are handled by the high-performance parallel file system.

Figure 4 shows the control flow of a single pipeline instance as a sequence of jobs with dependencies. Each stage of the pipeline consists of multiple concurrent jobs, with each job operating on a chunk of data. A special wait job waits for all jobs in a given stage to finish, and kicks off the jobs for the next stage in the pipeline. In our context, this means that Cufflinks is executed only after TopHat has finished the mapping process and a BAM file for each FASTQ pair is available. Similarly, Cuffmerge is executed only after all the GTF files are available to be merged together. Similar wait constructs apply to Cuffdiff. If the input FASTQ files are divided into chunks and mapped individually, the system waits for all chunks to be mapped and merges the results to create a unified BAM. This sequence is repeated as many times as there are stages in the pipeline. It is important to note that all dependencies of each job - data availability, other jobs, etc. - are known to the scheduler and not hidden inside a job. This ensures the best possible HPC resource utilisation and high throughput.
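Expressed directly in LSF terms, this synchronisation corresponds to job dependency expressions, as in the hand-written sketch below; the job names are illustrative, and in the proposed design these submissions would be generated by the CWL engine rather than typed by hand.

# Map each FASTQ pair as an independent job (one pair shown for brevity).
bsub -J tophat_wt_rep1 \
     tophat2 -p 8 -G user_annotation.gtf -o tophat_wt_rep1 reference_genome wt_rep1_1.fq wt_rep1_2.fq

# Assembly calling for a pair starts only once its mapping job has finished.
bsub -J cufflinks_wt_rep1 -w 'done(tophat_wt_rep1)' \
     cufflinks -p 8 -o cufflinks_wt_rep1 tophat_wt_rep1/accepted_hits.bam

# The merge stage acts as the "wait" job: it depends on every assembly job
# (assemblies.txt lists the per-sample transcripts.gtf files).
bsub -J cuffmerge_all -w 'done("cufflinks_*")' \
     cuffmerge -g user_annotation.gtf -o merged_asm assemblies.txt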

Fig. 4.

Pipeline Instance Control Flow. Each stage consists of multiple concurrent jobs. Special “wait” jobs are used for synchronisation between stages. “Final” job is responsible for post-processing of the overall pipeline results.

4 BaaS in Data-Centric Systems

For a biologist studying transcriptional activity, pipelines like Tuxedo provide a way to create meaningful insights from raw sequence data. The results are often in the form of ASCII files and plots (as in Tuxedo) that provide the basis for further downstream analysis, where specific biological questions can be asked: understanding the interactions between differentially expressed genes of interest from a network perspective, clustering genes according to their functional profiles, or inferring phylogenetic relationships, for example. Tuxedo-like pipelines are independent but essential cogs in larger bioinformatics workflows, and a compute framework must make provisions to provide analytical capabilities for the results produced by such pipelines. The nature of the questions we ask from a computational biology perspective is as varied as the field itself, and so are the computational requirements. In order to realise BaaS, it is essential to provide the means to manage heterogeneous compute resources in the target data-centric systems, to cater to the different computational needs that arise in downstream analysis.

4.1 Heterogeneous Systems

Heterogeneous systems have several types of processing elements and memory hierarchies, in contrast to traditional systems with a single type of processing element and a fixed memory hierarchy. Performance and compute density are obtained by specialised hardware tailored to compute patterns commonly found in scientific applications [10], making such systems suitable for compute-intensive tasks in bioinformatics studies. Accelerators, in particular Graphics Processing Units (GPUs), are becoming integral elements of heterogeneous systems. We are seeing an increasing number of bioinformatics tools exploiting GPUs [11], including for tasks like read mapping, as performed by TopHat in our case study, as well as for phylogenetic inference [13] and network biology [2], as required for post-Tuxedo downstream analysis.

One of the consequences of using accelerators is the necessity of working with different memory spaces and moving data efficiently between them. The intricacies of system heterogeneity pose an extra burden on users, as they now have to control where and when code should be executed and data moved. Sophisticated hardware interconnects like NVLINK, and supporting middleware and development ecosystems like gpuR [6], are increasingly being applied in bioinformatics studies [11].

As the benefits of using accelerators like GPUs become more apparent to users, we anticipate that accelerated code, and dependencies on accelerator toolchains, will become increasingly frequent in the applications used and developed in bioinformatics. For an end user running a Tuxedo-like pipeline in a heterogeneous environment, this raises two major considerations: (i) more complex specification of resources in the pipeline, and (ii) increased scheduling complexity.

Specification of Heterogeneous Resources in the Pipeline. The resource requirements of a tool within a pipeline need to be communicated to the scheduler, so that the scheduler can determine where and how a given application in the pipeline can be launched. Where, because the scheduler needs to forward the workload to machines that possess the required resource (e.g. not all nodes may have a GPU). In some cases a given resource can be a hard dependency, i.e. the application only works if that resource is present, or it can be an optional dependency, i.e. the application performs better if that resource is present. Further, it is not uncommon for an application to require the environment, or one of its arguments, to specify the number of resources available in order to drive the partitioning of the problem at hand. This can be tackled by creating CWL nodes with attributes, known to the scheduler, that the application depends on. For example, in the CWL representation in Fig. 2, the node num_processor would be provided with special attributes so that the scheduler knows the type of processors the application requires and their number. If the number of processors does not affect the results, an attribute would mark this node so that the scheduler could set the number at runtime depending on the resources it has available at a given time.
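How such attributes could surface at the scheduler level is sketched below in LSF terms; the GPU-enabled mapper image is hypothetical, the exact GPU submission syntax depends on the LSF version, and in the full design these lines would be generated from the CWL attributes rather than written by hand.

# Hard dependency: a GPU-accelerated mapper must land on a node with a free GPU
# ("gpu-mapper:latest" and its command line are hypothetical).
bsub -J map_gpu -gpu "num=1" -n 4 \
     docker run --rm -v "$PWD":/data -w /data gpu-mapper:latest map --threads 4

# Elastic CPU requirement: the result does not depend on the core count, so the
# scheduler may allocate between 4 and 16 cores at dispatch time and the tool is
# told how many it actually received.
bsub -J assemble -n 4,16 \
     'cufflinks -p "$LSB_DJOB_NUMPROC" -o cufflinks_output tophat_output/accepted_hits.bam'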

Fig. 5.

Scheduling example for a set of applications in a heterogeneous system with 4 NVIDIA GPUs and 16 POWER8 CPU cores per node. In a given node, applications should use resources unused by applications already running there. A resource allocation can span more than one node. Performance optimisation may involve migrating work once the resources of a finishing application are released, so that communication between all resources used by the remaining applications is more efficient.

Scheduling Complexity. Deploying an as-a-service platform for bioinformatics necessarily implies that, in a given system, multiple pipelines, each with multiple independent jobs, will be running at any given time, and the scheduler must ensure that the system resources are utilised to the fullest. In a heterogeneous environment, it is important that the resources of a given machine are shared by different applications. If a given application only uses CPUs, preventing the use of the GPUs in that machine by some other application would hurt occupancy and therefore waste resources. Virtualisation of resources is therefore a must in order to carve out the right partition of resources from a given machine for use by an application, as shown in Fig. 5. This also facilitates migration of resources between nodes to improve locality in the set of resources bound to an application. Modern scheduling tools have evolved to recognise accelerators like GPUs as system resources in their own right, and this capability will be part of the BaaS design.

5 Conclusions

Even though the Life Sciences form a vast field, DNA is the fundamental unit of life and central to all biological questions. Omics technologies aim to understand DNA and its products, and are rapidly changing biology into a data-intensive discipline. Large amounts of data require efficient compute infrastructure for data processing, and the infrastructure must be contextualised according to the discipline it aims to support. Bioinformatics has its own needs due to the nature of its datasets and the computational tasks involved. BaaS is an attempt to address some of the typical computational requirements that arise from a bioinformatics study. Though BaaS is presented in this paper as a blueprint, it utilises well-understood ideas and is straightforward to implement. We anticipate that, as NGS technologies progress, we will see many variations of experiments and a rapid rise in the data produced. We are already seeing a flood of data from single-cell sequencing experiments, which are fast becoming mainstream. As the technology improves, so will the audacity of the experimental questions being asked, and so will the amount of data produced. In our opinion, in order to understand the central dogma of biology through the prism of omics data, we will need clever algorithms running on a range of compute devices, as envisioned in BaaS. The specifications for BaaS given in this paper are by no means complete; there are several important issues we did not touch upon that would be useful for future improvements of the design. In this paper, we focused only on the core functional aspects of BaaS: how jobs can be installed, scheduled and executed. To enhance the design further, there must be considerations of the scope and role of its users and administrators, provisions for efficient data movement from a user's location to the compute cluster, cost effectiveness and benefits to the end user, provisions for data sharing, and several other issues specific to the bioinformatics domain.