What Galaxy can do for you

For most users, this high level of accessibility is the most welcome and immediate benefit of Galaxy, but it is just the beginning. Just as letting untrained people loose with construction tools does not lead to well-built houses, empowering users to run analysis tools does not in itself lead to sound results. The deeper goal of computational robustness demands that the results and methods of an analysis can withstand scrutiny, and Galaxy provides its most significant capabilities in this domain. To start, Galaxy automatically records the inputs, tools, parameters and settings used for each step in an analysis, thereby ensuring that each result can be exactly reproduced and reviewed later.

This record has important short- and long-term consequences. In the short term, different parameters and thresholds can be explored, and once the analysis is done, the Galaxy record will eliminate any ambiguity as to which result used which settings. In the long term, the Galaxy history is invaluable if an unforeseen follow-up analysis is needed. For example, I have had the all too common experience of mistakenly trying to analyze targeted sequencing results by mapping the reads to build 37 of the human genome, when the coordinates for the design referenced an earlier build, leading to subtle changes and confusing results. If I had been working inside Galaxy, the exact history would have been automatically recorded, and this mistake could have been easily avoided, saving hours of wasted effort.
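The kind of per-step record described above can be illustrated with a minimal sketch. This is not Galaxy's actual implementation or API: the `History` and `StepRecord` classes, the `quality_filter` toy tool and all parameter names are hypothetical, chosen only to show how logging the tool, inputs and parameters of every step removes ambiguity about which result used which settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepRecord:
    """One analysis step: which tool ran, on which inputs, with which settings."""
    tool: str
    inputs: tuple
    parameters: dict
    output: str


class History:
    """Hypothetical stand-in for an analysis history (not Galaxy's real API)."""

    def __init__(self):
        self.datasets = {}  # dataset name -> data
        self.steps = []     # append-only provenance log

    def add_dataset(self, name, data):
        self.datasets[name] = data

    def run(self, tool_name, func, input_names, output_name, **parameters):
        # Execute the tool on the named inputs and store the result.
        args = [self.datasets[name] for name in input_names]
        self.datasets[output_name] = func(*args, **parameters)
        # Log the step in the same call, so every stored result
        # has a matching provenance entry.
        self.steps.append(
            StepRecord(tool_name, tuple(input_names), dict(parameters), output_name)
        )
        return self.datasets[output_name]

    def settings_for(self, output_name):
        """Return the tool name and parameters that produced a given output."""
        for step in self.steps:
            if step.output == output_name:
                return step.tool, step.parameters
        raise KeyError(output_name)


def quality_filter(scores, min_quality):
    """Toy tool (an assumption for illustration): keep scores at or above a threshold."""
    return [s for s in scores if s >= min_quality]


history = History()
history.add_dataset("reads", [10, 25, 31, 40])
history.run("quality_filter", quality_filter, ["reads"], "strict", min_quality=30)
history.run("quality_filter", quality_filter, ["reads"], "lenient", min_quality=20)
```

After exploring both thresholds, `history.settings_for("lenient")` reports which tool and which `min_quality` value produced that result, without relying on memory or file names.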

Beyond automatically providing provenance, Galaxy makes it easy for users to annotate each step with a human-readable description on interactive web documents called Galaxy Pages. Galaxy Pages enhance transparency far beyond the raw command list, as they can be used to communicate the intent of each step with written descriptions, figures and even embedded videos and screencasts. Transparency, more so than reproducibility, is essential for verifying computational analysis, because in the extreme case a programmatic or logical error will lead to exactly the same erroneous result time after time. A well-annotated Galaxy Page helps analysts to catch such errors by enabling them to narrate the logical process of the pipeline, potentially with the same rigor as a mathematical proof. Users can then publish Galaxy Pages as supplementary material for a publication to document the exact stages of the analysis.

After an analysis has been carefully customized and debugged for one dataset, Galaxy users can save that command history as a workflow and repeatedly apply it to different data. Each time the workflow is run, the same sequence of tools will be executed with the same parameters as before but with the new data. This way a Galaxy user can develop a rich, organized catalog of reusable workflows rather than starting from scratch each time or trying to navigate a collection of ad hoc analysis scripts. In addition, users can share their workflows and Galaxy Pages on the central Galaxy website, tapping into the collective intelligence of the Galaxy community and improving the field for everyone.
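The idea of reapplying a fixed sequence of tools and parameters to new data can be sketched in a few lines. This is only an illustration under stated assumptions, not how Galaxy stores or executes workflows: the toy tools `drop_short` and `trim`, the `WORKFLOW` list and the `run_workflow` helper are all hypothetical names.

```python
def drop_short(reads, min_length):
    """Toy tool (hypothetical): discard reads shorter than min_length."""
    return [r for r in reads if len(r) >= min_length]


def trim(reads, length):
    """Toy tool (hypothetical): truncate every read to a fixed length."""
    return [r[:length] for r in reads]


# A workflow here is just an ordered list of (tool, parameters) pairs,
# fixed once the analysis has been debugged on a first dataset.
WORKFLOW = [
    (drop_short, {"min_length": 4}),
    (trim, {"length": 4}),
]


def run_workflow(workflow, data):
    """Apply each tool in order, with its saved parameters, to the given data."""
    for tool, parameters in workflow:
        data = tool(data, **parameters)
    return data


first_result = run_workflow(WORKFLOW, ["ACGTAC", "AC", "GGGGT"])
second_result = run_workflow(WORKFLOW, ["TTTTTT", "T"])
```

Both calls run the same tools with the same parameters; only the input changes, which is the property that makes a saved workflow reusable and comparable across datasets.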

Galaxy's goals are ambitious, and the project is not without limitations, but it is now the leading platform for computational analysis of DNA sequence data. The standard installation is loaded with analysis tools for trimming and preparing raw sequences [2], mapping sequences to reference genomes [3, 4], cataloging variations [5] and statistically analyzing the results. I've heard of Galaxy users developing and running new analyses in hours that would have previously taken weeks of effort at the command line. Already, several papers [6-8] have been published in leading journals in which the analysis was completed within Galaxy and augmented with detailed Galaxy Pages, allowing other researchers to study and understand the methods used in greater detail than before. The public repository of Galaxy Pages, workflows, and datasets is poised to become one of the most valuable bioinformatics resources online and the first stop for analysts facing new challenges.

Use with care

Multiple studies have shown that software developers are much more productive when using higher-level abstractions such as modern programming languages, sophisticated software libraries and richer development environments [9]. However, these abstractions sometimes cause new problems because they hide potentially important details, including the conditions under which they are suitable. Similarly, Galaxy users will become more productive working at a higher level, but will also face new dangers of this kind.

Consider the case of a casual user discovering and running a workflow in the Galaxy repository for analyzing differential expression within an RNA-seq experiment. Even if the workflow was scrutinized and published for one dataset, the user could reach a disastrous conclusion if they failed to realize that the workflow depends on a particular library preparation or requires a certain type of technical replicate that their experiment did not use. Galaxy verifies that file formats are compatible and makes analysis accessible, but until systems for analyzing semantic dependencies of this kind are available, Galaxy cannot make analysis fully automatic and intelligent. The very popular R/Bioconductor package [10] recognizes and addresses this issue by deliberately not offering a single prepackaged analysis 'wizard' for common tasks, and instead offering a selection of choices and requiring users to consider their options carefully. This is the most practical approach for Galaxy as well, but it creates its own usability problems, especially the additional burden placed on the user to select the appropriate tool or workflow.

Power users may find Galaxy too restrictive because not every software package is available within it, especially cutting-edge software for novel analyses, and the graphical interface does not offer the same flexibility as a scripting environment or R/Bioconductor. However, the other benefits of Galaxy, especially its productivity, provenance tracking and transparency, may outweigh these limitations for analysis tasks leading to publication. Until massively parallel and powerful computational resources are available, all users face the frustration of working with very large datasets, where computation can run for days or weeks. Galaxy users would do best to install it on their own servers or to use the new cloud-computing-based version, which can be dynamically provisioned on demand.

A final problem with any computation-based project is whether it can enable long-term reproducibility. For example, none of the software packages I purchased for my first computer in the 1980s works today, and it is not clear if any package I use today will work in 20 years. Galaxy mitigates this problem by using open standards and building a community of users and developers beyond a single funding source, but no one knows whether future web browsers and operating systems will support today's standards. This challenge is beyond the scope of Galaxy alone, and journals and publication archives need to actively research how to maintain legacy software accessibility in the future, perhaps through the use of virtualized machine images for interactive or enhanced media supplementary material.