Background

Transcriptome data are fast becoming an important and cost effective component of phylogenetic studies [1-5]. The rapid fall in sequencing prices has contributed to the growing number of phylogenetic studies that integrate data from genomes and transcriptomes, often referred to as “phylogenomic” analyses. There is wide recognition that adding data from a larger number of genes is necessary to address many open phylogenetic questions, though of course additional gene sequences alone will not be sufficient to resolve them all [6-8].

The primary impediments to wider adoption and further improvement of phylogenomic methods are the complexity of the analyses and the lack of integrated tools to conduct them. Each phylogenomic study requires many steps, the vast majority of which concern matrix construction rather than phylogenetic analysis itself. These steps include raw data filtering, assembly, identification of ribosomal RNA, selection of transcript splice variants, translation, identification of homologous sequences, identification of orthologous sequences, sequence alignment, phylogenetic analysis, and summary of results. Implementing a phylogenomic analysis is not just a matter of executing available tools for each of these steps. Among other challenges, results must be summarized across multiple steps, detailed records must be kept of all analysis steps, data files often need to be reformatted between analyses, and computational load must be balanced according to the available resources.

Because phylogenomic studies are complex and have been manual or semi-automated, they are difficult to implement and explicitly describe, and require extensive technical effort to reproduce. These problems can make it difficult to evaluate results, integrate data across studies, expand analyses, or test the impact of alternative analysis approaches. In addition, manual analyses often include many subjective decisions that may impact the final results.

Some higher-level pipelines have addressed subsets of phylogenomic analyses. These tools include PartiGene [9], a pipeline to aid in the assembly and annotation of Sanger transcriptome data collected across a diversity of species, and SCaFoS [10], a semi-automated tool for the curation of super matrices from previously assembled transcriptomes. No existing tool, however, can execute a full phylogenomic analysis of modern sequence data.

We addressed these needs by developing Agalma, an automated phylogenomics workflow. Using Agalma, an investigator can conduct complete phylogenomic analyses, from raw sequence reads to preliminary phylogenetic trees, with a small number of high-level commands. The results are accompanied by detailed reports that integrate diagnostic information across data sets and analysis steps. In a first pass with Agalma, the investigator conducts the analysis under default settings. The investigator then consults the reports to consider how best to optimize the analyses, and easily re-executes them with updated settings to produce new matrices and preliminary trees. The investigator can then analyze the optimized matrices with external phylogenetic inference tools not already included within Agalma to explore other factors, such as model sensitivity.

Implementation

We built Agalma with BioLite [11], a generic bioinformatics pipeline framework. BioLite profiles memory and CPU use, tracks provenance of all data and analyses, and centralizes diagnostic reporting. Agalma is a modular workflow comprised of helper scripts and a series of pipelines. Each pipeline is made up of stages that call internal analysis functions (many implemented with the help of Biopython [12]) and wrappers from the BioLite Python module. The wrappers invoke command-line tools, which include external bioinformatics tools such as the Bowtie2 aligner [13] and Trinity assembler [14], as well as several C++ tools from BioLite. All calls to these external programs are logged, as are the version numbers of the programs and a hash of the binary so that it is possible to detect differences in compiler settings. This information about calls to external programs is available to the user via the agalma diagnostics command. We provide automated installation tools for Agalma, BioLite, and all required third party software. As a result, complete installation of Agalma and all dependencies requires only a handful of commands on OS X and Ubuntu, and the scripts facilitate installation on many other UNIX compliant systems as well. Installation details are provided in the README and INSTALL files included with BioLite and Agalma.

All interactions with Agalma are with the command agalma. The command agalma assemble, for example, provides access to the assemble pipeline. These commands include built-in help via the -h option, which provides details on all arguments and the parameters for external packages that are exposed to the user through Agalma. The command agalma -h provides help for the top-level wrapper. The command agalma assemble -h, for example, provides help for the assemble command. Additional details regarding the interface can be found in the README file.

The first step for analyzing each data set, whether it consists of raw reads to be assembled or of previously assembled gene predictions, is to catalog the data. This creates a database entry that includes sample metadata and the paths to the data files. Agalma has built-in support for transcriptome assembly of pair-end Illumina data only in FASTQ format. When analyzing public data, raw reads and associated metadata can be imported directly from the NCBI Sequence Read Archives (SRA) using the command sra import. This command downloads the reads for a given SRA accession number (experiment, study, sample, or run), converts them into FASTQ format, and populates the catalog with the corresponding data paths and metadata. When previously assembled gene predictions are used, the assembly files in FASTA format can be directly catalogued.

There are several distinct tasks subsequent to cataloging the data-sequence assembly, loading the genes into the database, and phylogenetic analysis. These tasks are described in detail in the README and TUTORIAL files provided with Agalma, and are briefly summarized below.

Assembly

The pipeline transcriptome runs an assembly from read filtering through assembly and preliminary annotation. In a typical analysis, transcriptome would be run once for each species for which raw Illumina transcriptome data are available. The transcriptomepipeline executes the following sub-pipelines, which can also be run individually:

  • sanitize filters raw paired-end Illumina data and randomizes the order of reads (maintaining read order between paired files) to facilitate subsequent subsetting. Reads are discarded if they do not meet a specified mean quality threshold, if they contain Illumina adapter sequences, or if the base composition is highly skewed (if any base represents either <5% or >60% of the sequence). This pipeline also generates FastQC [15] summaries of read quality.

  • insert_size uses subassemblies and Bowtie2 mapping to estimate the mean and variance of the insert size (i.e., the length of the fragment between the sequencing adapters). This information provides important feedback on the success of sample preparation, and is also used in some downstream analysis steps.

  • remove_rrna removes most reads that are derived from ribosomal RNA (rRNA). Removing rRNA in advance of assembly can reduce both the number of chimeric transcripts and the time required for assembly. This pipeline first assembles multiple subsets of reads. A range of subset sizes is used since the optimal number of reads for assembling a particular rRNA transcript depends upon multiple factors, including the fraction of reads that originate from rRNA and the uniformity of coverage across the rRNA transcripts (which can vary greatly, depending on how denatured the samples were prior to fragmentation). rRNA transcripts are then identified by blast comparison of these subassemblies to an included dataset of known rRNA sequences. The entire set of reads is then compared to the rRNA transcripts that are identified from the subassemblies, and any reads that Bowtie2 maps to them are removed. A plot of the distribution of reads across exemplar rRNA transcripts is shown to help evaluate rRNA assembly success. The top hit in the NCBI nt database is also provided as an independent check on sample identity and to help spot potential contaminants. The fraction of reads that derive from rRNA is also reported to aid in improving library preparations.

  • assemble filters the reads at a higher quality threshold and assembles them. Assemblies can be conducted under multiple protocols (such as multiple assemblers, or the same assembler under different settings). This pipeline can also assemble multiple subsets of different numbers of reads, which provides perspective on how sequencing effort impacts assembly results. The default assembler is Trinity [16]. The wrapper we have included in Agalma for running Trinity makes two improvements over the wrapper script that comes with Trinity. First, we have added a filter in between the Chrysalis and Butterfly stages to remove components that are smaller than the minimum transcript length parameter passed to Butterfly, since running Butterfly on these components will not yield a transcript. For the five assemblies in our test data set, this reduces the number of Butterfly commands from roughly 100,000 to 60,000. Second, we have replaced the ParaFly utility that is used for concurrent execution of the Butterfly commands with the GNU parallel tool [17] because it has better parallel efficiency. ParaFly executes the commands concurrently, but in blocks, so that the time to execute a block is the runtime of the slowest individual command. The runtimes can vary greatly because of variance in transcript length and complexity. In contrast, parallel load balances the commands across the available processors.

  • postassemble uses blastn against the rRNA reference sequences to identify rRNA transcripts (these could include low abundance transcripts, such as parasite contaminants, that were not removed as reads by remove_rrna) and screens against the NCBI UniVec database to identify vector contaminants (such as protein expression vector contaminants in the sample preparation enzymes, which we have encountered in multiple samples). The longest likely coding region per transcripts is identified and annotated using blastp against the NCBI SwissProt database. Reads are mapped to the assembly and all transcripts quantified using RSEM [18], and the splice variant with the highest expression is selected as the exemplar transcript for each gene for downstream analyses. It also produces a coverage map that helps assess the distribution of sequencing effort across genes. Finally, it uses blastx to compare all the transcripts against the NCBI SwissProt database to establish which are similar to previously known proteins.

Following these steps, the investigator can inspect the assembled data directly or load them into the Agalma database to prepare them for phylogenetic analysis.

Load genes into the local Agalma database

Subsequent phylogenetic analyses require that all gene sequences to be considered are loaded into the local Agalma database. The load command takes care of this process. In a typical analysis, load is executed once for each dataset that has been assembled by the transcriptome pipeline described above, and once for each set of gene predictions from external sources (e.g., externally assembled 454 transcriptome data or gene predictions from genome sequencing projects).

Phylogenetic analysis

Once assemblies for multiple species are loaded into the local Agalma database, the user carries out a phylogenomic analysis by consecutively executing the following pipelines:

  • homologize allows the user to specify which datasets to include in a particular phylogenetic analysis. It then uses an all-by-all tblastx search to build a graph with edges representing hits above a stringent threshold, and breaks the graph into connected components corresponding to clusters of homologous sequences with the Markov Clustering Algorithm (MCL) tool [19].

  • multalign applies sampling and length filters to each cluster of homologous sequences, and uses another all-by-all tblastx within each cluster to trim sequence ends that have no similarity to other sequences in the cluster (these could include, for example, chimeric regions). The sequences of each cluster are aligned using MACSE [20], a translation-aware multiple-sequence aligner that accounts for frameshifts and stop codons. Multiple sequence alignments are then cleaned with GBLOCKS [21]. Optionally, the alignments can be concatenated together to form a supermatrix.

  • genetree uses RAxML [22] to build a maximum likelihood phylogenetic tree for each cluster of homologous sequences. Gene trees can be filtered according to mean bootstrap support, which eliminates genes that have little phylogenetic signal [23] and reduces overall computational burden. This filter can be applied prior to running treeprune (described below), which has the added advantage of restricting ortholog selection to only well-supported gene trees. All options available in RAxML can be passed as optional arguments. If the input is a supermatrix consisting of concatenated orthologs, it builds a species tree.

  • treeprune identifies orthologs according to the topology of gene phylogenies, using a new implementation of the method introduced in a previous phylogenomic study [24]. It uses DendroPy [25] to prune each gene tree into maximally inclusive subtrees with no more than one sequence per taxon. Each of these subtrees is considered as a set of putative orthologs. The pruned subtrees are re-entered as clusters into Agalma’s database.

After treeprune, the user can build a supermatrix and a preliminary maximum likelihood species tree with RAxML. These steps, which include rerunning multalign and genetree on the orthologs, are detailed in the Agalma TUTORIAL file. Agalma produces a partitions file that describes which regions of the supermatrix correspond to which genes. The user can then proceed with more extensive phylogenetic analyses of the supermatrix using external phylogenetic inference software of their choice (only RAxML is included with Agalma at this time). As the alignments for each gene are also provided, the investigator can also apply promising new approaches that simultaneously infer gene trees and species trees [26].

Test data and analyses

A small set of test data is provided with Agalma. It consists of 25,000 100bp Illumina read pairs for the siphonophore Hippopodius hippopus, a subset of 72-74 gene sequences assembled for each of five siphonophores, and a subset of 40 gene predictions from the Nematostella vectensis genome assembly. These data were chosen because they run relatively quickly and enable testing of most commonly used features.

These test data serve several purposes. They allow a user to validate that Agalma is working correctly, and users are strongly encouraged to run this test with the command agalma test right after installation. This test analysis takes about 20 minutes on a 2-core 2 GHz computer. The test data also serve as the foundation for the example analysis described in the TUTORIAL file. For the developer, the agalma test command serves as a regression test to check if changes break existing features. We routinely run this test in the course of adding and refactoring code. The test data also serve as a minimal case study for developing new features without needing to first download and curate data.

Results and discussion

In addition to the small test data sets included with Agalma, here we present an example analysis of larger data sets from seven species. Though most phylogenetic analyses would include more taxa than this simple example case, the size of the dataset for each species is typical for contemporary phylogenomic analyses. This seven-taxon data set consists of new raw Illumina reads for five species of siphonophores, Abylopsis tetragona, Agalma elegans, Nanomia bijuga. Physalia physalis, Craseoa sp., and gene predictions for two outgroup taxa, Nematostella vectensis[27] and Hydra magnipapillata[28], produced by previous genome projects. mRNA for the five siphonophore samples was isolated with Dynabeads mRNA DIRECT (Life Technologies) and prepared for sequencing with the TruSeq RNA Sample Prep Kit (Illumina). The sample preparation was modified by including a size selection step (agarose gel cut) prior to PCR amplification. Analyses were conducted with Agalma at git commit dc549d23 and BioLite at git commit 025fe65e (versions 0.3.4 with patches).

We deposited the new data in a public archive (NCBI Sequence Read Archive, BioProject PRJNA205486) prior to running the final analysis. A git repository of the scripts we used to execute the example analyses is available at https://bitbucket.org/caseywdunn/dunnhowisonzapata2013. These scripts download the data from the public archives, execute the analyses, and generate the analysis reports. All of the figures presented here are taken from the analysis reports generated by Agalma. This illustrates how a fully reproducible and open phylogenomic analysis can be implemented and communicated with Agalma. These scripts can be used as they are to repeat the analyses. They could also be modified to try alternative analysis strategies on these same data, or they could be adapted to run similar analyses on different data.

Assembly

The tabular assembly report (index.html in Additional file 1) summarizes assembly statistics across samples, and links to more detailed assembly reports for each sample. For the example analysis, this summary indicates, among other things, that the fraction of rRNA in each library ranged from 0.4% to 27.2% and the insert sizes were on average 266 bp long. The detailed assembly reports have extensive diagnostics that pertain to sample quality and the success of library preparation. As an example, Figure 1 shows several of the plots from the detailed assembly report for Agalma elegans, the siphonophore after which our tool is named. The distribution of sequencing effort across transcripts (Figure 1a) and the size distribution of transcripts (Figure 1b) are typical for de novo Illumina transcriptome assemblies.

Figure 1
figure 1

Transcriptome assembly statistics for Agalma elegans . (a) Cumulative read coverage, sorted from the most frequently sequenced exemplar transcripts to the least frequently sequenced transcripts. Most reads map to a small fraction of the genes, as is typical for a transcriptome sequencing project. (b) The size distribution of the assembled transcripts, for all transcripts and for only those transcripts that have a blastx hit to a protein in the Swissprot database. Most small transcripts do not have blastx hits and likely include some non-coding RNA and many assembly errors. These are removed prior to phylogenetic analysis. See Additional file 1 for further assembly diagnostics.

Phylogenetic analyses

The Agalma phylogeny report includes a plot of the number of sequences considered at each stage of the analysis (Additional file 2). In the example analysis, the step that removed the most sequences was the first step of homology evaluation in the homologize pipeline (Figure 2). This reduction is due to low similarity to other sequences and clusters with poor taxon sampling. The next major reduction occurred in cluster refinement in multalign. This reduction is largely due to the elimination of clusters that failed the taxon sampling criteria, and reflects uncertainty regarding the homology of some sequences and sparse sampling of some homologs. The next major reduction in the number of genes occurred in treeprune. These reductions are due to both uncertainty regarding orthology and poor sampling of some ortholog groups. The preliminary species tree for the example analysis (Figure 3) is congruent with previous analyses of siphonophore relationships [29].

Figure 2
figure 2

The number of gene sequences under consideration at each stage of super matrix construction. Most gene sequences are eliminated in the first step of homology evaluation (the parse_edges stage of the homologize pipeline), due to low similarity to other sequences. Of the remaining sequences, many are eliminated during multiple sequence alignment (the multalign pipeline) in clusters failing taxon sampling and cluster size criteria reflecting uncertainty in homology across species, and during orthology evaluation (the treeprune pipeline) mainly reflecting poor sampling of some ortholog groups. See Additional file 2 for further diagnostics regarding matrix construction.

Figure 3
figure 3

The preliminary maximum likelihood phylogeny resulting from the example analysis. This unrooted tree was inferred from the protein supermatrix under the W A G+Γ model.

Resource utilization

Phylogenomic analyses are computationally intensive. Detailed information about resource utilization helps investigators plan resources for projects and balance computational load more efficiently. It is also critical for the optimization of the analyses, and can help guide design decisions. For each analysis, Agalma produces a resource utilization plot that displays the time and maximum memory used by external executables (Figure 4). The peak memory use, and the longest-running step, in the transcriptome pipeline was the call to Trinity during the assemble stage.

Figure 4
figure 4

A profile of computational resource utilization for the transcriptome pipeline. This plot is from the report for the Agalma elegans assembly. Although annotation of transcripts (the calls to blastx and blastp during the postasemble stage) showed the highest processing time (top row), assembly (the call to Trinity during the assemble stage) displayed the highest memory use and it took the longest real time to finish (bottom row).

Conclusions

A distinction is sometimes drawn between manual approaches that enable close user inspection of data and results, and automated approaches that isolate the user from their results. This is a false dichotomy-automating analyses and examining the results closely are not mutually exclusive. Automated analyses with detailed diagnostics provide the best of both worlds. The user has a very detailed perspective on their analysis, and the added efficiencies of automation leave the investigator with far more time to assess these results. Automation also allow improvements made in the context of one study to be applied to other studies much more effectively.

For a study to be fully reproducible, both the data and the analysis must be described explicitly and unambiguously. The best description of an analysis is the code that was used to execute the analysis. By automating phylogenomic analyses from data download through matrix construction and preliminary phylogenetic trees, Agalma enables fully reproducible phylogenomic studies. This will allow reviewers and readers to reproduce an entire analysis exactly as it was run by the authors, without needing to re-curate the same dataset or rewrite the analysis code.

There are alternative approaches to many of the steps in a phylogenomic analysis presented here. There are, for example, multiple tools that identify orthologs according to different methods and criteria [30, 31]. Agalma is a general framework and can be expanded to include these additional methods, and directly compare them in the context of a complete workflow that is consistent in all other regards.

Availability and requirements

Project name: Agalma

Project home page: https://bitbucket.org/caseywdunn/agalma

Operating system(s): Linux, Mac OS X

Programming language: Python

Other requirements: BioLite, and the dependencies listed in the BioLite INSTALL file

License: GNU

Author’s contributions

CWD and MH conceived of the original design of Agalma, and together with FZ made most design decisions. MH, CWD, and FZ implemented and tested the software and wrote the manuscript. MH did most of the coding. MH and FZ conducted the computational analyses. All authors read and approved the final manuscript.