The TrEase web service: inferring phylogenetic trees with ease

Phylogenetic inference is done regularly in many biological studies that do not focus on the phylogeny itself but use it as a tool to derive hypotheses for the interpretation of laboratory experiments. However, phylogenetic inference is often performed to low standards in these studies, which can result in wrong interpretations. Using high-standard phylogenetic inference tools usually requires substantial methodological knowledge or is only possible with paid tools. To give beginners, researchers for whom phylogeny is just one tool of many, and scientists in biodiversity discovery quick and easy access to current phylogenetic methods, the TrEase web service with an intuitive interface was developed. It offers a complete pipeline of commonly used phylogeny-related software, which can run the whole process of sequence acquisition, reference sequence search, alignment and phylogenetic inference with a single execution command once the necessary parameters have been selected from drop-down menus. It allows alternative programmes to be chosen for each step and also offers the flexibility to use any part of the pipeline independently. In addition to providing a compact pipeline, the web service offers several functions that avoid manual intervention, such as sorting sequences into the same orientation, trimming reference sequences, removing redundant sequences and choosing reference sequences from top ‘species’ hits instead of top GenBank entry hits. All resulting trees and intermediate files are made available for download for subsequent use. Thus, the TrEase service offers a barrier-free entry into standard phylogenetic analyses. It is available at http://thines-lab.senckenberg.de/trease.


Introduction
Inferring phylogenetic relationships among organisms or genes of a certain gene family within an organism as accurately as possible is a prerequisite for meaningful evolutionary deductions from phylogenetic data. For investigating phylogenetic relationships, a series of tasks is often necessary, which may include several of the following steps after sequence data have been generated for one or several loci from one or several organisms:
• Fetching homologous sequences (reference sequences) from databases to increase the gene or taxon sampling.
• Aligning the sequences, with or without the newly obtained dataset of the reference sequences, in a multiple sequence alignment (MSA), applying one of several algorithms available.
• Editing the MSA (e.g. using refinement software to remove leading and trailing gaps or low complexity regions).
• Conducting phylogenetic analyses based on the MSA using one or ideally several different algorithms.
• Comparing the support values from different methods and producing a tree figure with a tree from a particular analysis and added support values from other analyses onto its common branching points.
Section Editor: Tanay Bose
There are several ways of performing the above tasks, and most involve a mix of several software packages and manual steps, as the phylogenetic analysis functions built into many user-friendly sequence editing tools are insufficient for detailed phylogenetic analyses. For several dedicated and accurate phylogenetic tools, graphical user interfaces (Mishra and Thines 2014; Silvestro and Michalak 2012) or web servers (Guindon et al. 2005; Loytynoja and Goldman 2010; Boc et al. 2012) are available. In addition, there are some web servers on which pipelines (Lin et al. 2005; Dereeper et al. 2008) can be used to perform different parts of the analyses; a few also provide the possibility to build a customised pipeline from a list of software available on that server (Miller et al. 2015).
However, a web service is currently lacking on which all the steps outlined above could be integrated in a manner that would also allow users with only basic knowledge and bioinformatics skills to produce meaningful and solid phylogenetic inferences easily. Therefore, the phylogenetic analyses service TrEase was set up to provide a barrier-free entry into up-to-date phylogenetic analyses. The front end of TrEase is a single-page, form-based web service that provides important parameters for all the different steps towards the generation of a publication-ready phylogenetic tree, starting from a set of unaligned input sequences.
This resource note describes the TrEase web service and explains the use of the different software tools as well as the handling of their output. In addition, it describes the additional functionalities of refined reference sequence picking and the experimental feature of merging support values from different phylogenetic tree reconstruction programmes onto a user-defined tree.

Input format
Currently, the TrEase web service deals only with nucleotide sequences. An extension to protein sequences will be integrated with the next release. Sequence data should preferably be provided in fasta format. Either single or multiple fasta sequences can be provided as a single file or be pasted directly into the text field of the form. A multiple sequence alignment in fasta format can also be uploaded in case the user wants to do only phylogenetic tree inference on the server (in this case, the alignment option should be deselected).

Workflow summary for the TrEase server
The workflow of the TrEase server is summarised in Fig. 1 and is divided into five sections.
• After the user's input sequences have been read and their validity has been assessed, reference sequences are fetched from a locally stored sequence database, which is updated weekly to stay synchronised with the NT database at NCBI (ftp://ftp.ncbi.nlm.nih.gov/blast/db/), using either the blastn or the megablast algorithm of the BLAST+ tool (Camacho et al. 2009), according to the preference of the user.
• The original input sequences, together with the non-redundant reference sequences, are then subjected to multiple sequence alignment using either MUSCLE (Edgar 2004), MAFFT (Katoh and Standley 2013), or prank (Loytynoja 2014), according to the choice of the user.
• In the next step, the resulting alignment can be refined by removing gapped, poorly aligned regions and regions of low complexity using Gblocks (Castresana 2000). If this option is not chosen, users are advised to download the alignment and trim leading and trailing gaps before continuing to the next step.
• Subsequently, phylogenetic inference can be done using Minimum Evolution as implemented in FastTree2 (Price et al. 2010), Maximum Likelihood as implemented in RAxML (Stamatakis 2014) and Bayesian phylogenetic inference as implemented in MrBayes (Ronquist et al. 2012).
• Finally, if this option is chosen, the confidence values of the different trees can be drawn onto a single tree chosen by the user. This support value merger is functional but still experimental and will be refined in upcoming releases.
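For readers who wish to reproduce comparable steps on their own machine, the sketch below assembles command lines for the reference search, alignment, refinement and Minimum-Evolution tree steps. It is only an illustration: the exact invocations used by the TrEase server are not given in this note, the file names (`input.fasta`, `combined.fasta`, `aligned.fasta`), the thread count and the function name `build_commands` are hypothetical, and the E-value is passed on verbatim in the notation used in the text. The options shown are ones offered by the command-line versions of BLAST+, MAFFT, Gblocks and FastTree.

```python
def build_commands(query="input.fasta", db="nt", hits=3, evalue="10e-10",
                   task="megablast", threads=4):
    """Return illustrative argument lists for four pipeline steps.

    All file names are placeholders; nothing is executed here.
    """
    # Reference search: -task selects megablast or blastn,
    # -outfmt 6 requests tab-separated output that is easy to parse.
    blast = ["blastn", "-task", task, "-query", query, "-db", db,
             "-evalue", evalue, "-max_target_seqs", str(hits),
             "-outfmt", "6", "-num_threads", str(threads)]
    # MAFFT's G-INS-i strategy: global pairwise alignment plus
    # iterative refinement.
    mafft = ["mafft", "--globalpair", "--maxiterate", "1000",
             "--thread", str(threads), "combined.fasta"]
    # -t=d tells Gblocks that the alignment contains DNA.
    gblocks = ["Gblocks", "aligned.fasta", "-t=d"]
    # -nt selects nucleotide data; -noml switches off the
    # Maximum-Likelihood phase so that only Minimum Evolution is used.
    fasttree = ["FastTree", "-nt", "-noml", "aligned.fasta"]
    return {"blast": blast, "mafft": mafft,
            "gblocks": gblocks, "fasttree": fasttree}


if __name__ == "__main__":
    for step, argv in build_commands().items():
        print(step, ":", " ".join(argv))
```

Each list could be handed to `subprocess.run` once the corresponding programme is installed; MAFFT and FastTree write their result to standard output, so it would have to be redirected to a file.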

Initial assessment and processing of the input data
The data that can currently be processed are nucleotide sequences; standard ambiguity codes are allowed. The maximum sequence length allowed is 3000 bases and the maximum number of sequences allowed is 1500. Gaps are allowed in the data in case a user-generated multiple sequence alignment is uploaded. If a gapped multiple sequence alignment is provided for the analyses and the reference search option is chosen, gaps are removed before the blastn or megablast search. If the assessment reveals that the input data do not meet these criteria, the user is redirected to the main page of the server after the reason for the redirection has been stated.
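The checks listed above can be illustrated with a short Python sketch. It is not the server's own code (the backend is written in PHP and Perl); the function names are hypothetical, and it assumes that the length limit is counted without gap characters and that the allowed alphabet is the IUPAC nucleotide code set.

```python
MAX_LENGTH = 3000      # maximum bases per sequence, as stated in the text
MAX_SEQUENCES = 1500   # maximum number of sequences, as stated in the text
# IUPAC nucleotide codes including ambiguity codes; '-' marks alignment gaps.
ALLOWED = set("ACGTURYSWKMBDHVN-")


def parse_fasta(text):
    """Return a list of (name, sequence) pairs from fasta-formatted text."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def check_input(text):
    """Return a list of problems; an empty list means the input passes."""
    records = parse_fasta(text)
    problems = []
    if not records:
        problems.append("no FASTA records found")
    if len(records) > MAX_SEQUENCES:
        problems.append(f"too many sequences: {len(records)}")
    for name, seq in records:
        # Assumption: gaps do not count towards the length limit.
        if len(seq.replace("-", "")) > MAX_LENGTH:
            problems.append(f"{name}: longer than {MAX_LENGTH} bases")
        bad = set(seq.upper()) - ALLOWED
        if bad:
            problems.append(f"{name}: unexpected characters {sorted(bad)}")
    return problems


def strip_gaps(seq):
    """Remove gap characters before a reference search."""
    return seq.replace("-", "")
```

A non-empty list returned by `check_input` corresponds to the situation in which the server states the reason and redirects to the main page.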

Fetching reference sequences
The standalone BLAST+ tool (Camacho et al. 2009) is used to fetch reference sequences from a locally stored nt (nucleotide) database, using either the blastn or the megablast algorithm. The user can choose how many references are to be fetched per query. For example, if the user chooses to acquire three reference sequences but only one reference sequence per species, an inbuilt parser picks the first BLAST hit for a particular species, skips all other hits from the same species, moves on to the first hit from a different species, and continues in this way until hits from three different species have been collected. In addition to the choice between the top three sequence hits and the top three species hits, there is an option to exclude environmental sequences during the BLAST search. Only BLAST hits with an E-value of at most 10e−10 are retained in any case. After the reference sequences have been fetched, the length of each sequence is checked, and reference sequences that are more than 1.5 times longer than the query sequence are trimmed at one or both ends so that they are at most 1.5 times longer than the query sequence (Figure S1).
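The difference between picking top sequence hits and top species hits can be made concrete with a small Python sketch. It assumes that the BLAST hits have already been parsed into (accession, species) pairs in rank order; the function name, the accessions and the species labels are hypothetical, and the actual parser used by TrEase is not described in further detail here.

```python
def pick_references(hits, n_refs=3, one_per_species=True):
    """Pick reference accessions from BLAST hits given in rank order.

    hits: list of (accession, species) tuples, best hit first.
    With one_per_species=True, only the best hit of each species is kept.
    """
    picked, seen_species = [], set()
    for accession, species in hits:
        if one_per_species:
            if species in seen_species:
                continue  # a better hit of this species was already taken
            seen_species.add(species)
        picked.append(accession)
        if len(picked) == n_refs:
            break
    return picked


if __name__ == "__main__":
    example_hits = [
        ("ACC1", "Species x"), ("ACC2", "Species x"), ("ACC3", "Species y"),
        ("ACC4", "Species x"), ("ACC5", "Species z"), ("ACC6", "Species w"),
    ]
    print(pick_references(example_hits))
    print(pick_references(example_hits, one_per_species=False))
```

With the example list, the species-based mode returns the first hit of each of three different species, whereas the entry-based mode returns the first three hits regardless of species.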
In the case of trimmed sequences, the sequence name indicates the trimming points of the original sequence fetched. The orientation of each BLAST hit is checked, and hits that match the query in reverse-complementary orientation are reverse-complemented to match the orientation of the query. Duplicate sequences are removed from the BLAST hits using a Perl script, and the information regarding the duplication is stored as a table in CSV format. The non-redundant final set of reference sequences is then added to the original input fasta sequences and passed on to the next part of the pipeline. Parallelisation is implemented for the reference search using BLAST when multiple sequences are provided, reducing the run-time to 1/30th. The default setting for this part of the pipeline is the fetching of the 3 top BLAST hits with an E-value cut-off of 10e−10, using the megablast algorithm and not excluding environmental sequences.
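The orientation and redundancy steps can be sketched as follows. This is an illustrative Python version, not the Perl script used on the server; it assumes that each hit carries a strand label (`"plus"` or `"minus"`, as reported by BLAST), that two hits count as duplicates when their oriented sequences are identical, and that the CSV column names are free to choose.

```python
import csv
import io

# Complement table for IUPAC nucleotide codes (S, W and N are their
# own complements and therefore left unchanged).
_COMPLEMENT = str.maketrans("ACGTURYKMBDHVacgturykmbdhv",
                            "TGCAAYRMKVHDBtgcaayrmkvhdb")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def orient_and_deduplicate(hits):
    """Orient hits like the query and drop identical sequences.

    hits: list of (name, sequence, strand) tuples.
    Returns (unique_records, duplicates), where duplicates is a list of
    (removed_name, name_of_identical_kept_sequence) pairs.
    """
    unique, first_name, duplicates = [], {}, []
    for name, seq, strand in hits:
        if strand == "minus":
            seq = reverse_complement(seq)
        key = seq.upper()
        if key in first_name:
            duplicates.append((name, first_name[key]))
        else:
            first_name[key] = name
            unique.append((name, seq))
    return unique, duplicates


def duplicates_as_csv(duplicates):
    """Format the duplicate table as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["removed", "identical_to"])
    writer.writerows(duplicates)
    return buffer.getvalue()
```

Orienting the hits before comparing them means that two database entries deposited in opposite orientations are still recognised as the same sequence.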

Multiple sequence alignment
Three tools, MUSCLE (Edgar 2004), MAFFT (Katoh and Standley 2013), and prank (Loytynoja 2014), are implemented in TrEase, with their versions regularly updated. MUSCLE performs a progressive alignment of the sequences using a log-expectation profile scoring function. MAFFT is based on the fast Fourier transform method, which allows rapid detection of homologous parts of the sequences, and works well with sequences having large insertions or extensions as well as with distantly related sequences of similar length. The implementation of multithreading in MAFFT makes it faster when a large number of sequences is aligned. Prank is a probabilistic multiple sequence alignment tool that computes phylogeny-aware alignments. It has a specific option for performing codon alignments, which is not available in the other alignment tools.
The user can choose the preferred software for multiple sequence alignment and can also set the important parameters for that software. The default tool for multiple sequence alignment is MAFFT in conjunction with the G-INS-i algorithm.

Refinement of the alignment
Gblocks (Castresana 2000), with its default parameters, is implemented for refining the MSA by removing gapped and poorly aligned regions (e.g. low complexity regions) from the alignment. By default, the refinement option is not selected for the pipeline and needs to be selected if the user prefers using it. As mentioned above, if this option is not chosen, it is recommended to download the alignment and to remove leading and trailing gaps before moving to the phylogenetic tree reconstruction.

Phylogenetic tree reconstruction
Three different phylogenetic analysis programmes, FastTree2 (Price et al. 2010), RAxML (Stamatakis 2014) and MrBayes (Ronquist et al. 2012), are implemented in TrEase and their versions are regularly updated. FastTree, by default, implements Minimum-Evolution nearest-neighbour interchanges (NNIs) and subtree-pruning-regrafting moves (SPRs) as well as Maximum-Likelihood NNIs, but in this pipeline the Maximum-Likelihood calculation is turned off, allowing only Minimum-Evolution NNI and SPR calculations. This method is 100 to 1000 times faster than other algorithms for large alignments and is, thus, especially suited for preliminary analyses. RAxML is a widely used phylogenetic programme that implements the complex Maximum Likelihood algorithm in a memory-efficient way. MrBayes is a Bayesian inference programme that uses Metropolis-coupled Markov chain Monte Carlo calculations and infers phylogenies based on posterior probabilities.
The user can choose to use all or a subset of these tools for phylogenetic tree reconstruction. Both RAxML and MrBayes run on multiple threads to accelerate the analysis. Some of the most important parameter values can be selected by the user. By default, FastTree2 and RAxML are run with 1000 bootstrap replicates, and for MrBayes, one million generations are run.

Merging support values of phylogenetic trees
If more than one phylogenetic tree reconstruction method has been selected, the results page offers the option of showing the support values from all trees on one tree at the common branching points. The user needs to select at least a primary tree and a second tree, whose branches are compared to find the common nodes. In the case of three selected trees, the merged support value is generated as a nine-digit number in the format 'XXXYYYZZZ', where XXX is the support value from the selected primary tree, YYY is the support value from the second tree and ZZZ is the support value from the third tree. In the case of two selected trees, the merged support value is in the format 'XXXYYY'.
When one of the trees is missing a support value at a common node, it is denoted by 000, and two-digit support values are preceded by a 0 to represent the values as three digits. The Newick-formatted tree with merged values can be viewed in FigTree (http://tree.bio.ed.ac.uk/software/figtree/). From FigTree, the tree can be exported as a vector graphic, so that the support values can be manually arranged in the way preferred by the user, e.g. by adding dashes or hyphens between the individual support values.
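The merged label format can be illustrated with a few lines of Python. The sketch assumes that all support values have already been brought to a 0 to 100 scale and that a missing value is represented by `None`; the function names are hypothetical, and padding one-digit values to three digits is an extension of the two-digit rule stated above.

```python
def merge_support(values):
    """Concatenate support values into one zero-padded label.

    values: support per tree, primary tree first; None marks a node that
    is absent from (or unsupported in) that tree.
    """
    parts = []
    for value in values:
        if value is None:
            parts.append("000")            # missing value, as in the text
        else:
            parts.append(f"{int(round(value)):03d}")
    return "".join(parts)


def split_support(label, n_trees):
    """Split a merged label back into its three-digit components."""
    return [int(label[i * 3:(i + 1) * 3]) for i in range(n_trees)]


if __name__ == "__main__":
    print(merge_support([100, 87, None]))   # three trees selected
    print(merge_support([95, 7]))           # two trees selected
    print(split_support("100087000", 3))
```

Because every component has a fixed width of three digits, the label can be split again unambiguously, for example to rewrite it with separators before the figure is finalised. A true support of 0 and a missing value both appear as 000 and cannot be told apart afterwards.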

Computational efficiency of the TrEase pipeline
The TrEase pipeline has been tested on a server with 64 AMD processors running at 2.4 GHz and 512 GB RAM. It is a stand-alone machine, and its computational efficiency should not be compared to that of grid or cluster-based high-performance computing facilities. However, if the server load becomes too high to ensure efficient processing of incoming requests, additional servers will be made available to the pipeline.
To estimate the time taken by the individual pieces of software in the pipeline, two sets of input files were prepared: one set with 300 LSU sequences of 1000 to 1500 bases in length, selected from the NR (non-redundant nucleotide) database at NCBI, and the other set with 500 ITS sequences of 600 to 800 bases in length from oomycetes. These two sequence sets were run separately on the TrEase server. The times taken for the different processes are presented in Figure S2.

Architecture and availability
TrEase is accessible at http://www.thines-lab.senckenberg.de/trease and its backend relies on HTML, PHP and Perl on an Apache web server. The results of the analyses on the server are stored for retrieval within ten days of submission of the job. Afterwards, all the results and related data are erased automatically.

Conclusions
The TrEase web service was built to simplify the workflow for phylogenetic analyses by reducing the need for manual inspection of intermediate results and format conversions, and by combining current phylogenetic software in a single web page that is easily accessible to the scientific community. On this server, after uploading the input sequences or pasting them into the text box, a single click on the 'perform the assigned task' button produces a Maximum Likelihood phylogenetic tree using the input sequences and three reference sequences per input sequence. The TrEase web service has been beta-tested since mid-2014, after its presentation at a conference. As a result of user feedback, TrEase has been established in its current form and will be updated regularly to take up suggestions from users and to include more options. We hope that both students and researchers will benefit from the TrEase web service, as it offers an easy entry into conducting meaningful phylogenetic analyses, also in areas of research in which phylogeny is otherwise often used as a minor tool, such as molecular genetics and physiology.