Background

MicroRNAs (miRNAs) are a class of endogenous, non-coding small RNAs that post-transcriptionally regulate target genes [1]. miRDeep-P [2], which is based on the miRDeep [3] algorithm, is one of the most commonly used computational tools for plant miRNA identification.

The most challenging problem in identifying novel plant miRNAs is finding a suitable genomic region to serve as a miRNA precursor candidate (to test whether it forms a hairpin), because the majority of plant miRNA precursors are between 100 and 200 bp long [4], much longer than those in animals. Approaches that use a shorter candidate region may produce false negatives if the true precursor is longer and more variable than the predicted precursor region. Conversely, using a longer candidate region to test for a hairpin structure may result in a non-complementary match for the mature miRNA within the candidate precursor. Thus, in miRPlant, after small RNA sequencing reads are mapped to the genome, the genomic regions around mapped reads are extended by 200 bp to determine whether they form hairpin structures. To ensure that short plant miRNA precursors are also detected, we additionally scan 100 bp regions for a hairpin. This strategy can detect bona fide miRNAs that would be missed if only the longer (200 bp) precursor candidate length were used.
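
The two-window strategy can be illustrated with a minimal Python sketch. This is not miRPlant's code (miRPlant is written in Java): the function name `candidate_windows`, the 0-based half-open coordinates, and the choice to place the read at either end of each window are assumptions made for illustration only.

```python
from typing import List, Tuple


def candidate_windows(read_start: int, read_end: int, chrom_len: int,
                      lengths: Tuple[int, ...] = (200, 100)) -> List[Tuple[int, int]]:
    """Return candidate precursor regions (0-based, half-open) around a read.

    Assumption: for each window length the read is placed once at the left
    end and once at the right end of the window, so that it always sits at
    the end of one arm of a potential hairpin.
    """
    windows: List[Tuple[int, int]] = []
    for length in lengths:
        # Read at the left end of the window: extend downstream.
        downstream = (read_start, min(chrom_len, read_start + length))
        # Read at the right end of the window: extend upstream.
        upstream = (max(0, read_end - length), read_end)
        for window in (downstream, upstream):
            # Clipping at chromosome ends can create duplicates; keep one copy.
            if window not in windows:
                windows.append(window)
    return windows
```

Each returned region would then be passed to a secondary structure predictor; trying the 100 bp windows as well as the 200 bp ones is what allows shorter hairpins to be recovered.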

miRDeep-P and miRPlant use different strategies to determine the precursor region. miRDeep-P defines the precursor region from the genomic region covered by overlapping reads, whereas miRPlant defines it from the mature miRNA region (the most highly expressed read). The latter strategy can reduce the number of false negatives [5, 6], as it guarantees that the mature miRNA is located at the end of one arm of the stem loop.

To broaden research within this field, it is important that biologists with basic computer skills can easily use RNAseq tools. miRPlant was therefore developed in Java, a platform-independent language. A Graphical User Interface (GUI) allows a complete pipeline analysis to be run in a few clicks: raw data input (.fastq files) -> mapping (.bam files) -> miRNA identification, expression, and secondary structure display -> mRNA target prediction. To further improve accessibility, miRPlant does not require any third-party tools. miRPlant also provides a detailed but concise output display that can be exported for publication in file formats such as eps, pdf and svg (Figure 1). miRPlant images are generated dynamically.

Figure 1

Output display of predicted miRNA. The read location and number of reads are shown relative to the precursor hairpin structure. The red sequence represents the mature miRNA.

Implementation

miRPlant operations can be divided into the following stages:

  i. Filter out reads whose length falls outside the 10-23 bp range, or whose read quality is below the threshold set by the user.
  ii. Aggregate identical reads into one.
  iii. Map the aggregated reads to the reference genome without mismatches. miRPlant uses a Java implementation of the bowtie [7] alignment algorithm. Mapped reads are stored in BAM format. Note that, as introduced by miRDeep*, the attribute "XS" in the BAM file records the copy number of the read.
  iv. Gather the reference genome sequences flanking each RNAseq read (the precursor miRNA region) and determine whether the genomic region forms a hairpin structure using an RNA secondary structure prediction algorithm [8].
  v. Use the miRDeep model to calculate a score for each predicted miRNA as a measure of the strength of the prediction. A higher score equates to a higher probability that the predicted miRNA is true.
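
The first two stages (read filtering and aggregation) can be sketched in a few lines of Python. This is an illustrative sketch rather than miRPlant's Java code: the names `mean_phred` and `filter_and_collapse`, the use of mean Phred quality as the quality criterion, the Phred+33 encoding, and the default threshold of 20 are all assumptions. Only the 10-23 bp length range comes from the text above.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


def filter_and_collapse(reads: Iterable[Tuple[str, str]],
                        min_len: int = 10, max_len: int = 23,
                        min_quality: float = 20.0) -> Dict[str, int]:
    """Drop reads outside the length range or below the quality threshold,
    then aggregate identical sequences into one entry with a copy number."""
    counts: Counter = Counter()
    for sequence, quality in reads:
        if not (min_len <= len(sequence) <= max_len):
            continue  # stage i: length filter
        if mean_phred(quality) < min_quality:
            continue  # stage i: quality filter (hypothetical criterion)
        counts[sequence] += 1  # stage ii: identical reads share one entry
    return dict(counts)
```

The copy number kept for each unique sequence corresponds to the value that, according to the description above, is stored in the "XS" attribute of the mapped read.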

The miRPlant interface enables users to customize parameters, since different plant species may differ in miRNA biogenesis [2] (Figure 2). The default precursor miRNA length is set to 200 bp. Here, the precursor length is the length between the mature miRNA and the star miRNA; the two flanking sequences are excluded. Like miRDeep*, miRPlant generates six output files. Since plant miRNA precursors are much longer than those of animals, the distance between the mature miRNA and the star miRNA may be very long, which may result in the formation of an internal loop. Therefore, miRPlant allows for internal loops. The default minimum loop size (including the distance from the loop ends to the mature or star miRNA) is 25 bp. When predicting a mature miRNA, miRPlant requires that fewer than 10% of reads (the max inconRead Ratio option in the GUI) fall outside the predicted mature and star miRNA sequences. In miRDeep, RNAseq reads in the loop are counted as consistent, but plant miRNAs have very long loops. Thus, we exclude reads located within the loop region. The other parameters are the same as in miRDeep*.
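
The read-consistency criterion can be made concrete with a small Python sketch. It is one possible reading of the rule, not miRPlant's implementation: the function names, the half-open precursor coordinates, and the decision to drop loop reads from both the numerator and the denominator are assumptions.

```python
from typing import Iterable, Tuple

Region = Tuple[int, int]  # (start, end), half-open, precursor coordinates


def _within(inner: Region, outer: Region) -> bool:
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def inconsistent_read_ratio(reads: Iterable[Tuple[int, int, int]],
                            mature: Region, star: Region) -> float:
    """Fraction of reads falling outside the mature and star sequences.

    Each read is (start, end, copy_number). Reads lying entirely inside the
    loop between the two arms are ignored (assumption, see lead-in).
    """
    loop = (min(mature[1], star[1]), max(mature[0], star[0]))
    consistent = inconsistent = 0
    for start, end, count in reads:
        read = (start, end)
        if _within(read, mature) or _within(read, star):
            consistent += count
        elif _within(read, loop):
            continue  # loop reads are excluded from the calculation
        else:
            inconsistent += count
    total = consistent + inconsistent
    return inconsistent / total if total else 0.0


def passes_consistency(ratio: float, max_ratio: float = 0.10) -> bool:
    """Candidate is kept only if fewer than 10% of reads are inconsistent."""
    return ratio < max_ratio
```

For example, with 95 reads on the mature and star sequences, 20 reads in the loop, and 5 reads straddling the mature boundary, the ratio is 5 / 100 = 0.05 and the candidate passes the default threshold.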

Figure 2

Parameter settings for miRPlant. Adapter sequences need to be replaced as appropriate. The processing performed by miRPlant depends on the extension of the input file: mapping and identification are performed if the extension is “.fastq” or “.fa”, and only identification is performed if the extension is “.bam”. Output “.result” files are shown after clicking “submit”.
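
The extension-based behaviour described in the caption amounts to a simple dispatch, sketched here in Python. The function name `stages_for_input`, the stage labels, the case-insensitive comparison, and the error raised for other extensions are assumptions, not documented miRPlant behaviour.

```python
from pathlib import Path
from typing import List


def stages_for_input(filename: str) -> List[str]:
    """Return the processing stages implied by the input file extension."""
    suffix = Path(filename).suffix.lower()  # case-insensitivity is assumed
    if suffix in {".fastq", ".fa"}:
        # Raw reads: map to the genome first, then identify miRNAs.
        return ["mapping", "identification"]
    if suffix == ".bam":
        # Already mapped reads: identification only.
        return ["identification"]
    raise ValueError(f"Unsupported input file extension: {suffix!r}")
```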

Results and discussion

miRPlant has been tested on two rice datasets [9]. Both miRPlant and miRDeep-P employ the miRDeep score calculation, but miRPlant performs better than miRDeep-P (Table 1), largely because miRPlant uses a flexible method to form precursor candidates from the genomic region surrounding RNAseq reads. We set a minimum score of four when using miRPlant. A detailed summary of the results can be found in Additional file 1; the RNAseq datasets are available under GEO accession numbers GSM278571 and GSM278572.

Table 1 Comparison table

To further confirm the advantages of miRPlant, we extended this analysis to three more species (Arabidopsis thaliana, Medicago truncatula and Prunus persica), comprising 16 small RNA sequencing datasets (detailed information in Additional file 2). To compare the two tools, we ranked the predicted miRNAs of each tool in descending order of score and then took the top 100 miRNAs from miRPlant and from miRDeep-P for comparison. miRPlant consistently outperforms miRDeep-P in all samples (Table 2, Additional files 3 and 4).
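
The ranking step of this comparison can be sketched as follows. The sketch is illustrative only: the names `top_n` and `count_known`, the (identifier, score) representation of a prediction, and the idea of scoring each list against a set of annotated miRNAs are assumptions, since the exact evaluation measure is given in the tables rather than in the text.

```python
from typing import Iterable, List, Optional, Set, Tuple

Prediction = Tuple[str, float]  # (miRNA identifier, miRDeep-style score)


def top_n(predictions: Iterable[Prediction], n: int = 100,
          min_score: Optional[float] = None) -> List[Prediction]:
    """Rank predictions by descending score and keep the first n.

    An optional minimum score (e.g. 4) is applied before ranking.
    """
    kept = [p for p in predictions if min_score is None or p[1] >= min_score]
    return sorted(kept, key=lambda p: p[1], reverse=True)[:n]


def count_known(ranked: Iterable[Prediction], known: Set[str]) -> int:
    """Count ranked predictions that appear in a reference set of
    annotated miRNAs (hypothetical evaluation measure)."""
    return sum(1 for name, _ in ranked if name in known)
```

Applying `top_n` to the output of each tool and then comparing the two `count_known` values on the same reference set gives a like-for-like comparison at a fixed list length.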

Table 2 Comparison table (ATH, MTR, PPE)

Conclusions

miRPlant is modelled on miRDeep* [5] and is designed for use with plant small RNA sequencing data. Functions that would otherwise require third-party tools, such as genomic mapping and RNA secondary structure prediction [8], have been integrated into a Java library that is built into miRPlant.

Availability and requirements

Project name: miRPlant.

Project home page: http://www.australianprostatecentre.org/research/software/mirplant.

Operating system(s): Windows, Linux, Mac OS.

Programming language: Java.

Other requirements: JRE.

License: GNU General Public License.

Any restrictions to use by non-academics: None.