Background

International trade and consumer demand have increased the worldwide movement of plants and plant parts. At the same time, the global distribution and exchange of plant germplasm that support the improvement and expansion of agricultural and horticultural industries have also grown dramatically in recent years [1,2,3,4]. Imported plant germplasm must be thoroughly tested, and proper phytosanitary measures should be followed to minimize the risk of introduction of new pests and pathogens of quarantine relevance. Therefore, the development of comprehensive diagnostic methods to identify both known and unknown plant pathogens, as well as novel variants, is an important goal for testing plant material distributed at the global and national levels.

Quarantine centers, certification programs, and plant diagnostic clinics have been using traditional diagnostic techniques for virus detection, which include biological indexing (mechanical transmission using herbaceous indicator plants such as Chenopodium quinoa and Nicotiana tabacum), enzyme-linked immunosorbent assay (ELISA), polymerase chain reaction (PCR), and loop-mediated isothermal amplification (LAMP) [5, 6]. More recently, high-throughput sequencing (HTS), also known as next-generation sequencing (NGS) or deep sequencing, has been used by diagnosticians and researchers for detecting and identifying plant pathogens, which has resulted in a steady increase in the identification of plant pathogens affecting various crops [2, 7,8,9,10,11,12,13]. Since it does not require a priori knowledge of the phytosanitary status of the sample, HTS offers certain advantages when compared to targeted diagnostic techniques such as ELISA or PCR [10, 11, 14,15,16,17,18]. The use of HTS is becoming a gold standard across continents after the International Plant Protection Convention (IPPC) recommended it as a diagnostic tool for phytosanitary purposes in 2019 [19]. In 2022, the European and Mediterranean Plant Protection Organization (EPPO) released a standard for plant health diagnostics using HTS [20].

HTS-based plant pathogen detection involves two major strategies: amplicon sequencing, which uses the power of PCR to amplify specific standardized genetic marker(s), such as 16S rRNA gene for bacteria [21] or unique genomic regions of virus and viroid genomes [22]; or shotgun sequencing, which captures the complete nucleic acids present in a sample [20, 23]. Amplicon sequencing is popularly used for identification and comparison of entire microbial communities while shotgun sequencing has wider applications for uncovering novel and emerging pathogens [23]. Currently, total RNA sequencing (RNA-seq) and small RNA sequencing (sRNA-seq) are the two most widely used HTS shotgun approaches for the detection of plant viruses and viroids [27, 28]. sRNA-seq is designed for virus detection based on the plant viral response mechanism [29] while total RNA-seq, which had traditionally been used for the analysis of the transcriptomic landscape of the host, is now also used for the detection of plant pathogens such as bacteria (including phytoplasma), fungi, viruses, and viroids [8, 30,31,32]. Although RNA-seq is commonly used for virus and viroid detection by plant virologists and in routine diagnostic applications such as post-entry quarantine or certification [11, 33], non-viral pathogens or pests can also be detected from the same RNA-seq dataset used for virus detection [15]. Because of its broad detection spectrum and novel pathogen detection ability, it has recently become an increasingly popular tool for pathogen detection. It has been used in many crops, such as wheat [34], grapevine [35], citrus [36], fruit trees [37], cucurbit [38], sugarcane [10], grasses [7, 38] etc. for viral pathogen detection.

One of the drawbacks of the RNA-seq approach is that it requires higher titer levels of the pathogen expressed in the hosts, which corresponds to a high number of pathogen-derived reads in the data [15]. The RNA-seq approach could miss pathogens with low titer. Another obstacle for using RNA-seq to detect pathogens is its demanding bioinformatic requirements for microbiome analysis. To our knowledge, there are few broadly accepted standard methods for RNA-seq data generation, processing, and analysis [25, 39, 40]. The most used analysis methods or pipelines can fall into three main categories: (i) mapping sequence reads directly to reference genomes from known pathogens such as Pathoscope [41] and CAMAMED [42]; (ii) assembling sequence reads and annotating contigs such as VirFind [43], VSD toolkit [44], VirusDetect [45], and Virtool [37]; and (iii) read-based taxonomic assignments such as Kaiju [46], Kraken2 [47], and Kodoja [48]. However, these methods or pipelines do not offer an integrated sequence read quality control, read assembly, pathogen reference mapping, and read classification to identify known pathogens and discover novel species, which is a common occurrence during plant virus detection. Therefore, an integrative and comprehensive pipeline is needed to detect the presence of a wide range of potential plant pathogens in a specimen.

Here we present Phytosanitary Pipeline (PhytoPipe), an integrative pipeline for plant pathogen identification using RNA-seq data. The pipeline combines current tools for HTS read quality control, host read filtering, read assembly, contig annotation, reference mapping, and taxonomic classification. PhytoPipe is capable of identifying known bacteria (including phytoplasma), fungi, oomycetes, viruses, and viroids, as well as possible novel viruses. Furthermore, the use of the Snakemake workflow management system [49] allows for an efficient and automated deployment on a local multicore computer, computing cluster, or a cloud environment.

Implementation

The PhytoPipe framework uses the Snakemake workflow management system [49] to organize sequence data processing tools. These tools have been organized into four distinct modules: reads preprocessing (Fig. 1A), reads classification (Fig. 1B), assembly-based annotation (Fig. 1C), and reference-based mapping (Fig. 1D). Each module summarizes the results into HTML or Krona reports and tables (Fig. 2). PhytoPipe can be easily set up in a Linux or Mac environment or run using the PhytoPipe docker image [50] (https://hub.docker.com/r/healthyplant/phytopipe) on a Linux, Mac, or Windows system. PhytoPipe requires at least 300 GB of RAM, 1 TB of fast storage, and a multicore (> 32 cores) parallel computing environment. The complete usage and user options are outlined on the GitHub wiki page https://github.com/healthyPlant/PhytoPipe/wiki.

Fig. 1

Flowchart describing processes and steps performed by the PhytoPipe workflow. The pipeline integrates reads preprocessing (A), reads classification (B), assembly-based annotation (C), and reference-based mapping (D) into a single workflow. The entire protocol can be run starting from raw sequence data in the form of single- or paired-end FASTQ files. The results are summarized in an HTML report file, tables, and Krona plots for an expert interpretation

Fig. 2

Example output from the PhytoPipe workflow. RNA-seq data from two apple samples (NCBI BioProject accession: PRJNA562540 [72]) processed by PhytoPipe show: A an HTML report showing results from different methods; B a MultiQC report for samples read quality; C a read mapping graph for a virus; D a Krona pie graph showing the taxonomic composition

Quality control

PhytoPipe can process sequence reads in the form of single- or paired-end FASTQ files and cleans raw sequence reads in four steps: (1) removing host ribosomal RNAs (rRNA) with bbduk against the SILVA eukaryote ribosomal 18S and 28S RNA database [51], as compiled by SortMeRNA [52]; (2) removing PCR duplicates with clumpify; (3) removing spike-in or positive controls (PhiX by default) with BBSplit; and (4) removing and trimming low-quality reads, bases, and adapter sequences with Trimmomatic [53] (Fig. 1A). The tools used in steps 1, 2, and 3 are implemented in the BBTools suite [54]. The raw and clean-read qualities for a single sample are visualized by FastQC [55] and for batch samples by MultiQC [56]. PhytoPipe reports read numbers at each cleaning step, so the user can use them to evaluate the wet-lab work, such as rRNA depletion efficiency.
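The per-step read counts can be turned into simple percentages for judging library quality. The following is a minimal sketch of that arithmetic, not PhytoPipe's own code; the function name, dictionary keys, and all counts are hypothetical.

```python
def qc_summary(raw_reads, rrna_reads, after_dedup, control_reads, after_trim):
    """Express the read counts from each cleaning step as a percentage of raw reads."""
    if raw_reads <= 0:
        raise ValueError("raw_reads must be positive")
    return {
        "rrna_pct": 100.0 * rrna_reads / raw_reads,            # rRNA depletion efficiency
        "control_pct": 100.0 * control_reads / raw_reads,      # spike-in/control share
        "dedup_retained_pct": 100.0 * after_dedup / raw_reads,
        "clean_retained_pct": 100.0 * after_trim / raw_reads,
    }

# Hypothetical counts for one library (not taken from the paper).
summary = qc_summary(
    raw_reads=20_000_000,
    rrna_reads=600_000,
    after_dedup=15_000_000,
    control_reads=50_000,
    after_trim=14_200_000,
)
print(summary)
```

In this invented example, rRNA accounts for 3% of the raw reads, which would suggest efficient depletion in the wet lab.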

Read classification

PhytoPipe uses Kraken2 [47] to query reads against the NCBI nt database for the nucleotide-level classification. PhytoPipe also relies on Kaiju [46] to assign reads to taxa using the NCBI taxonomy and a microbial non-redundant database (nr + euk) of bacterial, viral, fungal, and other microbial eukaryotic proteins (Fig. 1B). These k-mer-based approaches classify sequences based on the presence and frequency of specific k-mers in the database [46, 57, 58]. They can discover low titer viruses or phytoplasma, which could be missed by assembly-based methods [29]. Besides the text report generated by the tools, the sequence profile and the metagenomic classification are also interactively visualized by multi-layered Krona pie charts for a given sample [59].

Assembly-based annotation

Prior to read assembly, host reads are usually subtracted for pathogen detection. Instead of mapping reads to the genome to remove host reads, PhytoPipe extracts possible pathogen-derived reads, including classified pathogen-derived (bacteria, fungi, oomycetes, viruses, and viroids) and unclassified reads from the Kraken2 classification using the modified script “extract_kraken_reads.py” in KrakenTools [60] (Fig. 1C). Then these reads are assembled with either SPAdes [61] or Trinity [62] de novo assembler. Trinity is used as default due to its robustness and its ability to better perform when dealing with low titer viruses. Assemblies are then evaluated with QUAST [63] followed by contig (length ≥ 200 nucleotides) annotation at the nucleotide level using blastn [64] against NCBI nt database and at the protein level using Diamond blastx [65] against NCBI nr database. PhytoPipe allows users to obtain the pathogen information in the blast results that are combined with pathogen taxonomy along with HTS read count assigned by Kraken2 and Kaiju. Blast searches against NCBI databases can be time-consuming (several days) depending on the user’s computing environment and the volume of the HTS data. Hence, the user who is just interested in the virus discovery can either use their own database or other alternative virus databases such as NCBI viral reference genomes and Reference Viral Databases (RVDB) protein version [66]. The user’s databases and the analysis parameters can be easily set up in the config file. Finally, the users could identify possible novel viruses based on Diamond blastx results and the ICTV criteria field [67].
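The host-subtraction idea described above (keep reads classified to pathogen taxa plus unclassified reads) can be sketched as follows. This is an illustrative sketch only, not the modified KrakenTools script: it assumes Kraken2's default tab-separated per-read output (status, read ID, taxonomy ID, length, LCA mapping), matches taxonomy IDs exactly, and uses placeholder IDs and read names. A real extraction would also need to include descendant taxa of each pathogen group.

```python
def select_candidate_reads(kraken_lines, pathogen_taxids):
    """Return IDs of reads that are unclassified or assigned to a listed taxon."""
    wanted = {str(t) for t in pathogen_taxids}
    selected = set()
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        status, read_id, taxid = fields[0], fields[1], fields[2]
        # "U" marks an unclassified read; classified reads are kept only if
        # their taxonomy ID is in the pathogen set (host reads are dropped).
        if status == "U" or taxid in wanted:
            selected.add(read_id)
    return selected

# Placeholder taxonomy IDs and read names, for illustration only.
HOST_TAXID, VIRUS_TAXID = "1001", "2002"
example = [
    "C\tread_1\t1001\t150\t1001:116",
    "C\tread_2\t2002\t150\t2002:116",
    "U\tread_3\t0\t150\t0:116",
]
candidates = select_candidate_reads(example, [VIRUS_TAXID])
print(sorted(candidates))
```

Only the selected read IDs would then be passed to the assembler, which avoids the need for a complete host genome.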

Reference-based mapping

To further confirm virus discovery derived from HTS read classification and assembly-based annotation, viral reference genomes are collated before reference-based mapping by PhytoPipe (Fig. 1D). The clean reads are mapped to reference genomes by BWA-MEM [68]. The mapped read number and coverage are calculated by SAMtools [69] and a coverage graph is drawn using matplotlib [70] in Python. A consensus sequence is generated with BCFtools (including mpileup and consensus commands) [71] and is additionally annotated using blastn against the local NCBI nt database to filter non-pathogen sequences.
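The genome coverage and mean coverage reported for each reference can be derived from per-base depth values. The sketch below assumes three-column, tab-separated depth lines (reference, position, depth) that include every position of the reference, as produced for example by `samtools depth -a`; the function names and the depth values are hypothetical, and PhytoPipe's own calculation may differ.

```python
def parse_depth_lines(lines):
    """Parse 'reference<TAB>position<TAB>depth' lines into {reference: [depth, ...]}."""
    per_ref = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        ref, _pos, depth = line.split("\t")
        per_ref.setdefault(ref, []).append(int(depth))
    return per_ref

def coverage_stats(depths, min_depth=1):
    """Return (percent of positions with depth >= min_depth, mean depth)."""
    if not depths:
        return 0.0, 0.0
    covered = sum(1 for d in depths if d >= min_depth)
    return 100.0 * covered / len(depths), sum(depths) / len(depths)

# Invented depths for four positions of one reference.
depth_lines = [
    "NC_003611\t1\t0",
    "NC_003611\t2\t5",
    "NC_003611\t3\t7",
    "NC_003611\t4\t0",
]
for ref, depths in parse_depth_lines(depth_lines).items():
    pct_covered, mean_depth = coverage_stats(depths)
    print(ref, pct_covered, mean_depth)
```

If positions with zero depth are omitted from the input, the percentage must instead be computed against the known reference length.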

Results

The PhytoPipe output for each sample includes FastQC/MultiQC reports for HTS read quality assessment, Krona taxonomy pie charts for both Kraken2/Kaiju read classification and blastn/blastx results for contigs, and QUAST report for assembly evaluation. In addition, output also includes blastn/Diamond blastx search result tables, mapping statistics and coverage graphs for viruses and viroids, together with a summary report in HTML format (report.html). The final text report (report.txt) for viruses and viroids includes results from all samples, including the raw/clean-read length and count, read mapping information (reference names and related taxonomy, mapped read count, normalized read count (reads per kilobase of transcript per million mapped reads (RPKM)), percentage of mapped reads, percentage of viral genome covered, and mean coverage), NCBI blast results (blast E-value, blast identity, blast description), and the nucleotide sequence (contig or consensus sequence). A comprehensive sequence quality report includes a read quality table of raw read count, raw bases (Mbases) count, percentage of bases ≥ Q30, mean of raw read quality score, percentage of rRNA, read count after removing duplicates, spike-in/control read count, read count after trimming, and the number of possible pathogen-derived reads used for assembly.
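The normalized read count (RPKM) in the report follows the standard definition: reads mapped to a reference, per kilobase of reference length, per million mapped reads. A minimal sketch of that formula is given below; the function name and the numbers are hypothetical, and the choice of denominator (total mapped reads, as in the definition above) is an assumption about how the report computes it.

```python
def rpkm(reads_on_reference, reference_length_bp, total_mapped_reads):
    """Reads per kilobase of reference per million mapped reads."""
    if reference_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("reference length and total mapped reads must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return reads_on_reference * 1e9 / (reference_length_bp * total_mapped_reads)

# Invented example: 1,500 reads on a 7,500 nt genome in a 2 M mapped-read library.
print(rpkm(1_500, 7_500, 2_000_000))
```

Normalizing by both genome length and library size makes read support comparable between a short viroid and a long viral genome, and between libraries of different depth.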

To demonstrate the detection of microbes by PhytoPipe in real plant RNA-seq data, two apple sample datasets from the study by Wright et al. [72] (host: Malus domestica, NCBI BioProject accession: PRJNA562540) were analyzed. Table 1 shows the microbe detection results. One fungus (Aureobasidium pullulans EXF-150) was found in SRX6762507, which could have been derived from the environment (e.g., water or soil). Fungal viruses (also known as mycoviruses) were also found in this sample (not listed). Three species of bacteria (Actinoplanes friuliensis DSM 7358, Bradyrhizobium sp. 170, and Steroidobacter denitrificans) were found in SRX6762511, which could be soilborne. Multiple validated apple viruses and viroids were also found by PhytoPipe in both samples: apple chlorotic leaf spot virus (ACLSV), apple hammerhead viroid (AHVd), apple mosaic virus (ApMV), apple rubbery wood virus 2 (ARWV2), apple stem grooving virus (ASGV), and apple stem pitting virus (ASPV). PhytoPipe also found three additional ones: hop latent virus, hop latent viroid, and hop stunt viroid. Besides this output for all microbes, PhytoPipe produces a specific report for viruses and viroids (report.txt); the one validated virus missing from Table 1, apple green crinkle associated virus (AGCaV), appears in this viral report. Figure 2 shows examples of the PhytoPipe output from this analysis. The HTML report contains summary results from different tools (Fig. 2A), the MultiQC report shows read quality (Fig. 2B), the read mapping graph shows genome coverage for virus and viroid genomes (in this case hop latent viroid: NC_003611) (Fig. 2C), and a Krona pie chart shows the taxonomic composition of the sample (Fig. 2D). The Krona pie chart also offers an interactive view of different pathogens present in the sample. Details of these result files are provided on GitHub (https://github.com/healthyPlant/PhytoPipe/tree/main/doc/test_report.zip).

Table 1 Microbes and pathogens detected by PhytoPipe in samples SRX6762507 and SRX6762511

To compare PhytoPipe with other plant virus detection pipelines, nine datasets corresponding to nine virus detection challenges from the Plant Health Bioinformatics Network (PHBN) VIROMOCK (https://gitlab.com/ilvo/VIROMOCK-challenge) [73] were analyzed. Dataset_2 was excluded since it was designed for mutation detection. Table 2 shows the results of the virus and viroid detection using four different pipelines. PhytoPipe detected all expected viruses and viroids in the datasets and solved all the pre-determined challenges listed for these datasets. The pipelines Kodoja [29], Pathoscope [41], and Virtool [37] detected most of the known viruses at the species level but missed one to five viruses. Both Kodoja and Pathoscope failed to detect novel viruses. Virtool, on the other hand, could detect novel viruses and solve several challenges. The true positive rate (TPR), false negative rate (FNR), and false discovery rate (FDR) are calculated from the numbers of true positives (TP; detected expected viruses), false positives (FP; detected unexpected viruses), and false negatives (FN; missed expected viruses): TPR = TP/(TP + FN), FNR = FN/(TP + FN), and FDR = FP/(FP + TP). The TPRs for the four pipelines are 100% (PhytoPipe) > 91% (Pathoscope) > 74% (Virtool) > 52% (Kodoja); the FNRs are 0% (PhytoPipe) < 9% (Pathoscope) < 26% (Virtool) < 48% (Kodoja); and the FDRs are 39% (PhytoPipe) > 25% (Kodoja) > 22% (Pathoscope) > 19% (Virtool). PhytoPipe has the highest TPR and lowest FNR. It has the highest FDR since some of the identified viruses are not in the expected virus list. Examples include citrus blight-associated pararetrovirus, citrus endogenous pararetrovirus, and cherry virus A in dataset_1 (citrus sample); grapevine fleck virus, grapevine leafroll-associated virus 3, grapevine Kizil Sapak virus, and grapevine leafroll-associated virus 7 in dataset_3 (grapevine sample); and pistacia cryptic virus in dataset_9 (pistachio sample). To determine whether they are true or false positives, confirmatory experiments such as the use of a PCR-based detection method might be necessary.
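The three rates defined above can be computed directly from the TP, FP, and FN counts. The sketch below is a plain transcription of those formulas; the function name is hypothetical and the example counts are invented rather than taken from Table 2.

```python
def detection_rates(tp, fp, fn):
    """Return TPR, FNR, and FDR from true-positive, false-positive, and false-negative counts."""
    expected = tp + fn   # all viruses that should have been detected
    reported = tp + fp   # all viruses the pipeline reported
    nan = float("nan")
    return {
        "TPR": tp / expected if expected else nan,
        "FNR": fn / expected if expected else nan,
        "FDR": fp / reported if reported else nan,
    }

# Invented counts: 18 expected viruses found, 2 missed, 2 unexpected viruses reported.
rates = detection_rates(tp=18, fp=2, fn=2)
print({name: f"{value:.0%}" for name, value in rates.items()})
```

Because TPR and FNR share the denominator TP + FN, they always sum to 100%, whereas FDR depends on the number of unexpected detections and can be high even when no expected virus is missed.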

Table 2 Comparison of virus and viroid detection with different pipelines

To demonstrate the capabilities of PhytoPipe to detect bacteria, fungi, and oomycetes, twenty-two public RNA-seq datasets (twelve bacteria, eight fungi, and two oomycetes) from pathogen-infected plants in the study by Haegeman et al. in 2023 [15] were analyzed using PhytoPipe. The results in Table 3 show that PhytoPipe detected confirmed pathogens in 18 out of 22 (81.8%) samples. PhytoPipe was not able to detect bacteria in four samples due to the low abundance of the pathogen-derived reads (reads per million reads (rpm) < 10, as reported by Haegeman et al.).
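The abundance measure mentioned above, reads per million reads (rpm), is a simple normalization of the pathogen read count by library size. The following sketch shows the arithmetic and a check against the rpm < 10 level cited in the text; the function names and read counts are hypothetical.

```python
def reads_per_million(pathogen_reads, total_reads):
    """Normalize a pathogen read count to reads per million sequenced reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return pathogen_reads * 1_000_000 / total_reads

def is_low_abundance(pathogen_reads, total_reads, min_rpm=10.0):
    """True when the pathogen falls below the given rpm level."""
    return reads_per_million(pathogen_reads, total_reads) < min_rpm

# Invented example: 150 pathogen reads in a library of 30 M reads.
print(reads_per_million(150, 30_000_000), is_low_abundance(150, 30_000_000))
```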

Table 3 PhytoPipe analyses results for RNA-seq samples with confirmed pathogen infection

To further assess the ability of the pipeline to detect plant pathogens, twenty-four simulated RNA-seq datasets (12 with and 12 without pathogens) were generated using ART [74]. Twelve datasets comprising 10 or 20 M host reads from 12 crop genomes (apple, cassava, citrus, grapevine, maize, peanut, potato, rice, rose, soybean, sweet potato, wheat) were generated using ART and the subsample function of seqtk (https://github.com/lh3/seqtk). Another twelve spike-in datasets were composed of a host, two to three fungi/bacteria/oomycetes, and six to eight viruses/viroids for each crop in different quantities. The virus/viroid reads in the spike-in samples ranged from 30 to 35,250 with a coverage from 2 to 300X, and fungi/bacteria/oomycetes reads ranged from 877 to 2,185,250 with a coverage from 1 to 10X (Additional file 1: Table S1). As expected, the spiked pathogens did not appear in the results for 11 host datasets; the exception was citrus endogenous pararetrovirus in the citrus host. In addition, four unexpected viruses were found in the other three host datasets: rice tungro bacilliform virus and citrus exocortis viroid in rice, caulimovirus sp. in sweet potato, and begomovirus-associated DNA-III in cassava. These results show that the negative control sample has an important role in the analysis. PhytoPipe detected all 79 spiked viruses/viroids, with a high level of correlation between the simulated and observed reads (R² = 0.86; the majority were mapped reads, and classified reads were used for viruses without contigs) (Fig. 3A). Viruses missed by the assembly-based method were detected by the Kraken2 classification method. In the case of citrus endogenous pararetrovirus, the number of observed reads was double the number of spiked ones. All 13 bacteria/oomycetes pathogens and 13 out of 15 fungi were detected by PhytoPipe at the species level despite the low correlation (R² = 0.35) between the simulated and observed classified reads (Fig. 3B). The spiked reads classified by Kraken2 varied from less than 1% to 95% of the original ones. Eight pathogens (six fungi and two bacteria) had less than 10% of their reads classified at the species level, and one of the spiked fungi, Diplocarpon rosae, had just three reads classified to this species. These results showed that the coverage and the database are also key factors for pathogen detection by RNA-seq. Although viruses and viroids can be detected with fewer than 1000 reads due to their smaller genomes, fungi, bacteria, and oomycetes require a high titer because of their large genome sizes. In summary, PhytoPipe can detect spiked microbes in the simulated data with a high level of accuracy. For the virus detection, TPR, FNR and FDR of the pipeline are 100%, 0%, and 2.4%, respectively. Two unexpected viruses, rice tungro bacilliform virus and citrus exocortis viroid, were detected in both the rice and spike-in rice datasets and are counted as false positives. For the detection of bacteria, fungi, and oomycetes, if only the same species are treated as true positives, TPR, FNR and FDR of the pipeline are 97%, 14%, and 40%, respectively. FDR is high because the simulated genome coverage (1 to 10X) is comparatively low and many reads are classified into the same family or genus, not into the same species.
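The agreement between spiked and observed read counts can be summarized with a coefficient of determination. The sketch below computes R² as the squared Pearson correlation in plain Python; this is one common definition and an assumption here, since the text does not state how R² was calculated or whether counts were transformed first. The function name and all read counts are invented.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equally long sequences of numbers."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("R^2 is undefined when one variable is constant")
    return (sxy * sxy) / (sxx * syy)

# Invented spiked vs. observed read counts for five pathogens.
spiked = [30, 400, 2_500, 12_000, 35_250]
observed = [28, 390, 2_300, 9_800, 34_000]
print(round(r_squared(spiked, observed), 3))
```

A value near 1 indicates that observed counts track the spiked quantities closely, while a low value, as reported for bacteria and fungi, indicates that many spiked reads were not assigned to the expected species.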

Fig. 3

Comparison between spike-in and observed pathogen reads from simulated RNA-seq data. A 79 viruses/viroids from 12 crops. Observed reads were obtained by either mapping reads to a viral reference if viral contigs are annotated or are from Kraken2 classification. The high correlation between the spike-in and observed viral reads shows high detection ability of PhytoPipe. B 28 bacteria/fungi/oomycetes in 12 crops. Observed reads were obtained from Kraken2 classification

Discussion

Assessing whether plant material is pathogen-free is critical for regulatory and biosecurity purposes. In contrast to PCR and ELISA, which target a specific pathogen, RNA-seq can detect all potential pathogens in a plant sample. However, the results could be impacted by several variables, including the type of tissue used, the pathogen titer in the sample, the type of nucleic acid under analysis, the sequencing method employed, the type of analytical tools, and the reference databases used by each analytical tool. Virus detection pipelines can also vary greatly in their ability to detect known and novel viruses. For example, an assembly-based analysis could generate false-negative results when low titer viruses are present in the sample. This potentially high-risk scenario could be due in part to a low number of reads that does not allow the generation of sizeable contigs. In contrast, read classification-based methods can detect those viruses, although their read counts could fall below the reporting threshold. On the other hand, to speed up the detection process, many virus detection pipelines only use viral databases. This could potentially result in some host sequences being annotated as part of a virus genome by blast if the threshold is not strict, or in some host reads mapping to viruses in the reference-based mapping method if the tool parameters are not stringent. To address these limitations, we built PhytoPipe, which integrates read classification, assembly-based annotation, and reference-based mapping methods for the detection of known plant pathogens as well as novel viruses. The possible known viruses and viroids are identified by the overlap of the results from read classification at the nucleotide level by Kraken2 and contig blastn against the viral nucleotide database. The selected candidates are further filtered by the viral reference genome coverage from the reference-based mapping and their consensus sequence annotations from blastn against the NCBI nt database. Furthermore, PhytoPipe can identify possible new viruses by the overlap of the results between read classification at the protein level by Kaiju and contig Diamond blastx against the viral protein database. Moreover, PhytoPipe visualizes the sequence profile as Krona pie charts, which the users can use to determine the presence of any pathogens (Fig. 2D).

When a method is used for the analysis of HTS data, a threshold is set either by the user or by the developer. There is a trade-off between the true positive rate and the false positive rate for different thresholds. The more stringent the threshold is, the higher the number of false negatives could be, and vice versa. For example, if > 100 reads is used as a read number threshold for the Kraken2 read classification in the sample SRX6762507 (Table 1), three apple viroids (hop stunt viroid, hop latent viroid, and apple hammerhead viroid) could become false negatives. Furthermore, if > 60% viral genome coverage, which is defined as the percentage of a viral genome/segment covered by reads, is used to filter viruses in the PhytoPipe report of SRX6762507, three false positive viruses (apricot latent virus, turnip vein-clearing virus, and ribgrass mosaic virus) can be classified as not detected. Therefore, the use of the most appropriate thresholds to reduce false positives and false negatives is critical for an accurate diagnosis. PhytoPipe combines three methods to minimize false outcomes. If a virus is found by both the classification and assembly-based annotation methods, and its genome coverage (from the read mapping method) is above 15%, PhytoPipe reports this virus as a positive. This viral genome coverage threshold could be low and cause more false positives. Moreover, PhytoPipe has a user-defined pathogen file, named monitorPathogen.txt, which can be modified based on the user's pathogens of interest, such as the nationwide pest priorities. If these monitored pathogens are missing from the report, which would cause false negatives, they can be retrieved from the Kraken2 result, even with low read support (e.g., < 100 reads). However, these could have higher probabilities of being false positives. In this circumstance, the user’s knowledge about the pathogen is key to deciding whether there is a need for validation or not. Since the pathogen titer is often variable and depends on various factors, such as biotic, abiotic, and methodology challenges to the sample, it is difficult to establish a general threshold that can be applied to all pathogens. The PhytoPipe user can then easily filter the output for all the potential organisms present in the sample using the report files. For example, when looking into the virome of a sample, mapped reads, viral genome coverage, and blast E-value in the report (report.txt) can be used for filtering. For bacteria and fungi, classified read number, contig number, and longest contig size in the summary files (sample.blastnt.summary.txt and sample.blastnr.summary.txt) can be used. Users can determine whether a diagnosis is positive based on their expertise and by the inclusion of negative controls and taxonomy information from PhytoPipe. For further regulatory action, a wet-lab validation should be required for pathogens identified by HTS to minimize the risk of reporting false positives. Moreover, for plant pathogens that are detected with only a low number of reads, plant pathologists may also need to use alternative methods, such as PCR, for confirmatory diagnostics.
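The reporting rule described in this paragraph (agreement between classification and assembly, plus a genome coverage above 15%, with a fallback for monitored pathogens) can be expressed as a small decision function. This is a hedged sketch of the logic as stated in the text, not PhytoPipe's actual code; the function name, argument names, and return labels are hypothetical.

```python
def call_virus(in_classification, in_assembly, genome_coverage_pct,
               name=None, monitored=frozenset(), min_coverage_pct=15.0):
    """Classify one candidate virus as 'positive', 'monitored candidate', or 'not reported'."""
    # Main rule: both methods agree and the mapped-read genome coverage is high enough.
    if in_classification and in_assembly and genome_coverage_pct > min_coverage_pct:
        return "positive"
    # Fallback: a pathogen on the user's monitoring list that was seen by read
    # classification is kept for expert review, even with weak support.
    if name in monitored and in_classification:
        return "monitored candidate"
    return "not reported"

watch_list = frozenset({"hop stunt viroid"})  # stands in for entries in monitorPathogen.txt
print(call_virus(True, True, 62.0, name="apple stem pitting virus"))
print(call_virus(True, False, 3.0, name="hop stunt viroid", monitored=watch_list))
print(call_virus(True, True, 10.0, name="ribgrass mosaic virus"))
```

Raising `min_coverage_pct` trades false positives for false negatives, which is the threshold trade-off discussed above; candidates returned through the monitoring fallback are the ones most in need of wet-lab confirmation.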

A phytosanitary inspection of plant material not only discovers the presence of known pathogens in the sample but also ensures a thorough examination to conclude that the material is free of detectable pathogens. However, it is difficult to determine whether a plant material is clean without a negative control. An HTS report for a plant sample could include many organisms, such as the plant itself, insects, plant fungi or environmental fungi (e.g., from soil, water, or air), plant bacteria or environmental bacteria, fungal viruses (or mycoviruses), plant viruses, and insect viruses. The negative control datasets can efficiently help to filter out unrelated organisms. On the other hand, inaccurate read classifications (with a low read number) or inaccurate contig annotations (with a high blast E-value) also appear in the report, since similar sequences in the databases are used for classification or annotation, and the NCBI nt and nr databases are not curated and are frequently updated. With negative control datasets, reasonable thresholds can be set up and wrong annotations can be removed. Conversely, the absence of negative control datasets could result in a higher number of false positives or even a wrong report for a sample. Therefore, negative control datasets are needed to reduce errors in the analysis and generate a reliable result.

Ribosomal RNA removal is a key step during the library preparation process when using total RNA as the initial sample source for diagnostics. rRNA could take up 30–50% of the reads in a sequenced library with inefficient rRNA removal, whereas efficient removal can lead to libraries in which rRNA accounts for less than 5% of the reads. The PhytoPipe ribosomal RNA removal step uses the SILVA eukaryote ribosomal RNA (18S and 28S) database to evaluate the library prep and remove possible host rRNAs. To evaluate whether this step impacts microbe detection, two analyses of the 12 bacterial RNA-seq datasets from the study by Haegeman et al. [15] were done using Kraken2. One analysis was for raw reads using the rRNA database SILVA 138_1 SSU, while the other was for rRNA-removed reads using the NCBI nt database (PhytoPipe method). Our results showed that the rRNA removal step had no impact on bacterial pathogen detection (Additional file 2: Table S2). Surprisingly, more reads were classified using the PhytoPipe method as compared to the one without rRNA removal.

The available pipelines such as Kodoja [29], VirFind [23], VSD toolkit [24], VirusDetect [25], and Virtool [26] are limited to viral pathogen detection. To the best of our knowledge, PhytoPipe is the first pipeline that has the potential to detect possible microbial pathogens in a plant using RNA-seq data. Besides the wider scope of pathogen detection, PhytoPipe offers many unique features that other pipelines lack (Table 4). First, a k-mer-based read classification method in PhytoPipe can detect the viruses missed by assembly- and mapping-based methods. Second, PhytoPipe can remove and report the percentage of host ribosomal RNA reads in the sample that facilitates the quality assessment of the wet lab work. Third, PhytoPipe subtracts host reads using the k-mer method (based on the Kraken2 result) for all plants instead of host genome mapping for only plants having a complete host genome available. Fourth, PhytoPipe summarizes results from the comprehensive analysis as an HTML report (report.html) that helps users to determine the presence of possible pathogens in the sample. Lastly, the Snakemake workflow management system allows seamless integration and scaling of the pipelines to server, cluster, and cloud environments.

Table 4 Unique features of PhytoPipe in comparison with other pipelines

Despite our best effort to design a comprehensive phytopathogen detection method, PhytoPipe has a few limitations. First, the pipeline was mainly developed for RNA sequencing data generated using the Illumina sequencing platform. However, it can be extended to other platforms after adjusting for a different platform-related set of tools. Second, the pipeline uses several different databases, which take time, effort, and significant data storage space for initial construction and maintenance. Third, the pipeline can report low titer viruses that are supported by only a low number of reads. Such viruses could be false positives in the report, but this is a limitation found in all the pipelines investigated in this study. Fourth, the pipeline reports possible bacteria and fungi from classification and assembly-based annotation (see Table 1). The end-users of the results need to perform further validation of the results and use their knowledge of plant pathogens to make a regulatory decision. Lastly, only simulated datasets were used to evaluate PhytoPipe detection of all possible microbes in plant samples. This evaluation is still limited without large real datasets.

Conclusions

PhytoPipe is a reliable and robust bioinformatic framework for detecting plant pathogens (bacteria (including phytoplasma), fungi, oomycetes, viroids, and viruses (including novel ones)) using RNA-seq data. Pathogens are identified with HTS read classification and assembly-based annotation methods and further validated with reference-based mapping for viruses. PhytoPipe combines different tools and databases to verify the findings from various angles. Although PhytoPipe is uniquely designed for plant pathogen discovery, it can also be used for the detection of other organisms. A summary HTML file includes metagenomic information from HTS read classification, contig blast annotation, and reference-based mapping for downstream analysis and visualization. An organized running folder keeps detailed information for the user to explore the run information and results. PhytoPipe is implemented using Snakemake, which can take advantage of multicore CPUs in a local, cluster, or cloud environment. The PhytoPipe docker image can be used on a Linux, Mac, or Windows system.

The source code for PhytoPipe is distributed under a BSD-3 license and is freely available at https://github.com/healthyPlant/PhytoPipe. Software documentation available at https://github.com/healthyPlant/PhytoPipe/wiki describes the pipeline's installation, usage, and testing using the published RNA-seq data from NCBI SRA.